## Pipelining & Verilog - Division - Latency & Throughput - Pipelining to increase throughput - Verilog Math Functions - Simulations 6.111 Fall 2019 Lecture 9 ## Sequential Divider Assume the Dividend (A) and the divisor (B) have N bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single subtraction at a time and then cycle the circuit N times. This circuit works on unsigned operands; for signed operands one can remember the signs, make operands positive, then correct sign of result. Init: $P \leftarrow 0$ , load A and B Repeat N times { shift P/A left one bit temp = P-Bif (temp >= 0) $\{P\leftarrow temp, A_{LSB}\leftarrow 1\}$ else $A_{LSB} \leftarrow 0$ Done: Q in A, R in P #### Table 7-9: Supported Expressions | Expression | Symbol | Status | | |------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--| | Concatenation | 0 | Supported | | | Replication | (0) | Supported | | | Arithmetic | +, -, *,** | Supported | | | Division | I. | Supported only if the second<br>operand is a power of 2, or both<br>operands are constant. | | | Modulus | % | Supported only if second operand<br>is a power of 2. | | | Addition | + | Supported | | | Subtraction | | Supported | | | Multiplication | • | Supported | | | Power | Both with non- If the the system Vivas supp commit result error The (high | Supported: Both operands are constants, with the second operand being non-negative. If the first operand is a 2, then the second operand can be a variable. Vivado synthesis does not support the real data type. Any combination of operands that results in a real type causes an error. The values X (unknown) and Z (high impedance) are not allowed. | | | Relational | >, <, >=, <= | Supported | | | Logical Negation | ! | Supported | | | Logical AND | 8:8: | Supported | | https://www.xilinx.com/support/documentation/sw\_manuals/xilinx2019\_1/ug901-vivado-synthesis.pdf 6.111 Fall 2019 Lecture 9 #### Sequential Divider | P | Α | P-B | 7/3 0111/11 B=0011 | |------|--------------------|-----|-----------------------------------| | 0000 | 0111 | | Initial value | | 0000 | 1110 | | Shift | | 0000 | | -3 | Subtract | | 0000 | 1110 | | Restore, set A <sub>lsb</sub> = 0 | | 0001 | 1100 | | Shift | | 0001 | | -2 | Subtract | | 0001 | 1100 | | Restore, set A <sub>lsb</sub> = 0 | | 0011 | 1000 | | Shift | | 0011 | | 0 | Subtract | | 0000 | 100 <mark>1</mark> | | Subtact, set A <sub>Isb</sub> = 1 | | 0001 | 0010 | | Shift | | 0001 | | -2 | Subtract | | 0001 | 0010 | | Restore, set A <sub>lsb</sub> = 0 | | R | Q | | | ``` Init: P \leftarrow 0, load A and B Repeat N times { shift P/A left one bit temp = P-B if (temp >= 0) \{P\leftarrow temp, A_{LSB}\leftarrow 1\} else A_{LSB} \leftarrow 0 Done: Q in A, R in P ``` 6.111 Fall 2019 6.111 Fall 2019 Lecture 9 Lecture 9 #### Sequential Divider | P | Α | P-B | 0001/0000 | |------|------|-----|-----------------------------------| | 0000 | 0001 | | Initial value | | 0000 | 0010 | | Shift | | 0000 | | 0 | Subtract | | 0000 | 0011 | | Subtact, set A <sub>Isb</sub> = 1 | | 0000 | 0110 | | Shift | | 0000 | | 0 | Subtract | | 0000 | 0111 | | Subtact, set A <sub>Isb</sub> = 1 | | 0000 | 1110 | | Shift | | 0000 | | 0 | Subtract | | 0000 | 1111 | | Subtact, set A <sub>Isb</sub> = 1 | | 0000 | 1110 | | Shift | | 0000 | | 0 | Subtract | | 0000 | 1111 | | Subtact, set A <sub>lsb</sub> = 1 | | R | Q | | | ``` \begin{split} & \text{Init: } P \leftarrow 0 \,, \; \text{load A and B} \\ & \text{Repeat N times } \left\{ \\ & \; \text{shift P/A left one bit} \\ & \; \text{temp} = P - B \\ & \text{if (temp} >= 0) \\ & \; \left\{ P \leftarrow \text{temp}, \; A_{\text{LSB}} \leftarrow 1 \right\} \\ & \; \text{else } A_{\text{LSB}} \leftarrow 0 \\ & \left\} \\ & \text{Done: } Q \; \text{in A, R in P} \\ \end{split} ``` 6.111 Fall 2019 Lecture 9 #### Verilog divider.v ``` // The divider module divides one number by another. It always @( posedge clk ) begin // produces a signal named "ready" when the quotient output del_ready <= !bit; // is ready, and takes a signal named "start" to indicate if( start ) begin // the the input dividend and divider is ready. // sign -- 0 for unsigned 1 for two complement bit = WIDTH; quotient = 0: // It uses a simple restoring divide algorithm. quotient temp = 0: \label{eq:dividend_copy} \mbox{dividend_copy} = (! sign \ | \ | \ ! \mbox{dividend[WIDTH-1]}) \ ? // http://en.wikipedia.org/wiki/Division_(digital)#Restoring_division {1'b0.zeros.dividend} module divider #(parameter WIDTH = 8) {1'b0.zeros.~dividend + 1'b1}: (input clk, sign, start, divider_copy = (!sign | | !divider[WIDTH-1]) ? input [WIDTH-1:0] dividend, {1'b0,divider,zeros}: input [WIDTH-1:0] divider. {1'b0,~divider + 1'b1,zeros}; output reg [WIDTH-1:0] quotient, output [WIDTH-1:0] remainder; negative_output = sign && output ready); ((divider[WIDTH-1] && !dividend[WIDTH-1]) | | (!divider[WIDTH-1] && dividend[WIDTH-1])); reg [WIDTH-1:0] quotient_temp; reg [WIDTH*2-1:0] dividend_copy, divider_copy, diff; else if (bit > 0) begin diff = dividend_copy - divider_copy; reg negative_output; quotient temp = quotient temp << 1: wire [WIDTH-1:0] remainder = (!negative_output) ? if(!diff[WIDTH*2-1]) begin dividend_copy[WIDTH-1:0] : ~dividend_copy[WIDTH-1:0] + 1'b1; dividend_copy = diff; quotient_temp[0] = 1'd1 reg del_ready = 1; quotient = (!negative_output) ? wire ready = (!bit) & ~del_ready; quotient_temp: ~quotient_temp + 1'b1; wire [WIDTH-2:0] zeros = 0; divider_copy = divider_copy >> 1; bit = bit - 1'b1; initial negative_output = 0; endmodule ``` L. Williams MIT '13 6.111 Fall 2019 Lecture 9 6 ## Math & Other Functions in IP Catalog Wide selection of math functions available #### **Divider Generator** https://www.xilinx.com/support/documentation/ip\_documentation/div\_gen/v5\_1/pg151-div-gen.pdf 6.111 Fall 2019 Lecture 9 7 6.111 Fall 2019 Lecture 9 #### **IP Catalog Divider** #### Coregen Divider ## **Performance Metrics for Circuits** Circuit Latency (L): time between arrival of new input and generation of corresponding output. For combinational circuits this is just t<sub>PD</sub>. Circuit Throughput (T): Rate at which new outputs appear. For combinational circuits this is just $1/t_{PD}$ or 1/L. ## Coregen Divider Latency Latency dependent on dividend width + fractioanl reminder width Table 2-1: Latency of Radix-2 Solution Based on Divider Parameters | Signed | Fractional | Clocks Per Division | Fully Pipelined Latency <sup>(1)</sup> | |--------|------------|---------------------|----------------------------------------| | FALSE | FALSE | 1 | M+A+2 | | FALSE | FALSE | >1 | M+A+3 | | FALSE | TRUE | 1 | M+F+A+2 | | FALSE | TRUE | >1 | M+F+A+3 | | TRUE | FALSE | 1 | M+A+4 | | TRUE | FALSE | >1 | M+A+5 | | TRUE | TRUE | 1 | M+F+A+4 | | TRUE | TRUE | >1 | M+F+A+5 | 12 #### Notes: 1. M = Dividend and Quotient Width, F = Fractional Width, A = total Latency of AXI interfaces. #### Performance of Combinational Circuits For combinational logic: $$L = t_{PD},$$ $$T = 1/t_{PD}$$ We can't get the answer faster, but are we making effective use of our hardware at all times? F & G are "idle", just holding their outputs stable while H performs its computation 6.111 Fall 2019 Lecture 9 13 ## Retiming: A very useful transform #### Retiming is the action of moving registers around in the system Registers have to be moved from ALL inputs to ALL outputs or vice versa Cutset retiming: A cutset intersects the edges, such that this would result in two disjoint partitions of the edges being cut. To retime, delays are moved from the ingoing to the outgoing edges or vice versa. # Retiting Synchronous Chronity Charte E Gomes and James Stone Stage (S. 198). #### Benefits of retiming: - · Modify critical path delay - Reduce total number of registers 6.111 Fall 2019 Lecture 9 14 ## Retiming Combinational Circuits aka "Pipelining" ## Pipeline diagrams The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle. #### **Pipeline Conventions** #### **DEFINITION:** a K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output. a COMBINATIONAL CIRCUIT is thus an 0-stage pipeline. #### CONVENTION: Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input). #### ALWAYS: The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register $t_{\text{PD}}$ PLUS (output) register $t_{\text{SETUP}}$ . The LATENCY of a K-pipeline is K times the period of the clock common to all registers. The THROUGHPUT of a K-pipeline is the frequency of the clock. 6.111 Fall 2019 Lecture 9 #### III-formed pipelines Consider a BAD job of pipelining: For what value of K is the following circuit a K-Pipeline? none #### Problem: Successive inputs get mixed: e.g., $B(A(X_{i+1}), Y_i)$ . This happened because some paths from inputs to outputs have 2 registers, and some have only 1! This CAN'T HAPPEN on a well-formed K pipeline! 6.111 Fall 2019 Lecture 9 18 ## A pipelining methodology #### Step 1: Add a register on each output. #### Step 2: Add another register on each output. Draw a cut-set contour that includes all the new registers and some part of the circuit. Retime by moving regs from all outputs to all inputs of cut-set. Repeat until satisfied with T. #### STRATEGY: Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS). ### Pipeline Example | | LATENCY | THROUGHPUT | |---------|---------|------------| | 0-pipe: | 4 | 1/4 | | 1-pipe: | 4 | 1/4 | | 2-pipe: | 4 | 1/2 | | 3-pipe: | 6 | 1/2 | #### OBSERVATIONS: - 1-pipeline improves neither L or T. - T improved by breaking long combinational paths, allowing faster clock. - Too many stages cost L, don't improve T. - Back-to-back registers are often required to keep pipeline well-formed. 6.111 Fall 2019 Lecture 9 19 6.111 Fall 2019 Lecture 9 #### Pipeline Example - Verilog assign y = G(x); end ``` Lab 3 Pong ``` - G = game logic 8ns tpd - C = draw fancy object puck, lots of multiplies with 9ns tpd - System clock 65mhz = 15ns period – opps reg [N:0] x,y; reg [23:0] pixel ``` assign pixel = C(y) // logic for pixel always @ * begin y=G(x); pixel = C(y); end ``` // logic for y # Latency = 2 clock cyles! Implications? ``` always @(posedge clock) begin ... y2 <= G(x); // pipeline y pixel <= C(y2) // pipeline pixel</pre> ``` 6.111 Fall 2019 Lecture 9 ## Pipeline Example – Lab 3 ``` // calculate rom address and read the location assign image_addr = (hcount_in-x_in) + (vcount_in-y_in) * WIDTH; image_rom rom1(.clka(pixel_clk_in), .addra(image_addr), .douta(image_bits)); red_coe rcm (.clka(pixel_clk_in), .addra(image_bits), .douta(red_mapped)); always @ (posedge pixel_clk) begin if ((hcount_in >= x && hcount_in < (x_in+WIDTH)) &&. (vcount_in >= y_in && vcount_in < (y_in+HEIGHT))) pixel_out <= {red_mapped[7:4], red_mapped[7:4], red_mapped[7:4]}; // greyscale else pixel_out <= 0; end ``` 6.111 Fall 2019 Lecture 9 22 ## Pipeline Example – Lab 3 // calculate rom address and read the location assign image\_addr = (hcount\_in-x\_in) + (vcount\_in-y\_in) \* WIDTH; image\_rom rom1(.clka(pixel\_clk\_in), .addra(image\_addr), .douta(image\_bits)); red\_coe rcm (.clka(pixel\_clk\_in), .addra(image\_bits), .douta(red\_mapped)); always @ (posedge pixel\_clk) begin if ((hcount\_in >= x && hcount\_in < (x\_in+WIDTH)) &&. (vcount\_in >= y\_in && vcount\_in < (y\_in+HEIGHT))) pixel\_out <= {red\_mapped[7:4], red\_mapped[7:4], red\_mapped[7:4]}; // greyscale else pixel\_out <= 0; end 6.111 Fall 2019 Lecture 9 23 6.111 Fall 2019 Lecture 9 24 #### Increasing Throughput: Pipelining 6.111 Fall 2019 Reconfigurable Logic How about $t_{PD} = 1/2t_{PD,FA}$ ? → = register 6.111 Fall 2019 ## History of Computational Fabrics - Discrete devices: relays, transistors (1940s-50s) - Discrete logic gates (1950s-60s) - Integrated circuits (1960s-70s) - □ e.g. TTL packages: Data Book for 100's of different parts - Gate Arrays (IBM 1970s) - ☐ Transistors are pre-placed on the chip & Place and Route software puts the chip together automatically - only program the interconnect (mask programming) - Software Based Schemes (1970's- present) - □ Run instructions on a general purpose core - Programmable Logic (1980's to present) - □ A chip that be reprogrammed after it has been fabricated - □ Examples: PALs, EPROM, EEPROM, PLDs, FPGAs - Excellent support for mapping from Verilog - ASIC Design (1980's to present) - ☐ Turn Verilog directly into layout using a library of standard cells - ☐ Effective for high-volume and efficient use of silicon area #### Lecture 9 FA 🖙 ## Programmable Array Logic (PAL) - Based on the fact that any combinational logic can be realized as a sum-of-products - PALs feature an array of AND-OR gates with programmable interconnect 6.111 Fall 2019 Lecture 9 29 ## RAM Based Field Programmable Logic - FPGA ### FPGA RAM based Interconnect Lecture 9 Figure 28: Single- and Double-Length Lines, with Programmable Switch Matrices (PSMs) 6.111 Fall 2019 ## Xilinx Interconnect Details 31 6.111 Fall 2019 Lecture 9 ### Design Flow - Mapping - Technology Mapping: Schematic/HDL to Physical Logic units - Compile functions into basic LUT-based groups (function of target architecture) 6.111 Fall 2019 Lecture 9 33 ## Design Flow - Placement & Route • Placement – assign logic location on a particular device Routing – iterative process to connect CLB inputs/outputs and IOBs. Optimizes critical path delay – can take hours or days for large, dense designs Challenge! Cannot use full chip for reasonable speeds (wires are not ideal). Typically no more than 50% utilization. 6.111 Fall 2019 Lecture 9 34 #### Simulation - Five Options 6.111 Fall 2019 Lecture 9 ### Example: Verilog to FPGA How are FPGAs Used? #### Logic Emulation FPGA-based Emulator (courtesy of IKOS) #### Prototyping - Ensemble of gate arrays used to emulate a circuit to be manufactured - ☐ Get more/better/faster debugging done than with simulation #### Reconfigurable hardware - One hardware block used to implement more than one function - Special-purpose computation engines - □ Hardware dedicated to solving one problem (or class of problems) - □ Accelerators attached to general-purpose computers (e.g., in a cell phone!) 6.111 Fall 2019 Lecture 9 38 ## Summary Lecture 9 FPGA provide a flexible platform for implementing digital computing 6.111 Fall 2019 - A rich set of macros and I/Os supported (multipliers, block RAMS, ROMS, high-speed I/O) - A wide range of applications from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing - Interconnects are a major bottleneck (physical design and locality are important considerations) ### Dashboard ## Loading Nexys4 Flash - 1. Format a flash drive to have 1 fat32 partition - 2. In vivado, click generate bitstream and afterwards do file->Export->Export\_Bitstream\_File to flash top-level directory - 3. On the nexys 4, switch jumper JP1 to be on the USB/SD mode - 4. Plug the usb stick into the nexys 4 while it's off and then power on. A yellow LED will flash while the bitstream is being loaded. When it's done, the green DONE led will turn on - 5. You can remove the usb drive after your code is running #### Test Bench ``` module sample_tb; module sample( // Inputs __input clk, logic clk; input data_in, logic data_in; output [7:0] data_out // Outputs wire [7:0] data_out; // Verilog // Instantiate the Unit Under Test (UUT) endmodule sample uut ( .clk(clk), .data_in(data_in), .data_out(data_out) always #5 clk = !clk; // create a clock inputs must be initialized initial begin // Initialize Inputs clk = 0; data_in = 0; // Wait 100 ns for global reset to finish #100; // Add stimulus here end ``` 6.111 Fall 2019 Lecture 9 41 6.111 Fall 2019 Lecture 9 42