

### Pipelining & Verilog

- 6.UAP?
- Division
- Latency & Throughput
- Pipelining to increase throughput
- Retiming
- Verilog Math Functions

6 111 Fall 2017 Lecture 9

### Cyclic redundancy check - CRC

$$CRC16 (x16 + x15 + x2 + 1)$$



- Each "r" is a register, all clocked with a common clock. Common clock not shown
- As shown, for register r15, the output is r[15] and the input is the sum of r[14], r[15] and data input x16, etc
- The small round circles with the plus sign are adders implemented with XOR gates.
- Initialize r to 16'hFFFF at start

#### CRC Solution CRC16: x16+x15+x2+1 module lpset6( input clock, input start. input data, output done, output reg [15:0] r r[14] + r[15] + x16 parameter IDLE=0; parameter CRC CALC=1;

```
reg state=0:
reg [5:0] counter=0; //my counter
always @ (posedge clock) begin
   case (state)
    IDLE: begin
     state <= (start) ? CRC CALC : IDLE:
     r <= 16'hFFFF; //start reset FSM
     counter <= 47;
    CRC CALC: begin
      r[15] <= r[14] + r[15] + x16;
```

state <= (counter == 1)? IDLE:CRC CALC;

wire x16 = data:

endcase

end

r[14:3] <= r[13:2];

 $r[0] \ll r[15] + x16;$ 

counter <= counter - 1;

assign done = (counter == 0);

r[2] <= r[1] + r[15] + x16; r[1] <= r[0];



### Sequential Divider

Assume the Dividend (A) and the divisor (B) have N bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single subtraction at a time and then cycle the circuit N times. This circuit works on unsigned operands; for signed operands one can remember the signs, make operands positive, then correct sign of result.



```
Init: P \leftarrow 0, load A and B
Repeat N times
    shift P/A left one bit
    temp = P-B
    if (temp > 0)
       {P\leftarrow temp, A_{LSB}\leftarrow 1}
    else A_{LSB} \leftarrow 0
Done: Q in A, R in P
```

#### Sequential Divider

| P    | A    | 7/3 0111/11                                 |
|------|------|---------------------------------------------|
| 0000 | 0111 | Initial value                               |
| 0000 | 1110 | Shift                                       |
| 1101 |      | Use twos complement of 0011 for subtraction |
| 1101 |      | Subtract                                    |
| 0000 | 1110 | Restore, set A <sub>isb</sub> = 0           |
| 001  | 1100 | Shift                                       |
| 101  |      |                                             |
| 101  |      | Subtract                                    |
| 0001 | 1100 | Restore, set A <sub>isb</sub> = 0           |
| 0011 | 1000 | Shift                                       |
| 1101 |      |                                             |
| 0000 | 1001 | Subtact, set A <sub>isb</sub> = 1           |
| 0001 | 0010 | Shift                                       |
| 1101 |      |                                             |
| 110  |      | Subtract                                    |
| 0001 | 0010 | Restore, set A <sub>lsh</sub> = 0           |



```
\begin{split} & \text{Init: } P \leftarrow 0 \,, \; \text{load A and B} \\ & \text{Repeat N times } \left\{ \\ & \; \text{shift P/A left one bit} \\ & \; \text{temp} = P - B \\ & \; \text{if (temp > 0)} \\ & \; \left\{ P \leftarrow \text{temp, A}_{LSB} \leftarrow 1 \right\} \\ & \; \text{else A}_{LSB} \leftarrow 0 \\ & \left\} \\ & \text{Done: Q in A, R in P} \\ \end{split}
```

6.111 Fall 2017 Lecture 9

#### Verilog divider.v

```
// The divider module divides one number by another. It
                                                                                always @( posedge clk ) begin
// produces a signal named "ready" when the quotient output
                                                                                  del_ready <= !bit;
// is ready, and takes a signal named "start" to indicate
                                                                                   if( start ) begin
// the the input dividend and divider is ready.
                                                                                    bit = WIDTH;
// sign -- 0 for unsigned 1 for two complement
                                                                                    quotient = 0:
// It uses a simple restoring divide algorithm.
                                                                                    quotient temp = 0:
                                                                                    \label{eq:dividend_copy} \mbox{dividend_copy} = (! sign \ | \ | \ ! \mbox{dividend[WIDTH-1]}) \ ?
// http://en.wikipedia.org/wiki/Division_(digital)#Restoring_division
                                                                                              {1'b0.zeros.dividend}
                                                                                              {1'b0,zeros,~dividend + 1'b1};
module divider #(parameter WIDTH = 8)
(input clk, sign, start,
                                                                                    divider_copy = (!sign | | !divider[WIDTH-1]) ?
 input [WIDTH-1:0] dividend,
                                                                                                         {1'b0,divider,zeros}:
 input [WIDTH-1:0] divider,
                                                                                                         {1'b0,~divider + 1'b1,zeros};
 output reg [WIDTH-1:0] quotient,
 output [WIDTH-1:0] remainder;
                                                                                    negative_output = sign &&
 output ready);
                                                                                              ((divider[WIDTH-1] && !dividend[WIDTH-1])
                                                                                               | | (!divider[WIDTH-1] && dividend[WIDTH-1]));
 reg [WIDTH-1:0] quotient_temp;
 reg [WIDTH*2-1:0] dividend_copy, divider_copy, diff;
                                                                                   else if (bit > 0) begin
                                                                                   diff = dividend_copy - divider_copy;
 reg negative_output;
                                                                                    quotient temp = quotient temp << 1:
 wire [WIDTH-1:0] remainder = (!negative_output) ?
                                                                                    if(!diff[WIDTH*2-1]) begin
       dividend_copy[WIDTH-1:0] : ~dividend_copy[WIDTH-1:0] + 1'b1;
                                                                                      dividend_copy = diff;
                                                                                      quotient\_temp[0] = 1'd1;
 reg del_ready = 1;
                                                                                    quotient = (!negative_output) ?
 wire ready = (!bit) & ~del_ready;
                                                                                           quotient_temp :
                                                                                           ~quotient_temp + 1'b1;
 wire [WIDTH-2:0] zeros = 0;
                                                                                    divider_copy = divider_copy >> 1;
                                                                                    bit = bit - 1'b1;
 initial negative_output = 0;
                                                                               endmodule
```

L. Williams MIT '13

6.111 Fall 2017 Lecture 9 6

### Math Functions in Coregen



### Coregen Divider



#### Coregen Divider



#### Performance Metrics for Circuits

Circuit Latency (L): time between arrival of new input and generation

of corresponding output.

For combinational circuits this is just tpD.

Circuit Throughput (T): Rate at which new outputs appear.

For combinational circuits this is just  $1/t_{PD}$  or 1/L.

6.111 Fall 2017 10 Lecture 9

## Coregen Divider Latency



Figure 2: Latency Example (Clocks per Division = 4)

Table 4: Latency of Fixed-point Solution Based on Divider Parameters

| Signed | Fractional | Clks/Div | Latency |
|--------|------------|----------|---------|
| False  | False      | 1        | M+2     |
| False  | False      | >1       | M+3     |
| False  | True       | 1        | M+F+2   |
| False  | True       | >1       | M+F+3   |
| True   | False      | 1        | M+4     |
| True   | False      | >1       | M+5     |
| True   | True       | 1        | M+F+4   |
| True   | True       | >1       | M+F+5   |
|        |            |          |         |

Latency dependent on dividend width + fractioanl reminder width

Note: M=dividend width, F=fractional remainder width.

The divclk\_sel parameter allows a range of choices of throughput versus area. With divclk\_sel = 1, the core is fully pipelined, so it will have maximal throughput of one division per clock cycle, but will occupy the most area. The divclk\_sel selections of 2, 4 and 8 reduce the throughput by those respective factors for smaller core sizes.

#### Performance of Combinational Circuits



For combinational logic:

$$L = t_{PD},$$

$$T = 1/t_{PD}.$$

We can't get the answer faster, but are we making effective use of our hardware at all times?



F & G are "idle", just holding their outputs stable while H performs its computation

6,111 Fall 2017 Lecture 9 11 6,111 Fall 2017 12 Lecture 9

#### Retiming: A very useful transform

#### Retiming is the action of moving registers around in the system

Registers have to be moved from ALL inputs to ALL outputs or vice versa



Cutset retiming: A cutset intersects the edges, such that this would result in two disjoint partitions of the edges being cut. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.



#### Benefits of retiming:

- Modify critical path delay
- Reduce total number of registers

6.111 Fall 2017 Lecture 9

# Retiming Combinational Circuits aka "Pipelining"



6.111 Fall 2017 Lecture 9 14

#### Pipeline diagrams



The results associated with a particular set of input data moves *diagonally* through the diagram, progressing through one pipeline stage each clock cycle.

### Pipeline Conventions

#### **DEFINITION:**

a K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.

a COMBINATIONAL CIRCUIT is thus an 0-stage pipeline.

#### CONVENTION:

Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).

#### ALWAYS:

The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register  $t_{PD}$  PLUS (output) register  $t_{SETUP}$ .

The LATENCY of a K-pipeline is K times the period of the clock common to all registers.

The THROUGHPUT of a K-pipeline is the frequency of the clock.

16

6.111 Fall 2017 Lecture 9 15 6.111 Fall 2017 Lecture 9

#### Ill-formed pipelines

Consider a BAD job of pipelining:



For what value of K is the following circuit a K-Pipeline? \_\_\_\_\_ none

#### Problem:

Successive inputs get mixed: e.g.,  $B(A(X_{i+1}), Y_i)$ . This happened because some paths from inputs to outputs have 2 registers, and some have only 1!

This CAN'T HAPPEN on a well-formed K pipeline!

6.111 Fall 2017 Lecture 9 17

#### A pipelining methodology

#### Step 1:

Add a register on each output.

#### Step 2:

Add another register on each output. Draw a cut-set contour that includes all the new registers and some part of the circuit. Retime by moving regs from all outputs to all inputs of cut-set.

Repeat until satisfied with T.

#### STRATEGY:

Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).



6.111 Fall 2017 18 Lecture 9

### Pipeline Example



|         | LATENCY | THROUGHPU |  |
|---------|---------|-----------|--|
| 0-pipe: | 4       | 1/4       |  |
| 1-pipe: | 4       | 1/4       |  |
| 2-pipe: | 4       | 1/2       |  |
| 3-pipe: | 6       | 1/2       |  |

#### **OBSERVATIONS:**

- 1-pipeline improves neither L or T.
- T improved by breaking long combinational paths, allowing faster clock.
- Too many stages cost L. don't improve T.
- Back-to-back registers are often required to keep pipeline wellformed.

### Pipeline Example - Verilog

Lecture 9



etc No pipeline

6,111 Fall 2017

assign y = G(x);// logic for y assign pixel = C(y) // logic for pixel

Pipeline always @(posedge clock) begin // pipeline y  $y2 \ll G(x);$ pixel <= C(y2)// pipeline pixel end

Lab 3 Pong

- G = game logic 8ns tpd
- C = draw round puck, use multiply with 9ns tpd
- · System clock 65mhz = 15ns period - opps

reg [N:0] x,y; reg [23:0] pixel always @ \* begin y=G(x);pixel = C(y);end

Latency = 2 clock cyles! Implications?

20

### Increasing Throughput: Pipelining



6.111 Fall 2017 Lecture 9 21

### How about $t_{PD} = 1/2t_{PD,FA}$ ?



6.111 Fall 2017 Lecture 9 22

### Timing Reports



### History of Computational Fabrics

- Discrete devices: relays, transistors (1940s-50s)
- Discrete logic gates (1950s-60s)
- Integrated circuits (1960s-70s)
  - □ e.g. TTL packages: Data Book for 100's of different parts
- Gate Arrays (IBM 1970s)
  - Transistors are pre-placed on the chip & Place and Route software puts the chip together automatically – only program the interconnect (mask programming)
- Software Based Schemes (1970's- present)
  - Run instructions on a general purpose core
- Programmable Logic (1980's to present)
  - ☐ A chip that be reprogrammed after it has been fabricated
  - ☐ Examples: PALs, EPROM, EEPROM, PLDs, FPGAs
  - □ Excellent support for mapping from Verilog
- ASIC Design (1980's to present)
  - ☐ Turn Verilog directly into layout using a library of standard cells
  - ☐ Effective for high-volume and efficient use of silicon area

6.111 Fall 2017 Lecture 9 23 6.111 Fall 2017 Lecture 9 24

#### Reconfigurable Logic



#### Programmable Array Logic (PAL)

- Based on the fact that any combinational logic can be realized as a sum-of-products
- PALs feature an array of AND-OR gates with programmable interconnect



6.111 Fall 2017 Lecture 9 26

# RAM Based Field Programmable Logic - Xilinx



Lecture 9

27

6,111 Fall 2017

### LUT Mapping

- N-LUT direct implementation of a truth table: any function of n-inputs.
- $\bullet$  N-LUT requires  $2^N$  storage elements (latches)
- N-inputs select one latch location (like a memory)



28

6.111 Fall 2017 Lecture 9

### Configuring the CLB as a RAM



#### Xilinx 4000 Interconnect



Figure 28: Single- and Double-Length Lines, with Programmable Switch Matrices (PSMs)

6.111 Fall 2017 Lecture 9

#### Xilinx 4000 Interconnect Details

6.111 Fall 2017



31

#### Add Bells & Whistles



Courtesy of David B. Parlour, ISSCC 2004 Tutorial, "The Reality and Promise of Reconfigurable Computing in Digital Signal Processing"

6.111 Fall 2017 Lecture 9 32

### The Virtex II CLB (Half Slice Shown)



6.111 Fall 2017 Lecture 9 33

### Adder Implementation



6.111 Fall 2017 Lecture 9 34

#### FPGA's



|              | CLB     | Dist RAM   | Block RAM   | Multipliers   |
|--------------|---------|------------|-------------|---------------|
| Virtex 2     | 8,448   | 1,056 kbit | 2,592 kbit  | 144 (18x18)   |
| Virtex 6     | 667,000 | 6,200 kbit | 22,752 kbit | 1,344 (25x18) |
| Spartan 3E   | 240     | 15 kbit    | 72 kbit     | 4 (18x18)     |
| Artix-7 A100 | 7,925   | 1,188 kbit | 4,860 kbit  | 240 (25x18)   |

## Design Flow - Mapping

- Technology Mapping: Schematic/HDL to Physical Logic units
- Compile functions into basic LUT-based groups (function of target architecture)



36

```
always @(posedge clock or negedge reset)
begin
if (! reset)
    q <= 0;
else
    q <= (a&b&c)||(b&d);
end</pre>
```

6.111 Fall 2017 Lecture 9 35 6.111 Fall 2017 Lecture 9

#### Design Flow - Placement & Route

• Placement - assign logic location on a particular device



 Routing – iterative process to connect CLB inputs/outputs and IOBs. Optimizes critical path delay – can take hours or days for large, dense designs



Challenge! Cannot use full chip for reasonable speeds (wires are not ideal).

Typically no more than 50% utilization.

6.111 Fall 2017 Lecture 9

#### Example: Verilog to FPGA



6.111 Fall 2017 Lecture 9 38

#### How are FPGAs Used?

#### **Logic Emulation**





FPGA-based Emulator (courtesy of IKOS)

- Prototyping
  - □ Ensemble of gate arrays used to emulate a circuit to be manufactured
  - □ Get more/better/faster debugging done than with simulation
- Reconfigurable hardware
  - One hardware block used to implement more than one function
- Special-purpose computation engines
  - □ Hardware dedicated to solving one problem (or class of problems)
  - Accelerators attached to general-purpose computers (e.g., in a cell phone!)

#### Summary

- FPGA provide a flexible platform for implementing digital computing
- A rich set of macros and I/Os supported (multipliers, block RAMS, ROMS, high-speed I/O)
- A wide range of applications from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing
- Interconnects are a major bottleneck (physical design and locality are important considerations)

6.111 Fall 2017 Lecture 9 39 6.111 Fall 2017 Lecture 9

37

#### Nexys 4 - DDR



6.111 Fall 2017 Lecture 7 41

### Nexy4 Input Output

```
module labkit(
   input CLK100MHZ,
   input[15:0] SW,
   input BTNC, BTNU, BTNL, BTNR, BTND,
   output[3:0] VGA R,
   output[3:0] VGA_B,
   output[3:0] VGA G,
   output[7:0] JA,
   output VGA_HS,
   output VGA_VS,
   output LED16_B, LED16_G, LED16_R,
   output LED17_B, LED17_G, LED17_R,
   output[15:0] LED,
   output[7:0] SEG, // segments A-G (0-6), DP (7)
   output[7:0] AN
                    // Display 0-7
   );
   assign data = \{28 \text{ 'h0123456}, SW[3:0]\}; // \text{ display 0123456} + SW
```

6.111 Fall 2017 Lecture 9 42

#### XDC File

- $\label{local_set_property} $$ dict { PACKAGE_PIN N17 | IOSTANDARD LVCMOS33 } [get_ports { BTNC }]; $$ \#IO_L9P_T1_DQS_14 Sch=btnc $$$
- $set\_property dict \{ PACKAGE\_PIN M18 \quad IOSTANDARD \ LVCMOS33 \} [get\_ports \{ BTNU \}]; \\ \#IO\_L4N\_T0\_D05\_14 \ Sch=btnu \\$
- $\label{eq:set_property} $$ \sec_property dict { PACKAGE_PIN P17 IOSTANDARD LVCMOS33 } [get_ports { BTNL }]; $$ \#IO_L12P_T1_MRCC_14 Sch=btnl $$$
- $set\_property dict \{ PACKAGE\_PIN M17 \quad IOSTANDARD \ LVCMOS33 \} [get\_ports \{ BTNR \}]; \\ \#IO\_L10N\_T1\_D15\_14 \ Sch=btnr$
- $set\_property dict \{ PACKAGE\_PIN P18 \quad IOSTANDARD \ LVCMOS33 \} [get\_ports \{ BTND \}]; \\ \#IO\_L9N\_T1\_DQS\_D13\_14 \ Sch=btnd$

#### Dashboard



6.111 Fall 2017 Lecture 9 43 6.111 Fall 2017 Lecture 9

#### Loading Nexys4 Flash

- 1. Format a flash drive to have 1 fat32 partition
- 2. In vivado, click generate bitstream and afterwards do file->Export->Export\_Bitstream\_File to flash top-level directory
- 3. On the nexys 4, switch jumper JP1 to be on the USB/SD mode
- 4. Plug the usb stick into the nexys 4 while it's off and then power on. A yellow LED will flash while the bitstream is being loaded. When it's done, the green DONE led will turn on
- 5. You can remove the usb drive after your code is running

6.111 Fall 2017 Lecture 9 45



#### Vivado Simulation



6.111 Fall 2017 Lecture 9

#### Test Bench

```
module sample tf;
   // Inputs
                                                      module sample(
   reg bit in;
                                                         Tinput bit in,
   reg [3:0] bus_in;
                                                         input [3:0] bus_in,
   // Outputs
                                                         Toutput out_bit,
   wire out_bit;
                                                         output [7:0] out_bus
   wire [7:0] out_bus;
                                                         . . . Verilog . . .
   // Instantiate the Unit Under Test (UUT)
                                                      endmodule
   sample uut (
      .bit_in(bit_in),
      .bus_in(bus_in),
      .out bit(out bit).
      .out_bus(out_bus)
   initial begin
      // Initialize Inputs
      bit_in = 0;
                                                   inputs must be initialized
      bus_in = 0;
      // Wait 100 ns for global reset to finish
      #100:
      // Add stimulus here
    end
endmodule
```

6.111 Fall 2017 Lecture 9