#### Arithmetic Circuits & Multipliers - Addition, subtraction - Performance issues - -- ripple carry - -- carry bypass - -- carry skip - -- carry lookahead - Multipliers Reminder: Lab #3 due tonight! #### Signed integers: 2's complement 8-bit 2's complement example: $$11010110 = -2^7 + 2^6 + 2^4 + 2^2 + 2^1 = -128 + 64 + 16 + 4 + 2 = -42$$ If we use a two's complement representation for signed integers, the same binary addition mod 2<sup>n</sup> procedure will work for adding positive and negative numbers (don't need separate subtraction rules). The same procedure will also handle unsigned numbers! By moving the implicit location of "decimal" point, we can represent fractions too: $$1101.0110 = -2^3 + 2^2 + 2^0 + 2^{-2} + 2^{-3} = -8 + 4 + 1 + 0.25 + 0.125 = -2.625$$ #### Sign extension Consider the 8-bit 2's complement representation of: $$42 = 00101010$$ $-5 = \sim 00000101 + 1$ $= 11111010 + 1$ $= 11111011$ What is their 16-bit 2's complement representation? 42 = 000000000101010 #### Adder: a circuit that does addition Here's an example of binary addition as one might do it by "hand": Adding two N-bit numbers produces an $$(N+1)$$ -bit result $$\begin{array}{c} 1101 \\ 1101 \\ + 0101 \\ \hline 10010 \end{array}$$ If we build a circuit that implements one column: we can quickly build a circuit to add two 4-bit numbers... #### "Full Adder" building block The "half adder" circuit has only the A and B inputs $$S = A \oplus B \oplus C$$ $$CO = \overline{A}BC + A\overline{B}C + AB\overline{C} + ABC$$ $$= (\overline{A} + A)BC + (\overline{B} + B)AC + AB(\overline{C} + C)$$ $$= BC + AC + AB$$ #### Subtraction: A-B = A + (-B) Using 2's complement representation: -B = -B + 1= bit-wise complement So let's build an arithmetic unit that does both addition and subtraction. Operation selected by control input: 6.111 Fall 2016 #### Condition Codes Besides the sum, one often wants four other bits of information from an arithmetic unit: Z (zero): result is = 0 big NOR gate N (negative): result is < 0 $S_{N-1}$ C (carry): indicates an add in the most significant position produced a carry, e.g., 1111 + 0001 from last FA V (overflow): indicates that the answer has too many bits to be represented correctly by the result width, e.g., 0111 + 0111 $$V = A_{N-1}B_{N-1}\overline{S_{N-1}} + \overline{A_{N-1}}\overline{B_{N-1}}S_{N-1}$$ $$V = COUT_{N-1} \oplus CIN_{N-1}$$ To compare A and B, perform A-B and use condition codes: #### Signed comparison: LT N $\oplus$ V LE Z+(N $\oplus$ V) EQ Z NE $\sim$ Z GE $\sim$ (N $\oplus$ V) GT $\sim$ (Z+(N $\oplus$ V)) #### Unsigned comparison: LTU C LEU C+ZGEU $\sim C$ GTU $\sim (C+Z)$ #### Condition Codes in Verilog Z (zero): result is = 0 N (negative): result is < 0 C (carry): indicates an add in the most significant position produced a carry, e.g., 1111 + 0001 V (overflow): indicates that the answer has too many bits to be represented correctly by the result width, e.g., 0111 + 0111 ``` wire signed [31:0] a,b,s; wire z,n,v,c; assign {c,s} = a + b; assign z = ~|s; assign n = s[31]; assign v = a[31]^b[31]^s[31]^c; ``` Might be better to use sum-ofproducts formula for V from previous slide if using LUT implementation (only 3 variables instead of 4). #### Modular Arithmetic The Verilog arithmetic operators (+,-,\*) all produce full-precision results, e.g., adding two 8-bit numbers produces a 9-bit result. In many designs one chooses a "word size" (many computers use 32 or 64 bits) and all arithmetic results are truncated to that number of bits, i.e., arithmetic is performed modulo $2^{word \, size}$ . Using a fixed word size can lead to *overflow*, e.g., when the operation produces a result that's too large to fit in the word size. One can - Avoid overflow: choose a sufficiently large word size - •<u>Detect</u> overflow: have the hardware remember if an operation produced an overflow trap or check status at end - •Embrace overflow: sometimes this is exactly what you want, e.g., when doing index arithmetic for circular buffers of size 2N. - •"Correct" overflow: replace result with most positive or most negative number as appropriate, aka saturating arithmetic. Good for digital signal processing. #### Speed: tpD of Ripple-carry Adder $$C_O = AB + AC_I + BC_I$$ Worst-case path: carry propagation from LSB to MSB, e.g., when adding 11...111 to 00...001. $$t_{PD} = (N-1)^*(t_{PD,OR} + t_{PD,AND}) + t_{PD,XOR} \approx \Theta(N)$$ CI to CO CI<sub>N-1</sub> to S<sub>N-1</sub> $$t_{adder} = (N-1)t_{carry} + t_{sum}$$ ⊕(N) is read "order N": means that the latency of our adder grows at worst in proportion to the number of bits in the operands. 6.111 Fall 2016 Lecture 8 #### How about the $t_{PD}$ of this circuit? Is the $t_{PD}$ of this circuit = 2 \* $t_{PD,N-BIT\,RIPPLE}$ ? Nope! tpD of this circuit = tpD,N-BIT RIPPLE + tpD,FA!!! 6.111 Fall 2016 Lecture 8 #### Alternate Adder Logic Formulation How to Speed up the Critical (Carry) Path? (How to Build a Fast Adder?) | A | В | $C_{m{i}}$ | S | $C_{o}$ | Carry<br>status | |---|---|------------|---|---------|-----------------| | 0 | 0 | 0 | 0 | 0 | delete | | 0 | 0 | 1 | 1 | 0 | delete | | 0 | 1 | 0 | 1 | 0 | propagate | | 0 | 1 | 1 | 0 | 1 | propagate | | 1 | 0 | 0 | 1 | 0 | propagate | | 1 | 0 | 1 | 0 | 1 | propagate | | 1 | 1 | 0 | 0 | 1 | generate | | 1 | 1 | 1 | 1 | 1 | generate | Generate (G) = AB Propagate (P) = $A \oplus B$ $$C_o(G, P) = G + PC_i$$ $S(G, P) = P \oplus C_i$ Note: can also use P = A + B for $C_o$ #### Faster carry logic Let's see if we can improve the speed by rewriting the equations for $C_{OUT}$ : ``` module fa(input a,b,cin, output s,cout); wire g = a & b; wire p = a ^ b; assign s = p ^ cin; assign cout = g | (p & cin); endmodule ``` Actually, P is usually defined as $P = A^B$ which won't change $C_{OUT}$ but will allow us to express S as a simple function: $$S = P^C_{IN}$$ #### Virtex II Adder Implementation # Virtex II Carry Chain #### Carry Bypass Adder Key Idea: if $(P_0 P_1 P_2 P_3)$ then $C_{0,3} = C_{i,0}$ #### 16-bit Carry Bypass Adder # What is the worst case propagation delay for the 16-bit adder? Assume the following for delay each gate: P, G from A, B: 1 delay unit P, G, $C_i$ to $C_o$ or Sum for a C/S: 1 delay unit 2:1 mux delay: 1 delay unit #### Critical Path Analysis For the second stage, is the critical path: BP2 = 0 or BP2 = 1? Message: Timing analysis is very tricky -Must carefully consider data dependencies for <u>false paths</u> #### Carry Bypass vs Ripple Carry Ripple Carry: $$t_{adder} = (N-1) t_{carry} + t_{sum}$$ Carry Bypass: $$t_{adder} = 2(M-1) t_{carry} + t_{sum} + (N/M-1) t_{bypass}$$ M = bypass word size N = number of bits being added #### Carry Lookahead Adder (CLA) - Recall that $C_{OUT} = G + P C_{IN}$ where G = A&B and $P = A^B$ - For adding two N-bit numbers: $$C_{N} = G_{N-1} + P_{N-1}C_{N-1}$$ $$= G_{N-1} + P_{N-1}G_{N-2} + P_{N-1}P_{N-2}C_{N-2}$$ $$= G_{N-1} + P_{N-1}G_{N-2} + P_{N-1}P_{N-2}G_{N-3} + ... + P_{N-1}...P_{0}C_{IN}$$ $C_N$ in only 3 gate delays\*: 1 for P/G generation, 1 for ANDs, 1 for final OR \*assuming gates with N inputs • Idea: pre-compute all carry bits as $f(Gs,Ps,C_{IN})$ # Carry Lookahead Circuits #### The 74182 Carry Lookahead Unit # 74182 carry lookahead unit - high speed carry lookahead generator - used with 74181 to extend carry lookahead beyond 4 bits - correctly handles the carry polarity of the 181 $$\begin{split} C_{n+x} &= \overline{G0} \cdot \overline{P0} + \overline{G0} \cdot \overline{C}_n \\ &= \overline{G0} \cdot \overline{P0} \cdot \overline{G0} \cdot \overline{C}_n \\ &= (G0 + P0) \cdot (G0 + C_n) = G0 + P0C_n \\ &\triangleright C_4 = G_{3:0} + P_{3:0}C_n \\ C_{n+y} &= C_8 = G_{7:4} + P_{7:4}G_{3:0} + P_{7:4}P_{3:0}C_{i,0} = G_{7:0} + P_{7:0}C_n \\ C_{n+z} &= C_{12} = G_{11:8} + P_{11:8}G_{7:4} + P_{11:8}P_{7:4}G_{3:0} + P_{11:8}P_{7:4}P_{3:0}C_n \\ &= G_{11:0} + P_{11:0}C_n \end{split}$$ #### Block Generate and Propagate G and P can be computed for groups of bits (instead of just for individual bits). This allows us to choose the maximum fan-in we want for our logic gates and then build a hierarchical carry chain using these equations: $$C_{J+1} = G_{IJ} + P_{IJ}C_{I}$$ $$G_{IK} = G_{J+1,K} + P_{J+1,K}G_{IJ}$$ $$P_{IK} = P_{IJ}P_{J+1,K}$$ "generate a carry from bits I thru K if it is generated in the high-order (J+1,K) part of the block or if it is generated in the low-order (I,J) part of the block and then propagated thru the high part" where I < J and J+1 < K Hierarchical building block #### 8-bit CLA (P/G generation) From Hennessy & Patterson, Appendix A # 8-bit CLA (carry generation) ## 8-bit CLA (complete) #### Unsigned Multiplication Multiplying N-bit number by M-bit number gives (N+M)-bit result Easy part: forming partial products (just an AND gate since $B_{\rm I}$ is either 0 or 1) Hard part: adding M N-bit partial products #### Combinational Multiplier (unsigned) #### Combinational Multiplier (signed!) ## 2's Complement Multiplication (Baugh-Wooley) Step 1: two's complement operands so high order bit is $-2^{N-1}$ . Must sign extend partial products and subtract the last one Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1). ``` *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** *** *** *** *** *** *** *** *** *** *** *** ** *** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** * ``` Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away! Step 4: finish computing the constants... Result: multiplying 2's complement operands takes just about same amount of hardware as multiplying unsigned operands! #### Baugh Wooley Formulation - The Math no insight required Assuming X and Y are 4-bit twos complement numbers: $$X = -2^3 x_3 + \sum_{i=0}^{2} x_i 2^i$$ $Y = -2^3 y_3 + \sum_{i=0}^{2} y_i^2 2^i$ The product of X and Y is: $$XY = x_3 y_3 2^6 - \sum_{i=0}^{2} x_i y_3 2^{i+3} - \sum_{j=0}^{2} x_3 y_j 2^{j+3} + \sum_{i=0}^{2} \sum_{j=0}^{2} x_i y_j 2^{i+j}$$ For twos complement, the following is true: $$-\sum x_i^{3i}_{i=0} = -2^4 + \sum x_i^{3i}_{i=0} + 1$$ The product then becomes: $$\begin{aligned} XY &= x_3 y_3 2^6 + \sum_{i=0}^{2} x_i y_3 2^{i+3} + 2^3 - 2^6 + \sum_{j=0}^{2} x_3 y_j 2^{j+3} + 2^3 - 2^6 + \sum_{i=0}^{2} \sum_{j=0}^{2} x_i y_j 2^{i+j} \\ &= x_3 y_3 2^6 + \sum_{i=0}^{2} x_i y_3 2^{i+3} + \sum_{j=0}^{2} x_3 y_j 2^{j+3} + \sum_{i=0}^{2} \sum_{j=0}^{2} x_i y_j 2^{i+j} + 2^4 - 2^7 \\ &= -2^7 + x_3 y_3 2^6 + \overline{(x_2 y_3 + x_3 y_2)} 2^5 + \overline{(x_1 y_3 + x_3 y_1 + x_2 y_2 + 1)} 2^4 \\ &+ \overline{(x_0 y_3 + x_3 y_0 + x_1 y_2 + x_2 y_1)} 2^3 + (x_0 y_2 + x_1 y_1 + x_2 y_0) 2^{2} + (x_0 y_1 + x_1 y_0) 2^4 + (x_0 y_0) 2^0 \end{aligned}$$ ## 2's Complement Multiplication #### Multiplication in Verilog You can use the "\*" operator to multiply two numbers: ``` wire [9:0] a,b; wire [19:0] result = a*b; // unsigned multiplication! ``` If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration: ``` wire signed [9:0] a,b; wire signed [19:0] result = a*b; // signed multiplication! ``` Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. Same is true of the >>> (arithmetic right shift) operator. To get signed operations all operands must be signed. To make a signed constant: 10'sh37C #### Multiplication on the FPGA Hardware multiplier block: two 18-bit twos complement (signed) operands In the XC2V6000: 6 columns of mults, 24 in each column = 144 mults #### Sequential Multiplier Assume the multiplicand (A) has N bits and the multiplier (B) has M bits. If we only want to invest in a single N-bit adder, we can build a sequential circuit that processes a single partial product at a time and then cycle the circuit M times: ``` Init: P←0, load A and B Repeat M times { P ← P + (B<sub>LSB</sub>==1 ? A : 0) shift P/B right one bit } Done: (N+M)-bit result in P/B ``` #### Bit-Serial Multiplication ``` Init: P = 0; Load A,B Repeat M times { Repeat N times { shift A,P: Amsb = Alsb Pmsb = Plsb + Alsb*Blsb + C/0 } shift P,B: Pmsb = C, Bmsb = Plsb } (N+M)-bit result in P/B ``` # Combinational Multiplier (unsigned) # Useful building block: Carry-Save Adder Last stage is still a carry-propagate adder (CPA) ## Wallace Tree Multiplier This is called a 3:2 counter by multiplier hackers: counts number of 1's on the 3 inputs, outputs 2bit result. #### Wallace Tree: Combine groups of three bits at a time Higher fan-in adders can be used to further reduce delays for large M. 4:2 compressors and 5:3 counters are popular building blocks. # Wallace Tree \* Four Bit Multiplier Figure 11-35 Wallace tree for four-bit multiplier. \*Digital Integrated Circuits J Rabaey, A Chandrakasan, B Nikolic # Multiplication by a constant - If one of the operands is a constant, make it the multiplier (B in the earlier examples). For each "1" bit in the constant we get a partial product (PP) may be noticeably fewer PPs than in the general case. - For example, in general multiplying two 4-bit operands generates four PPs (3 rows of full adders). If the multiplier is say, 12 (4'b1100), then there are only two PPs: 8\*A+4\*A (only 1 row of full adders). - But lots of "1"s means lots of PPs... can we improve on this? - If we allow ourselves to subtract PPs as well as adding them (the hardware cost is virtually the same), we can re-encode arbitrarily long contiguous runs of "1" bits in the multiplier to produce just two PPs. ``` ...011110... = ...1000000... - ...000010... = ...0100010... ``` where $\overline{1}$ indicates subtracting a PP instead of adding it. Thus we've reencoded the multiplier using 1,0,-1 digits - aka *canonical signed digit* - greatly reducing the number of additions required. # Booth Recoding: Higher-radix mult. Idea: If we could use, say, 2 bits of the multiplier in generating each partial product we would halve the number of columns and halve the latency of the multiplier! 6.111 Fall 2016 Lecture 8 42 ## Booth recoding On-the-fly canonical signed digit encoding! | current bit pair | | | from previous bit pair | | | |------------------|------------------|----------------|------------------------|---------|----------| | | B <sub>K+1</sub> | B <sub>K</sub> | B <sub>K-1</sub> | action | | | | 0 | 0 | 0 | add 0 | | | | 0 | 0 | 1 | add A | | | | 0 | 1 | 0 | add A | | | | 0 | 1 | 1 | add 2*A | | | | 1 | 0 | 0 | sub 2*A | | | | 1 | 0 | 1 | sub A | ← -2*A+A | | | 1 | 1 | 0 | sub A | | | | 1 | 1 | 1 | add 0 | ← -A+A | | | | | lack | • | | A "1" in this bit means the previous stage needed to add 4\*A. Since this stage is shifted by 2 bits with respect to the previous stage, adding 4\*A in the previous stage is like adding A in this stage! ## Summary - Performance of arithmetic blocks dictate the performance of a digital system - Architectural and logic transformations can enable significant speed up (e.g., adder delay from $\mathcal{O}(N)$ to $\mathcal{O}(\log_2(N))$ - Similar concepts and formulation can be applied at the system level - Timing analysis is tricky: watch out for false paths! - Area-Delay trade-offs (serial vs. parallel implementations) # Lab 4 Car Alarm - Design Approach - Read lab/specifications carefully, use reasonable interpretation - Use modular design don't put everything into labkit.v - Design the FSM! - Define the inputs - Define the outputs - Transition rules - Logical modules: - fsm.v - timer.v // the hardest module!! - siren.v - fuel\_pump.v - Run simulation on each module! - Use hex display: show state and time - Use logic analyzer in Vivado # Car Alarm - Inputs & Outputs Figure 1: System diagram showing sensors (inputs) and actuators (outputs) ### Car Alarm - CMOS Implementation #### Design Specs - Operating voltage 8-18VDC - Operating temp: -40C +65C - Attitude: sea level - Shock/Vibration #### Notes - Protected against 24V power surges - CMOS implementation - CMOS inputs protected against 200V noise spikes - On state DC current <10ma - Include T\_PASSENGER\_DELAY and Fuel Pump Disable - System disabled (cloaked) when being serviced. 47 ## Debugging Hints - Lab 4 Implement a warp speed debug mode for the one hz clock. This will allow for viewing signals on the logic analyzer or Modelsim without waiting for 27/25 million clock cycles. Avoids recomplilations. ``` assign warp_speed = sw[6]; always @ (posedge clk) begin if (count == (warp_speed ? 3 : 26_999_999)) count <= 0; else count <= count +1; end assign one_hz = (count == (warp_speed ? 3 : 26_999_999));</pre> ``` ### One Hz Ticks in Modelsim To create a one hz tick, use the following in the Verilog test fixture: ``` always #5 clk=!clk; always begin #5 tick = 1; #10 tick = 0; #15; end initial begin // Initialize Inputs clk = 0; tick = 0; . . . . ``` # For Loops, Repeat Loops in Simulation ``` integer i; // index must be declared as integer integer irepeat; // this will just wait 10ns, repeated 32x. // simulation only! Cannot implement #10 in hardware! irepeat =0; repeat(32) begin #10: irepeat = irepeat + 1; end // this will wait #10ns before incrementing the for loop for (i=0; i<16; i=i+1) begin #10; // wait #10 before increment. // @(posedge clk); // add to index on posedge end // other loops: forever, while ``` ## Edge Detection ``` reg signal_delayed; ``` ``` always @(posedge clk) signal_delayed <= signal;</pre> ``` ``` assign rising_edge = signal && !signal_delayed; assign falling_edge = !signal && signal_delayed; ``` ## Student Comments - "All very reasonable except for lab 4, Car Alarm. Total pain in the ass." - "The labs were incredibly useful, interesting, and helpful for learning. Lab 4 (car alarm) is long and difficult, but overall the labs are not unreasonable."