In a 5-stage pipelined Beta, when does the hardware use its
ability to insert a NOP into the instruction stream at the IF stage
(using the MUX controlled by AnnulIF)?
In a 5-stage pipelined Beta, when does the hardware use its
ability to insert a NOP into the instruction stream at the ALU stage
(using the MUX controlled by AnnulALU)?
Ben Bitdiddle is thinking about modifying a 5-stage pipelined
Beta to add a "Jump if Memory Zero" instruction (JMZ) that fetches
the contents of a memory location and jumps if the fetched value
is zero. How many branch delay slots would follow a JMZ instruction
in the modified 5-stage pipelined Beta?
PUSH (R1)
PUSH (R2)
LD (BP, -12, R0)
LD (BP, -16, R1)
CMPEQ (R0, R1, R2)
BT (R2, L1)
When the CMPEQ is executed, assuming no interrupts, where does the value for R0 come from? How about the value for R1? (The choices are the register file or a bypass from one of the pipeline stages.)
Which of the following pipeline hazards cannot be dealt with
transparently and at no performance cost by bypassing?
The number of branch delay slots reflects
loop: LD(R31, status, R0)
BEQ(R0, loop, R31)
ADD (R0, R1, R2)
The following pipeline diagram illustrates the execution of this
instruction sequence on a standard 5-stage pipelined Beta.
How many clock cycles does it take to execute one iteration of the
2-instruction loop given?
What aspect of the instruction sequence causes NOP1 to be
inserted into the pipeline?
What aspect of the instruction sequence causes NOP2 to be
inserted into the pipeline?
What aspect of the instruction sequence causes NOP3 to be
inserted into the pipeline?
ADD(R31,R31,R31) | NOP
ADD(R1,R2,R1)
LD(R1,4,R1)
SUB(R1,R5,R6)
ORC(R1,123,R1)
SHL(R1,R1,R1)
Which input is selected by the Ra bypass MUX when the ADD instruction
is in the ALU stage?
Which input is selected by the Ra bypass MUX when the LD instruction
is in the WB stage?
Each of the following scenarios shows a snapshot
of a 5-stage Beta executing a sample code sequence. For each scenario,
indicate the appropriate settings for the bypass muxes, the IR muxes,
and the IR/ALU regs load enable signals. Then draw another snapshot
showing the state of the 5-stage Beta on the following cycle.
. = 0x200
ADDC(R31,10,R0)
ADD(R2,R0,R1)
CMPLE(R0,R1,R2)
BT(R2,Loop,R31)

. = 0x100
Loop: ADD(R1,R2,R3)
CMPLEC(R3,100,R0)
BT(R0,Loop,R31)
SHLC(R3,1,R3)

. = 0x60
LD(R31,124,R0)
ADDC(R0,1,R0)
ST(R0,124,R31)



. = 0x60
LD(R31,-1,R0)
ADDC(R0,1,R0)

. = 0x100
ADD(...)
MUL(...)
SUB(...) | Interrupt here
...

. = 0x0
IHANDLER: ADDC(SP,4,SP) | PUSH(XP)
ST(XP,-4,SP)



ADDC(R31, 3, R0)
SUBC(R0, 1, R1)
MUL(R0, R1, R2)
XOR(R0, R2, R3)
ST(R3, 0x1000, R31)

ADDC(R31, 3, R0)      | R0 = 3
SUBC(R0, 1, R1)       | R1 = 2, R0 bypassed from ALU
MUL(R0, R1, R2)       | R2 = 6, R0 bypassed from MEM, R1 bypassed from ALU
XOR(R0, R2, R3)       | R3 = 5, R0 bypassed from WB, R2 bypassed from ALU
ST(R3, 0x1000, R31)   | 5 is stored in location 0x1000, R3 bypassed from ALU

ADDC(R31, 3, R0)      | R0 = 3
SUBC(R0, 1, R1)       | R1 = 2, R0 bypassed from ALU
MUL(R0, R1, R2)       | R2 = 6, R0 bypassed from MEM, R1 bypassed from ALU
XOR(R0, R2, R3)       | R3 = 6, R0 bypassed from WB (as 0), R2 bypassed from ALU
ST(R3, 0x1000, R31)   | 6 is stored in location 0x1000, R3 bypassed from ALU
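The two annotated traces above differ only in the value that the WB bypass delivers to the XOR. As a rough check of the arithmetic, the following Python sketch replays the five instructions. The function name `stored_value` and the flag `wb_bypass_ok` are hypothetical, and the assumption that a faulty WB bypass supplies 0 is taken from the "(as 0)" note in the second trace.

```python
def stored_value(wb_bypass_ok=True):
    """Replay the ADDC/SUBC/MUL/XOR/ST sequence and return the value stored.

    wb_bypass_ok=False models the second trace, where the WB bypass path
    is assumed to deliver 0 instead of the real R0 value.
    """
    r0 = 0 + 3                      # ADDC(R31, 3, R0): R31 always reads as 0
    r1 = r0 - 1                     # SUBC(R0, 1, R1): R0 bypassed from ALU
    r2 = r0 * r1                    # MUL(R0, R1, R2): R0 from MEM, R1 from ALU
    r0_seen_by_xor = r0 if wb_bypass_ok else 0   # R0 arrives via the WB bypass
    r3 = r0_seen_by_xor ^ r2        # XOR(R0, R2, R3)
    return r3                       # ST(R3, 0x1000, R31) stores this value


print(stored_value(True))    # 5
print(stored_value(False))   # 6
```

Only the XOR is affected because it is the only instruction whose R0 operand travels through the WB bypass; the earlier consumers pick up R0 from the ALU and MEM bypasses.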
LDR(.+8,LP)
BR(f,r31)
LONG(.+4)
P-R-U reasons that instructions that leave out the MEM stage can
complete a cycle earlier and thus most programs will run 20% faster!
In your answers below assume that both the original and the P-R-U
pipelined implementations are fully bypassed.
foo: LONG( 0 )
LD( foo, R0 )
ADD( R1, R2, R3 )
To execute this sequence correctly, the pipeline diagram must look like this:
The stall occurs when the ADD and the LD attempt to use the WB stage
at the same time, forcing the ADD instruction to remain in a wait
stage during t5.
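The conflict described above can be sketched with a small scheduling model in Python. This is a simplified illustration, not the P-R-U hardware: the `STAGES` table and the function `wb_schedule` are hypothetical names, one instruction is assumed to be fetched per cycle starting at t1, and the knock-on effect of a wait cycle on later instructions is ignored.

```python
# Stages each instruction class passes through in the P-R-U pipeline
# (assumption: only loads use the MEM stage in this two-instruction example).
STAGES = {
    "LD":  ["IF", "RF", "ALU", "MEM", "WB"],
    "ADD": ["IF", "RF", "ALU", "WB"],
}


def wb_schedule(opcodes):
    """Return (opcode, WB cycle, wait cycles) for each instruction.

    Instruction i is fetched in cycle i + 1. If its WB cycle is already
    taken by an earlier instruction, it waits until WB is free.
    """
    wb_busy = set()
    schedule = []
    for i, op in enumerate(opcodes):
        fetch_cycle = i + 1
        wb_cycle = fetch_cycle + len(STAGES[op]) - 1
        waits = 0
        while wb_cycle in wb_busy:   # structural hazard on the WB stage
            wb_cycle += 1
            waits += 1
        wb_busy.add(wb_cycle)
        schedule.append((op, wb_cycle, waits))
    return schedule


print(wb_schedule(["LD", "ADD"]))   # [('LD', 5, 0), ('ADD', 6, 1)]
```

Under these assumptions both instructions would reach WB in t5, so the ADD waits one cycle and writes back in t6, which matches the wait stage described above.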
S1: ADD(R1, R2, R3)
SUB(R2, R3, R4)
CMPLT(R3, R4, R5)

S2: ADD(R1, R2, R3)
NOP
SUB(R2, R3, R4)
NOP
CMPLT(R3, R4, R5)

S3: ADD(R1, R2, R3)
NOP
SUB(R2, R3, R4)
CMPLT(R3, R4, R5)

ADDC(R31, 10, R0)
SUBC(R0, 5, R1)
ANDC(R0, 6, R2)
ORC(R0, 7, R3)
CMPLTC(R0, 11, R4)
The CMPLTC will be the first instruction to fetch the new value of
R0. All the preceding instructions will be using the previous value(s)
of R0. The ADDC instruction is in the Write Back stage while ORC is in
the Register File stage, so the new R0 is not written back in time for
the ORC to read it.
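The timing argument above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the producer is fetched in cycle 1 (so it occupies WB in cycle 5), each following instruction is fetched one cycle later, and a register written during WB can only be read in RF in a later cycle. The function name `first_reader_of_new_value` is hypothetical.

```python
def first_reader_of_new_value(followers):
    """Return the first following instruction that reads the updated register.

    The producer occupies IF, RF, ALU, MEM, WB in cycles 1 through 5.
    The k-th follower (k = 1, 2, ...) is in its RF stage in cycle 2 + k.
    """
    producer_wb_cycle = 5
    for k, name in enumerate(followers, start=1):
        rf_cycle = 2 + k
        if rf_cycle > producer_wb_cycle:   # register file already updated
            return name
    return None


print(first_reader_of_new_value(["SUBC", "ANDC", "ORC", "CMPLTC"]))  # CMPLTC
```

The third follower (ORC) is in RF in cycle 5, the same cycle in which the ADDC is in WB, so it still reads the old value; the fourth follower is the first to see the new one.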
For the working Beta, S1, S2, and S3 all compute the same results.
Initially: Reg[ R1 ] = -1, Reg[ R2 ] = 1, Reg[ R3 ] = 5, Reg[ R4 ] = -1
ADD( R1, R2, R3 )     Reg[ R3 ] = Reg[ R1 ] + Reg[ R2 ] = (-1) + 1 = 0
SUB( R2, R3, R4 )     Reg[ R4 ] = Reg[ R2 ] - Reg[ R3 ] = 1 - 0 = 1
CMPLT( R3, R4, R5 )   Reg[ R5 ] = (Reg[ R3 ] < Reg[ R4 ]) = (0 < 1) = 1
so Reg[ R5 ] = 1 for all three cases. For the Buba (the notes below flag cases in which the Buba differs from a working Beta because the most recently calculated result is not being used):
S1:
ADD( R1, R2, R3 )     Reg[ R3 ] = Reg[ R1 ] + Reg[ R2 ] = (-1) + 1 = 0
| new value of Reg[R3] not available yet
SUB( R2, R3, R4 )     Reg[ R4 ] = Reg[ R2 ] - Reg[ R3 ] = 1 - 5 = -4
| new values of Reg[ R3 ] and Reg[ R4 ] not available yet
CMPLT( R3, R4, R5 )   Reg[ R5 ] = (Reg[ R3 ] < Reg[ R4 ]) = (5 < -1) = 0
Reg[ R5 ] = 0

S2:
ADD( R1, R2, R3 )     Reg[ R3 ] = Reg[ R1 ] + Reg[ R2 ] = (-1) + 1 = 0
NOP
| new value of Reg[ R3 ] not available yet
SUB( R2, R3, R4 )     Reg[ R4 ] = Reg[ R2 ] - Reg[ R3 ] = 1 - 5 = -4
NOP
| new value of Reg[ R4 ] not available yet (but Reg[ R3 ] is available)
CMPLT( R3, R4, R5 )   Reg[ R5 ] = (Reg[ R3 ] < Reg[ R4 ]) = (0 < -1) = 0
Reg[ R5 ] = 0

S3:
ADD( R1, R2, R3 )     Reg[ R3 ] = Reg[ R1 ] + Reg[ R2 ] = (-1) + 1 = 0
NOP
| new value of Reg[ R3 ] not available yet
SUB( R2, R3, R4 )     Reg[ R4 ] = Reg[ R2 ] - Reg[ R3 ] = 1 - 5 = -4
| new values of Reg[ R3 ] and Reg[ R4 ] not available yet
CMPLT( R3, R4, R5 )   Reg[ R5 ] = (Reg[ R3 ] < Reg[ R4 ]) = (5 < -1) = 0
Reg[ R5 ] = 0
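The Buba results above can be reproduced with a small Python model. This is a sketch under one assumption drawn from the R0 example earlier: with no bypassing and no stalls, a result only becomes visible to the fourth instruction slot after the one that produced it. The function `run_buba`, the `delay` parameter, and the tuple encoding of instructions are all hypothetical; setting `delay=1` approximates a working, fully bypassed Beta.

```python
def run_buba(program, initial_regs, delay=4):
    """Run `program`; each result becomes readable `delay` slots later."""
    regs = dict(initial_regs)
    pending = []   # (first slot that can read the write, register, value)
    for slot, instr in enumerate(program):
        # Retire every write that has reached the register file by now.
        for _, reg, value in [p for p in pending if p[0] <= slot]:
            regs[reg] = value
        pending = [p for p in pending if p[0] > slot]
        if instr == "NOP":
            continue
        op, ra, rb, rc = instr
        a, b = regs[ra], regs[rb]
        if op == "ADD":
            result = a + b
        elif op == "SUB":
            result = a - b
        elif op == "CMPLT":
            result = 1 if a < b else 0
        else:
            raise ValueError("unsupported opcode: " + op)
        pending.append((slot + delay, rc, result))
    for _, reg, value in pending:   # let the remaining writes complete
        regs[reg] = value
    return regs


ADD = ("ADD", "R1", "R2", "R3")
SUB = ("SUB", "R2", "R3", "R4")
CMPLT = ("CMPLT", "R3", "R4", "R5")
SEQUENCES = {
    "S1": [ADD, SUB, CMPLT],
    "S2": [ADD, "NOP", SUB, "NOP", CMPLT],
    "S3": [ADD, "NOP", SUB, CMPLT],
}
INITIAL = {"R1": -1, "R2": 1, "R3": 5, "R4": -1}

for name, seq in SEQUENCES.items():
    buba_r5 = run_buba(seq, INITIAL)["R5"]
    beta_r5 = run_buba(seq, INITIAL, delay=1)["R5"]
    print(name, "Buba R5 =", buba_r5, "working Beta R5 =", beta_r5)
```

With these inputs the model gives R5 = 0 for all three sequences on the Buba and R5 = 1 on the working Beta, in agreement with the hand calculation.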
ADD(R3, R4, R5)
SUB(R5, R6, R7)
ADD(R1, R2, R3)
MUL(R7, R1, R2)
ADD(R4, R3, R5)
CMPLE(R7, R8, R9)
DIV(R7, R8, R10)
BEQ(R5, done)
ADDC(R1, 1, R5)

ADD( R3, R4, R5 )
NOP
NOP
NOP | Reg[R5] has not yet been updated
SUB( R5, R6, R7 )
ADD( R1, R2, R3 )
NOP
NOP | Reg[R7] has not yet been updated
MUL( R7, R1, R2 )
ADD( R4, R3, R5 )
CMPLE( R7, R8, R9 )
DIV( R7, R8, R10 )
NOP | Reg[R5] has not yet been updated
BEQ( R5, done )
NOP
ADDC( R1, 1, R5 )
The NOP after the BEQ instruction is necessary so that ADDC will only be executed if the branch is not taken.
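One way to sanity-check a NOP-padded schedule like the one above is a simple distance check, sketched here in Python. It assumes the same rule as before: a register written by one instruction cannot be read correctly until at least four slots later. The function `raw_hazards` and the tuple encoding (name, source registers, destination register) are hypothetical, and the check covers only read-after-write spacing, not the branch delay slot.

```python
def raw_hazards(program, delay=4):
    """List (slot, instruction, register) reads that occur too early."""
    last_write = {}   # register -> slot of its most recent writer
    problems = []
    for slot, instr in enumerate(program):
        if instr is None:            # None stands for a NOP
            continue
        name, sources, dest = instr
        for reg in sources:
            if reg in last_write and slot - last_write[reg] < delay:
                problems.append((slot, name, reg))
        if dest is not None:
            last_write[dest] = slot
    return problems


NOP = None
ORIGINAL = [
    ("ADD", ["R3", "R4"], "R5"), ("SUB", ["R5", "R6"], "R7"),
    ("ADD", ["R1", "R2"], "R3"), ("MUL", ["R7", "R1"], "R2"),
    ("ADD", ["R4", "R3"], "R5"), ("CMPLE", ["R7", "R8"], "R9"),
    ("DIV", ["R7", "R8"], "R10"), ("BEQ", ["R5"], None),
    ("ADDC", ["R1"], "R5"),
]
PADDED = [
    ("ADD", ["R3", "R4"], "R5"), NOP, NOP, NOP,
    ("SUB", ["R5", "R6"], "R7"), ("ADD", ["R1", "R2"], "R3"), NOP, NOP,
    ("MUL", ["R7", "R1"], "R2"), ("ADD", ["R4", "R3"], "R5"),
    ("CMPLE", ["R7", "R8"], "R9"), ("DIV", ["R7", "R8"], "R10"), NOP,
    ("BEQ", ["R5"], None), NOP, ("ADDC", ["R1"], "R5"),
]

print(len(raw_hazards(ORIGINAL)) > 0)   # True: the unpadded code reads stale values
print(raw_hazards(PADDED))              # []
```

The padded listing passes because every consumer sits exactly four slots after its producer; the reordering lets useful instructions fill slots that would otherwise hold NOPs.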
XAdr: ADDC(SP,4,SP)
ST(R0,-4,SP)
...
First, consider this code fragment:
. = 0x1234
start: CMPLTC(R1,0,R2)
SUB(R3,R2,R3)
XOR(R0,R3,R0)
MUL(R1,R2,R3)
SHLC(R1,2,R4)
The interrupt causes the address 0x123C to be stored in XP. When
the interrupt handler is done, it should return to the SUB instruction
at 0x1238. If it returned to the address in XP instead, the SUB
instruction would never be executed, because it had not yet executed
when the interrupt was taken.
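The address arithmetic can be illustrated with a short Python sketch. The dictionary `PROGRAM` simply lays out the fragment at consecutive 4-byte addresses starting at 0x1234, and `return_address` is a hypothetical helper that models the handler subtracting 4 from XP before returning.

```python
# The code fragment laid out at consecutive word addresses from 0x1234.
PROGRAM = {
    0x1234: "CMPLTC(R1,0,R2)",
    0x1238: "SUB(R3,R2,R3)",
    0x123C: "XOR(R0,R3,R0)",
    0x1240: "MUL(R1,R2,R3)",
    0x1244: "SHLC(R1,2,R4)",
}


def return_address(xp):
    """Address the handler should resume at: one instruction before XP."""
    return xp - 4


xp = 0x123C
print(PROGRAM[xp])                  # XOR(R0,R3,R0): resuming here skips the SUB
print(PROGRAM[return_address(xp)])  # SUB(R3,R2,R3): the interrupted instruction
```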
skip: BR(next)
CMPLTC(R1,0,R2)
ADD(R3,R2,R3)
next: XOR(R0,R3,R0)
MUL(R1,R2,R3)
SHLC(R1,2,R4)
Complete the diagram for normal execution of the instructions starting
at skip.
After the interrupt handler is finished, it will return to the
CMPLTC instruction. That clearly is not the correct behavior because
we want the branch to be taken and CMPLTC to be annulled.
X: BR(Y)
Y: BR(X)
The address stored in XP is that of the instruction following the
BR(X), so before returning from the interrupt handler, XP is adjusted
so that it holds the address of the BR(X). A similar argument applies
if the interrupt arrives while the branch delay slot of the BR(X)
instruction is being annulled.