
# Solving the ALU Instruction Data Hazard in the Basic Pipeline

• Data Forwarding

• The ALU instruction data hazard arises when an ALU instruction produces a new value for a register but that register is not updated immediately, so the instructions that follow may read the old value.

• The solution is to make the new value available to the next instruction using a "feedback" circuitry.

• The technique used is similar to one found in caches

• Data Forwarding Hardware for ALU instructions

• Let us first examine the hardware needed to solve the ALU instruction data hazard:

We need to add the following circuits:

• Two registers (made with D flip-flops) called ForwReg1 and ForwReg2. The registers are called Forwarding Registers because they will be used to forward fresh values to the ALU.

The input of ForwReg1 is tied to the output of the ALU,

The input of ForwReg2 is tied to the output of ForwReg1,

So: ForwReg1 will contain the result of the most recent ALU instruction and ForwReg2 will contain the result of the instruction before that one.

• Associated with each Forwarding Register is a tag register. The tag register records the register number associated with the Forwarding Register (it indicates which register the value belongs to).

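The input of tag register 1 is the destination field of the instruction (this field contains the number of the register that will receive the value). The input of tag register 2 is tied to the output of tag register 1, so each tag moves along with its value.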

• The multiplexors in the EX stage are augmented.

The first multiplexor (yellow one) will select from among: PC1, A, ForwReg1 and ForwReg2.

The selection logic of this multiplexor is as follows:

```
if ( Instruction == Branch )
   select PC1 as operand
else if ( tag1 == Src1 )
   select ForwReg1 as operand
else if ( tag2 == Src1 )
   select ForwReg2 as operand
else
   select A as operand
```

The reason for this should be clear:

• A branch instruction does not use any registers to compute the target address and needs PC1 as its first operand, so we select PC1 when the instruction is a "branch".

• When we have an ALU, Load or Store, we need to be careful where we get the operand from.

If the register where we get the operand is the register updated by the last instruction (that is, when tag1 == Src1), we use the value stored in ForwReg1 as operand.

If the register where we get the operand is not the register updated by the last instruction but was updated by the instruction before the last instruction (that is, when tag2 == Src1), we use the value stored in ForwReg2 as operand.

If neither case holds, we can use the value fetched from the register without any problem.

The second multiplexor (gold colored one) will select from among: IR1 (a constant), B, ForwReg1 and ForwReg2.

The selection logic of this multiplexor is as follows:

```
if ( Instruction Imm bit is set to 1 )
   select IR1 as operand
else if ( tag1 == Src2 )
   select ForwReg1 as operand
else if ( tag2 == Src2 )
   select ForwReg2 as operand
else
   select B as operand
```

The reason for this is similar to the first case:

• An instruction that has the Imm bit set will use a constant as the second operand, so we select IR1 as operand when the Imm bit is set to "1"

• When we have an ALU, Load or Store, again, we need to be careful where we get the operand from.

If the register where we get the operand is the register updated by the last instruction (that is, when tag1 == Src2), we use the value stored in ForwReg1 as operand.

If the register where we get the operand is not the register updated by the last instruction but was updated by the instruction before the last instruction (that is, when tag2 == Src2), we use the value stored in ForwReg2 as operand.

If neither case holds, we can use the value fetched from the register without any problem.
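The two selection rules above can be written as a small behavioral sketch in Python. This is only an illustration, not the hardware itself: the function names, the parameter names and the placeholder `STALE_R1` are assumptions made for this example, and register numbers are plain integers (1 stands for R1, 4 for R4).

```python
def select_operand1(is_branch, src1, tag1, tag2, pc1, a, forw1, forw2):
    """First ALU operand: PC1, ForwReg1, ForwReg2 or the A register."""
    if is_branch:
        return pc1          # branches compute the target from PC1
    if tag1 == src1:
        return forw1        # result of the last instruction
    if tag2 == src1:
        return forw2        # result of the instruction before the last
    return a                # the value fetched from the register is fine


def select_operand2(imm_bit, src2, tag1, tag2, ir1_const, b, forw1, forw2):
    """Second ALU operand: IR1 constant, ForwReg1, ForwReg2 or the B register."""
    if imm_bit:
        return ir1_const    # immediate instructions use the constant
    if tag1 == src2:
        return forw1
    if tag2 == src2:
        return forw2
    return b


# Hypothetical placeholder for the old value of R1 that was fetched into B;
# the notes do not say what that old value is.
STALE_R1 = -1

# ADD R4, R1, R4 right after ADD R2, R3, R1: tag1 = 1 (R1), ForwReg1 = 20.
# tag2 is None here because, in this sketch, ForwReg2 holds nothing yet.
op1 = select_operand1(False, 4, 1, None, 0, 1, 20, None)
op2 = select_operand2(False, 1, 1, None, 0, STALE_R1, 20, None)
print(op1 + op2)  # 21
```

The order of the tests matters: ForwReg1 is checked before ForwReg2 so that, if both tags match, the more recent value wins.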

Let us first look at an example and convince ourselves that the solution works. Later, I will show you how the two multiplexors in the EX stage are constructed (it's pretty straightforward).

• Same example...

• Reconsider the following program that is executed by the IMPROVED pipeline:

```   ADD R2, R3, R1            R2=11, R3=9, R4=1, R5=8, R6=0, R7=2
...
```

• The correct behavior (the one that an assembly programmer would expect) is: R1 := R2+R3 = 20, then the other instructions add R1=20 to registers R4 (20+1), R5 (20+8), R6 (20+0) and R7 (20+2)
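As a reference point, the expected results can be computed by running the instructions one at a time, with each register write landing before the next instruction reads. The sketch below is a hypothetical helper, not part of the pipeline: it assumes the operand order `ADD src1, src2, dest` (inferred from R1 := R2+R3) and includes only the three instructions that the walk-through names explicitly; the rest of the listing is elided in these notes and is not reproduced.

```python
def run_sequentially(program, regs):
    """Execute ADD src1, src2, dest one instruction at a time (no pipeline)."""
    regs = dict(regs)                      # do not modify the caller's values
    for src1, src2, dest in program:
        regs[dest] = regs[src1] + regs[src2]
    return regs


initial = {"R2": 11, "R3": 9, "R4": 1, "R5": 8, "R6": 0, "R7": 2}
program = [
    ("R2", "R3", "R1"),   # ADD R2, R3, R1
    ("R4", "R1", "R4"),   # ADD R4, R1, R4
    ("R5", "R1", "R5"),   # ADD R5, R1, R5
]

final = run_sequentially(program, initial)
print(final["R1"], final["R4"], final["R5"])  # 20 21 28
```

A pipeline with correct forwarding has to produce these same values.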

• CPU Cycle 1

• At start of the CPU cycle, the IF stage sends out PC
• At end of the CPU cycle, the IR(ID) register is updated with the instruction fetched (ADD R2, R3, R1)

• The picture above depicts the content of the CPU at end of the first CPU cycle (and the start of the 2nd cycle) - nothing special happened....

• CPU Cycle 2

• At start of the CPU cycle, the ID stage sends out selection signal that selects values from R2 and R3

• At end of the CPU cycle, A register is updated with R2 = 11, B register is updated with R3= 9.

• Also, at the end of the CPU cycle, the instruction (ADD R2, R3, R1) is moved into IR(EX) and instruction ADD R4, R1, R4 is fetched into IR(ID)

• The picture above depicts the content of the CPU at end of the second CPU cycle (and the start of the 3rd cycle) - still nothing special happened....

• CPU Cycle 3

• At start of the CPU cycle, the EX stage selects the values of R2 and R3 (now held in the A and B registers) for the ALU and uses the ALU opcode to make the ALU add them, forming the result 20 (which will become the value of R1)

The difference now is that the result (20) will also be written into Forwarding Register 1, along with the register tag = 001 (indicating register R1)

Also, at start of the CPU cycle, the ID stage selects R4 and R1 to be copied into the A and B registers,

Notice that an OLD value of R1 will still be fetched into B. (That is not a problem because we will find a way to obtain the more recent value from the forwarding registers - see next CPU cycle).

• The picture above depicts the content of the CPU at end of the 3rd CPU cycle (and the start of the 4th cycle)
• Something special is about to happen.... But first: note in the above picture that the NEW value (20) of R1 is one of the inputs of the Multiplexor (see the red line) along with the old value of R1 (see the magenta line).

• CPU Cycle 4

• At start of the CPU cycle, the EX stage will apply the new selection logic to select the operands

Notice that the Src2 field in "ADD R4, R1, R4" contains the bits 001 to indicate R1 !!!

For the first operand, the value from the A register (R4) is selected.

For the second operand, the value of Forwarding Register 1 is selected, because (tag1 (001) == Src2 (001)).

So the ALU will add R4 to the NEW value 20 for R1 (which has not yet arrived in R1 !!!)

Pictorially:

• At the end of the cycle, ALUo will be updated with 20+1 = 21, which is the CORRECT outcome (because R1 will be equal to 20).

• Also at the end of the cycle, the value in the Forwarding Register 1 is copied (preserved for one more CPU cycle) to Forwarding Register 2.

• At the end of the cycle, the result of the ALU (21) is copied into Forwarding Register 1 along with the register tag "100" indicating register R4 - this Forwarding Register value will aid in "correcting" a subsequent instruction that uses R4 as a source operand !!!

• The picture above depicts the content of the CPU at end of the 4th CPU cycle (and the start of the 5th cycle)
• Note in the above picture that the NEW value (20) of R1 is still available in Forwarding Register 2 as one of the inputs of the Multiplexor (see the red line) along with the old value of R1 (see the magenta line).
• Now you should understand why we need two forwarding registers cascaded one after the other. The new value of R1 is retained for exactly 2 CPU cycles, because we have previously determined that 2 instructions fail to obtain the more recent value.

• CPU Cycle 5

• At start of the CPU cycle, the EX stage will apply the same new selection magic - oops, I meant "logic" - to select the operands

Notice that the Src2 field in "ADD R5, R1, R5" contains the bits 001 to indicate R1 !!!

For the first operand, the value from the A register (R5) is selected.

For the second operand, the value of Forwarding Register 2 is selected, because (tag2 (001) == Src2 (001)).

So the ALU will add R5 to the NEW value 20 for R1 (which has still not arrived in R1 !!!)

Pictorially:

• At the end of the cycle, ALUo will be updated with 20+8 = 28, which is the CORRECT outcome (because R1 will be equal to 20).

• Also at the end of the cycle, the value in the Forwarding Register 1 is copied (preserved for one more CPU cycle) to Forwarding Register 2. So now the value of R1 is lost !!!

NOTE: That is not a problem because we have previously determined that the 3rd instruction following "ADD R2,R3,R1" is able to obtain the correct value straight from R1.

• At the end of the cycle, the result of the ALU (28) is copied into Forwarding Register 1 along with the register tag "101" indicating register R5 - this Forwarding Register value will aid in "correcting" a subsequent instruction that uses R5 as a source operand (while the value in Forwarding Register 2 is used to correct instructions that use R4 as a source operand) !!!

• The picture above depicts the content of the CPU at end of the 5th CPU cycle (and the start of the 6th cycle)
• Note that the NEW value (20) of R1 is no longer held in either forwarding register: Forwarding Register 1 now holds 28 (tag 101, R5) and Forwarding Register 2 holds 21 (tag 100, R4).
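The movement of values through the two cascaded forwarding registers during cycles 3 to 5 can be replayed with a short Python sketch. The class name `ForwardingChain`, its method names and the placeholder `STALE_R1` are assumptions made for this illustration; only the tags and values come from the walk-through above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwardingChain:
    """ForwReg1/ForwReg2 and their tag registers (None means nothing stored yet)."""
    tag1: Optional[int] = None
    val1: Optional[int] = None
    tag2: Optional[int] = None
    val2: Optional[int] = None

    def clock(self, dest_tag, alu_result):
        # Both registers latch at the end of the cycle: ForwReg2 takes the old
        # content of ForwReg1, and ForwReg1 takes the new ALU result.
        self.tag2, self.val2 = self.tag1, self.val1
        self.tag1, self.val1 = dest_tag, alu_result

    def lookup(self, src, fetched):
        # Same priority as the selection logic: ForwReg1, then ForwReg2,
        # then the value fetched from the register file.
        if self.tag1 == src:
            return self.val1
        if self.tag2 == src:
            return self.val2
        return fetched


STALE_R1 = -1  # hypothetical placeholder for the old value of R1

chain = ForwardingChain()
chain.clock(1, 11 + 9)                   # cycle 3: ADD R2, R3, R1 -> tag 001, 20
r4 = 1 + chain.lookup(1, STALE_R1)       # cycle 4: R1 comes from ForwReg1
chain.clock(4, r4)                       # tag 100, 21; R1 moves to ForwReg2
r5 = 8 + chain.lookup(1, STALE_R1)       # cycle 5: R1 comes from ForwReg2
chain.clock(5, r5)                       # tag 101, 28; R1 drops out of the chain
print(r4, r5)                            # 21 28
```

After the last `clock` call neither tag refers to R1 any more, which matches the observation that the forwarded value of R1 is lost at the end of cycle 5.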

• "Smart" Multiplexors used in Data Forwarding

• The input to the ALU is now selected from many different sources.

The first input of the ALU can be one of the following:

• The value of the PC (only when the instruction is a branch instruction)
• The A-register
• The first forwarding register (ForwReg1)
• The second forwarding register (ForwReg2)

The selection logic (algorithm) can be formulated as follows:

```
if ( instruction is a BRANCH instruction )
   select PC1;
else if ( Tag1 in ForwReg1 == Src1 in instruction )
   select ForwReg1;     // Because this reg. has the most recent value
else if ( Tag2 in ForwReg2 == Src1 in instruction )
   select ForwReg2;     // Because this reg. has the next recent value
else
   select A-register;
```

• The selection algorithm is implemented in hardware with compare circuits and multiplexors.

(I call this a "smart" multiplexor, but in reality, it is a "hardware if-statement" !!!)

• The following circuit can be used to compare 2 binary numbers and determine if they are equal:

• XNOR (Exclusive NOR) will output ONE if and only if both input bits are equal

• The circuit will output ONE if and only if ALL pairs of bits are equal....
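The comparator just described can be mimicked bit by bit in Python: one XNOR per pair of bits, with all XNOR outputs ANDed together. This is a sketch under assumptions: the function names are made up for the illustration, and the default width of 3 bits is taken from the 3-bit register tags (001, 100, 101) used in the example.

```python
def xnor(a, b):
    """Exclusive NOR of two single bits: 1 if and only if the bits are equal."""
    return 1 - (a ^ b)


def bits_equal(x, y, width=3):
    """Return 1 if the low `width` bits of x and y are pairwise equal, else 0."""
    result = 1
    for i in range(width):
        x_bit = (x >> i) & 1
        y_bit = (y >> i) & 1
        result &= xnor(x_bit, y_bit)   # AND together all the XNOR outputs
    return result


print(bits_equal(0b001, 0b001))  # 1: tag 001 matches Src 001
print(bits_equal(0b001, 0b100))  # 0: tag 001 does not match Src 100
```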

• This equality circuit is used with a number of multiplexors to construct the "if-else" selection algorithm in hardware.

• The following circuit diagram shows the implementation of the forwarding algorithm with multiplexors for the first input of the ALU (the yellow multiplexor):

(The red circle represents the "compare equal" circuit above).

• Notice that the last multiplexor has the final say over the ALU input: it chooses between PC1 and the output of the second-to-last multiplexor.

The PC1 value is selected when the instruction is a BRANCH instruction and otherwise, the value from the second last multiplexor is selected.
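One way to picture the cascade is as a chain of 2-to-1 multiplexors, each controlled by one condition. The Python sketch below is an assumption about the arrangement, since the circuit diagram is not reproduced in this text; the names `mux2` and `alu_input1` are hypothetical. It places the ForwReg2 test first, the ForwReg1 test second and the branch test last, so that later stages override earlier ones.

```python
def mux2(select, when_0, when_1):
    """2-to-1 multiplexor: pass when_1 if select is 1 (true), else when_0."""
    return when_1 if select else when_0


def alu_input1(is_branch, src1, tag1, tag2, pc1, a, forw1, forw2):
    eq1 = int(tag1 == src1)              # output of the "compare equal" circuit
    eq2 = int(tag2 == src1)
    stage1 = mux2(eq2, a, forw2)         # A register or ForwReg2
    stage2 = mux2(eq1, stage1, forw1)    # ForwReg1 overrides stage 1
    return mux2(is_branch, stage2, pc1)  # last mux: PC1 for a branch


# Both tags match Src1 = 1, so the more recent value in ForwReg1 is chosen.
print(alu_input1(False, 1, 1, 1, 100, 5, 20, 30))  # 20
```

Because each stage can override the one before it, the chain reproduces the priority of the if-else algorithm: branch first, then ForwReg1, then ForwReg2, then the A register.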

• Complete Picture

• The following figure shows how the connection is made to the Forwarding registers:

• The tag and value in the first forwarding register are copied to the second forwarding register

That's because we need to maintain their value for 2 instructions

(After 2 instructions, we can get the correct values from the registers !!!)

• The figure should be self-explanatory if you follow the above discussion....

• The MUX for the second source operand of the ALU is constructed in a similar manner.