CS355 Syllabus

We now examine in detail how the basic pipeline CPU executes the load instruction.

• Our little toy pipelined CPU will be able to support 2 types of load instructions:

• LD [R1 + R2], R3: load word from memory address (location) given by R1+R2 into register R3.
• LD [R1 + N], R3: load word from memory address (location) given by R1+N (N is a constant) into register R3.

• Example 1: LD [R1 + R2], R3 (R3 := Memory[R1 + R2])

The instruction is encoded as follows:

```
             Br Cond
             ------
 LD  ST  BRA Opcode  Im  Cen <-- Dest -> <-- Src1 -> <-- Src2 ->
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
```

• Example 2: LD [R1 + 2], R3 (R3 := Memory[R1 + 2])

The instruction is encoded as follows:

```
             Br Cond
             ------
 LD  ST  BRA Opcode  Im  Cen <-- Dest -> <-- Src1 -> <-- Src2 ->
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
```
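The two encodings above can be checked by splitting the 16-bit word into its fields. The following is a minimal sketch, assuming the field widths read off the diagram (LD, ST, BRA, Im and Cen are 1 bit each, Opcode is 2 bits, and Dest, Src1, Src2 are 3 bits each, with the leftmost box being bit 15). The names `Fields` and `decode` are made up for this illustration.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    ld: int
    st: int
    bra: int
    opcode: int
    im: int
    cen: int
    dest: int
    src1: int
    src2: int


def decode(word: int) -> Fields:
    """Split a 16-bit instruction word into fields (leftmost box = bit 15)."""
    return Fields(
        ld=(word >> 15) & 1,
        st=(word >> 14) & 1,
        bra=(word >> 13) & 1,
        opcode=(word >> 11) & 0b11,   # assumed 2 bits wide
        im=(word >> 10) & 1,          # 1 = Src2 field holds a constant
        cen=(word >> 9) & 1,
        dest=(word >> 6) & 0b111,
        src1=(word >> 3) & 0b111,
        src2=word & 0b111,
    )


ex1 = decode(0b1000001011001010)  # LD [R1 + R2], R3
ex2 = decode(0b1000011011001010)  # LD [R1 + 2], R3
print(ex1)
print(ex2)
```

In both examples Dest decodes to 3 and Src1 to 1. The only difference is the Im bit: with Im = 0 the Src2 field (2) names register R2, and with Im = 1 the same field is the constant 2.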

• How is the load instruction executed?

• Example:

 ``` ld [R1 + R2], R3 (Load data from memory addr R1+R2 into reg R3) ```

Execution:

• Fetch the operands:

 Fetch the value in R1
 Fetch the value in R2

• Access the memory location:

 Send R1 + R2 on the address bus (and give the MEMREQ and READ signals to memory)
 Read the data on the data bus (stored in LMDR)

• Write the fetched value to the register:

 Update R3 with the value in LMDR
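Ignoring the pipeline for a moment, the three actions above can be sketched as one function. This is only a functional model, not the hardware: the register file and memory are plain dictionaries, the name `execute_load` is made up, and the register and memory values are illustrative.

```python
def execute_load(regs: dict, memory: dict, src1: int, src2: int, dest: int) -> int:
    """Functional model of LD [Rsrc1 + Rsrc2], Rdest."""
    # Fetch the operands
    a = regs[src1]
    b = regs[src2]
    # Access the memory location: the word read is held in LMDR
    lmdr = memory[a + b]
    # Write the fetched value to the destination register
    regs[dest] = lmdr
    return lmdr


regs = {1: 100, 2: 4, 3: 0}   # illustrative values for R1, R2, R3
memory = {104: 1234}          # illustrative content at address R1 + R2
execute_load(regs, memory, src1=1, src2=2, dest=3)
print(regs[3])
```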

• Step 1: Fetch

• At the start of the CPU cycle, the IF stage sends out the PC
• At the end of the CPU cycle, the IR(ID) register is updated with the instruction fetched (LD [R1+R2], R3)

• The picture above depicts the content of the CPU at the end of the first CPU cycle (and the start of the 2nd cycle)

• Step 2: Decode (fetch operands)

• At the start of the CPU cycle, the ID stage sends out selection signals that select the values of R1 (on the A bus), R2 (on the B bus) and R3 (on the D bus).
• At the end of the CPU cycle, the A register is updated with R1, the B register is updated with R2 and the D register is updated with R3 (but R3 will not be used).
• Also, at the end of the CPU cycle, the instruction (LD [R1+R2], R3) is moved into IR(EX) and a new instruction is fetched into IR(ID)

• The picture above depicts the content of the CPU at the end of the second CPU cycle (and the start of the 3rd cycle)

• Step 3: Execute (compute)

• At the start of the CPU cycle, the EX stage selects the values from R1 and R2 (held in the A and B registers) as inputs to the ALU and uses the ALU opcode to make the ALU add them
• At the end of the CPU cycle, the ALUo and DMAR registers are updated with the value R1+R2 and SMDR is updated with the value R3 (but the value R3 will not be used and will be discarded - see next step)
• Also, at the end of the CPU cycle, the instruction (LD [R1+R2], R3) is moved into IR(MEM) and a new instruction is fetched into IR(ID)

• The picture above depicts the content of the CPU at the end of the 3rd CPU cycle (and the start of the 4th cycle)
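What the EX stage latches at the end of this cycle can be sketched as follows. This is a minimal model that assumes the A, B and D registers already hold the values of R1, R2 and R3. The names `ExLatches` and `ex_stage_load` and the input values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ExLatches:
    aluo: int   # ALU output register
    dmar: int   # data memory address register
    smdr: int   # store data register (not used by a load)


def ex_stage_load(a: int, b: int, d: int) -> ExLatches:
    """EX stage for a load: the ALU adds A and B; D is passed along in SMDR."""
    address = a + b
    return ExLatches(aluo=address, dmar=address, smdr=d)


latched = ex_stage_load(a=100, b=4, d=0)   # illustrative register values
print(latched)
```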

• Step 4: Memory Access or Branch

• The MEM stage uses the system bus to access the memory to perform the load operation
• The IF stage also wants to use the system bus to fetch the next instruction
• The MEM and IF stage cannot share the system bus.
• We must assign a priority for its use: give higher priority to the later pipeline stage (in this case: the MEM stage)

• Conclusion: the IF stage must not operate during step 4.
• We call a stage that is not operating: stalled

• It's simple to implement stalling: a stage makes a move when it updates its memory elements. So you can make a stage "stall" by not allowing the clock signal to reach its memory elements, using an AND gate to "block" the clock:

• When the Stall input is ZERO, the clock will be smothered :-)
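The clock-blocking idea can be sketched in a few lines. Since a ZERO on the Stall input blocks the clock, the input behaves as active-low, so it is called `stall_n` here. That name, `gated_clock` and the `Register` class are assumptions made for this sketch.

```python
def gated_clock(clock: int, stall_n: int) -> int:
    """AND gate: the clock only gets through when stall_n is 1."""
    return clock & stall_n


class Register:
    """A memory element that updates only when it sees a clock pulse."""

    def __init__(self, value: int = 0):
        self.value = value

    def tick(self, clock: int, stall_n: int, new_value: int) -> None:
        if gated_clock(clock, stall_n):
            self.value = new_value


r = Register(5)
r.tick(clock=1, stall_n=0, new_value=9)   # stalled: clock is blocked
print(r.value)
r.tick(clock=1, stall_n=1, new_value=9)   # not stalled: register updates
print(r.value)
```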

The stall signal is actually derived using a combination of a number of signals - we must find all possible conflict situations and form the stall signal for each conflicting situation. We will learn about other conflicting situations soon....

In this example, you see the first situation where you need to stall:

 Stall the IF stage when the instruction in the MEM stage is a LOAD instruction (this fact can be recognized from the LD bit in the instruction)
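This first stall condition can be sketched as a test on the LD bit of the instruction held in IR(MEM). The sketch assumes the LD bit is the leftmost bit (bit 15) of the 16-bit instruction word, as in the encoding diagrams. The function names are made up, and the non-load word used below is simply the load encoding with its LD bit cleared.

```python
LD_BIT = 15  # assumption: LD flag is the leftmost bit of the 16-bit word


def if_must_stall(ir_mem: int) -> bool:
    """True when the instruction in the MEM stage is a LOAD."""
    return bool((ir_mem >> LD_BIT) & 1)


def stall_input(ir_mem: int) -> int:
    """Value for the clock gate: ZERO blocks the IF stage's clock."""
    return 0 if if_must_stall(ir_mem) else 1


load_word = 0b1000001011001010    # LD [R1 + R2], R3
other_word = 0b0000001011001010   # same word with the LD bit cleared
print(if_must_stall(load_word), stall_input(load_word))
print(if_must_stall(other_word), stall_input(other_word))
```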

• Also: the LOAD instruction in the MEM stage may or may not have any effect on the execution of the instructions in the EX stage.

For now, I will assume that the LOAD instruction does not affect the outcome of the instruction in the EX stage - i.e., in this example, I do not require that the EX stage be stalled.

Later, you will see that you may also need to stall the EX stage when the MEM stage contains a LOAD instruction....

• There is one more caveat that needs to be taken care of:

• The ID stage and EX stage can (and will) keep moving...
• Normally, the next instruction (from memory) is fetched into IR(ID)...
• Now that the IF stage is stalled (not working), there is no "next" instruction...
• Yet, we must put some instruction in IR(ID)...
• Solution: insert a "harmless" instruction NOP.

• This "NOP" instruction insertion can be done using a multiplexor in the IF stage:

• The multiplexor has 2 inputs: data bus (instruction from memory) and the instruction code for "NOP"
• When the IF stage is not stalled, the MUX selects the data bus input
• When the IF stage is stalled, the MUX selects the NOP input
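The multiplexor's selection rule can be sketched as a single function. The notes do not give the bit pattern of the NOP instruction, so the value of `NOP` below is a placeholder, and the name `if_mux` is made up.

```python
NOP = 0x0000  # placeholder: the real NOP encoding is not given in the notes


def if_mux(data_bus: int, if_stalled: bool) -> int:
    """Choose what gets clocked into IR(ID) at the end of the cycle."""
    return NOP if if_stalled else data_bus


print(hex(if_mux(0x82CA, if_stalled=False)))  # instruction from memory
print(hex(if_mux(0x82CA, if_stalled=True)))   # NOP inserted
```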

• At the end of the CPU cycle, the ALUo1 register is updated with the value R1+R2, LMDR is updated with the word read from memory, and the value in the SMDR register (R3) is discarded (SMDR is only used in a STORE instruction)
• Also, at the end of the CPU cycle, the instruction (LD [R1+R2], R3) is moved into IR(WB) and the NOP instruction is loaded into IR(ID)

• The picture above depicts the content of the CPU at the end of the 4th CPU cycle (and the start of the 5th cycle)

• Step 5: Update register

• At the start of the CPU cycle, the destination register (R3) is selected by the Dest field of the IR(WB) and the value in LMDR (= 1234 from memory) is selected for output on the C-bus (because the instruction is a LOAD !!!)

• At the end of the CPU cycle, register R3 is updated with the value 1234
• Also, at the end of the CPU cycle, the instruction (LD [R1+R2], R3) is discarded and a new instruction is fetched into IR(ID)
• Notice that the NOP instruction will proceed through the pipe. It does no harm; its only effect is that the CPU "runs less fast"...

• The picture above depicts the content of the CPU at the end of the 5th CPU cycle (and the start of the 6th cycle)
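The movement of the load through the instruction registers over the five cycles can be traced with a small simulation. This is a sketch under simplifying assumptions: instructions are plain strings, the instructions following the load (`I2`, `I3`, ...) are hypothetical and do not use memory, and the only stall rule modelled is "stall IF when IR(MEM) holds a LOAD".

```python
def is_load(instr: str) -> bool:
    return instr.startswith("LD")


def run(program: list, cycles: int) -> list:
    """Return the contents of IR(ID), IR(EX), IR(MEM), IR(WB) after each cycle."""
    ir = {"ID": "-", "EX": "-", "MEM": "-", "WB": "-"}
    pc = 0
    trace = []
    for cycle in range(1, cycles + 1):
        stall_if = is_load(ir["MEM"])        # MEM stage owns the system bus
        if stall_if or pc >= len(program):
            fetched = "NOP"                  # the MUX inserts a NOP
        else:
            fetched = program[pc]
            pc += 1
        # At the end of the cycle every instruction register moves one stage on
        ir = {"ID": fetched, "EX": ir["ID"], "MEM": ir["EX"], "WB": ir["MEM"]}
        trace.append((cycle, ir))
    return trace


program = ["LD [R1+R2], R3", "I2", "I3", "I4", "I5"]
trace = run(program, 5)
for cycle, ir in trace:
    print(cycle, ir)
```

At the end of cycle 4 the load sits in IR(WB) and IR(ID) holds the NOP. At the end of cycle 5 the load has left the pipeline and the IF stage has resumed fetching.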

Demo: Aaron/2-ld-instr

 Executes: ld [R2+R3], R1
 R2+R3 = 100000
 Memory content at 100000 is 1111111111111110