7. Pipeline execution unit
5 pipeline stages
IF, RD, ALU, MEM, WB
IF and WB need only 1/2 of clock cycle - whole instruction takes 4 cycles
Harvard-Princeton architecture
Structure and operation

Every stage excluding WB has a D-type register at its output
PC and g.p. registers are updated in a half of the clock cycle
IF occurs in the second half of the cycle
WB works in the first half
Source arguments are ready in RD in the second half
RAW Hazard
Consider the following sequence:
add x4, x3, x2
add x6, x5, x4
Problem - what value of x4 will be read by the second instruction?
This problem is called Read-After-Write Hazard
It is necessary to introduce some mechanisms to make the processor's behavior deterministic
Solution 1 - administrative method
The result of above code example is described in documentation as 'undefined'
The programmer is not allowed to use such instruction sequence
Programmer must write many empty instructions in program
add x4, x5, x6
nop
nop
add x6, x5, x4
Solution 2 - pipeline slipping
Hazard detection - combinatorial circuit in RD stage compares source register numbers against destination registers in ALU and MEM stage
Slip - if at least one pair of numbers is matched, the instruction is suspended in RD stage
IF and RD stop
the remaining stages work normally, RD injects empty instruction into ALU
Program executes correctly without NOPs in binary code
technically they are generated internally by the processor
Instruction dependencies cause delays
Solution 3 - bypasses
The result of ALU operation is ready while the instruction is in ALU stage
value is generated while the next instruction is in RD
Bypasses are data buses leading from ALU and MEM to RD
they contain dest register numbers and result values
Read logic in RD
source register numbers are compared against number present on bypasses
Priorities - ALU bypass, MEM bypass, physical register
No need to provide bypass from WB
Bypasses remove RAW hazard without delays

Load-use penalty
Assume the processor equipped with bypasses
Consider following sequence
lw x4, ...
add x6, x5, x4
This time, the RAW hazard results from memory load instruction
Data read from memory is not available until MEM stage
bypass from ALU stage will NOT contain proper data
The problem is called Load-use penalty
This hazard cannot be eliminated without delays
Branch conditions in pipeline
The branch condition and branch target are evaluated in ALU stage during first half of a cycle
At that time, RD already contains the next instruction
Branch instruction may influence the fetch of the second instruction after the branch
the instruction fetched after the branch may be turned into NOP but it will consume time
Branch penalty in pipeline results from non-zero distance between ALU and IF
Reduction of branch penalty
Delayed branch - 'execute the next instruction following the branch and then branch'
Any instruction originally placed before the branch which does not influence the outcome of the branch may be moved to the place after the branch instruction
The place of instruction after the branch which will be executed regardless of the outcome of the branch is called delay slot
When the delay slot size is one instruction, the probability of filling it with some useful instruction is about 90%
Pipeline efficiency
Theoretical efficiency -one cycle per instruction
Delay sources:
- Internal - hazards removed in other ways than bypasses, memory loads, branches
- External to the pipeline - memory hierarchy accesses
Practical efficiency - ~1.2 cycles per instruction
Seeding the pipeline
When the frequency increases, the memory cannot complete access during single cycle
The complexity of some pipeline stages makes it impossible to increase the clock frequency
The solution - redesign pipeline and increase number of stages while reducing the stages' complexity
Long pipeline (over 6 stages) is called superpipeline
Superpipeline

MIPS R4000
8 stages:
- IF - Instruction First - start of instruction fetch
- IS - Instruction Second - completion of instruction fetch
- RD - Read
- EX - Execute - ALU
- DF - Data First - start of data memory access
- DS - Data Second - second phase of memory access
- DTC- Data Tag Check - completion of data access
- WB - Write Back
Memory interface
Memory access divided into two phases
Memory is pipelined
access occurs in two clock cycles, which is naturally compatible with memory internal structure
two accesses are performed simultaneously
Synchronization and delays
Topology of the superpipeline is identical to pipeline
The increased number of stages results in greater distances between stages and bigger delays
More bypasses are needed
Load/use penalty and branch penalty are significantly bigger than in a short pipeline
Load-use penalty
With bypasses, the delay is equal to 3 cycles
not that critical
Memory references
reloading many registers in subroutine's prologue and epilogue
registers are accessed several instructions after load, masking the delay
references to data structures in memory - data frequently needed immediately after load - may cause significant delays
Branch penalty
Delay slot size is bigger than in short pipelines (2 instead of 1)
Probability of filling the delay slot with two useful instructions is low ~20%
The branch delayed by two cycles is not a good solution
also incompatible with earlier implementations
In superpipelined processors without pipelined ancestors, delayed branches are not used
The proper method of reducing the branch penalty in there architectures is branch prediction
Efficiency
Bigger and more frequent delays cause higher CPI ~1.5
But it is partially compensated by higher clock frequency
Net gain is around 50% in stages 5-8
Pipelined implementation of CISC
Due to incompatibility of CISC and pipelined architectures, there is no 'clean' implementation of CISC pipeline
Pipeline requirements not met by CISC
- Sequence of operations for all instructions is fixed
- Instructions have fixed length ! and simple formats
- Every instruction causes at most one data memory reference
- The instruction has at most one destination argument
Possible implementation
Enhanced, complex pipeline capable of executing CISC instructions
Processor divided into two parts:
- Unit fetching CISC instructions and converting them to sequences of RISC primitives
- Pipelined RISC execution unit
Problems
This implementation makes the beginning of a pipeline take multiple clock cycles (fetching, decoding, etc.)
Efficiency - CPI ~2
Frequent stalls and delays

Transcoder
Fetches CISC instructions and translates them into sequences of RISC-like instructions
For simple CISC instructions, 1 to 1 translation may be possible
Slightly more complex instructions converted to 2-4 RISC operations
The most complex instructions are implemented as calls to RISC routines placed in ROM inside the processor