6. Execution unit

Phases of instruction execution

To execute an instruction, the processor must:
read (fetch) it from memory hierarchy
decode it
read the source arguments if needed
perform an arithmetic/logic/other operation
write the result (if needed)

These phases may be mapped into:

Single-cycle implementation of RISC-V

Sequential circuit which changes its state once during the instruction execution
The instruction is executed in a combinatorial circuit
Assumptions for model processor:

Instruction fetch

PC value used as program memory address
PC value incremented by 4 gives nextPC
current PC value used by branches
Instruction word available at the output of instruction memory
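
The fetch step above can be sketched in a few lines. This is only an illustration: the function name `fetch` and the representation of instruction memory as a Python list of 32-bit words are assumptions made for the example, not part of the model processor.

```python
def fetch(pc, instruction_memory):
    """Return (instruction_word, next_pc).

    Assumption: instruction_memory is a list of 32-bit words,
    so the byte address in PC is divided by 4 to index it.
    """
    word = instruction_memory[pc // 4]
    next_pc = (pc + 4) & 0xFFFFFFFF  # nextPC = PC + 4, wrapped to 32 bits
    return word, next_pc


# Two RV32I words: add a0, a0, a1 and addi x0, x0, 0 (nop)
imem = [0x00B50533, 0x00000013]
word, next_pc = fetch(0, imem)
```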

Single-cycle processor control unit

Control unit

Parts of instruction word are supplied to the inputs of a control unit
Control unit produces signals used to control all other blocks of processor
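
A rough sketch of such a control unit, assuming the standard RV32I opcode values. The signal names (`reg_write`, `alu_src_imm`, `mem_read`, `mem_write`, `branch`) are hypothetical and chosen only for this example; the actual signal set of the model processor may differ.

```python
# Standard RV32I major opcodes (bits 6..0 of the instruction word)
OPCODE_R_TYPE = 0x33
OPCODE_I_ALU = 0x13
OPCODE_LOAD = 0x03
OPCODE_STORE = 0x23
OPCODE_BRANCH = 0x63


def control_unit(instruction):
    """Derive control signals from fields of the instruction word."""
    opcode = instruction & 0x7F
    funct3 = (instruction >> 12) & 0x7
    funct7 = (instruction >> 25) & 0x7F
    return {
        # hypothetical signal names, one per controlled block
        "reg_write": opcode in (OPCODE_R_TYPE, OPCODE_I_ALU, OPCODE_LOAD),
        "alu_src_imm": opcode in (OPCODE_I_ALU, OPCODE_LOAD, OPCODE_STORE),
        "mem_read": opcode == OPCODE_LOAD,
        "mem_write": opcode == OPCODE_STORE,
        "branch": opcode == OPCODE_BRANCH,
        # fields from which the ALU operation would be selected
        "funct3": funct3,
        "funct7": funct7,
    }


signals = control_unit(0x00B50533)  # add a0, a0, a1
```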

Instruction formats

University/WUT/ECOAR/pictures/Pasted image 20250413124641.png

Argument fetch

Register source arguments

  1. Rs
  2. Rt (in all R-format instructions and in some I-format instructions)

Immediate argument (I format) - obtained by extending the 16-bit constant field to 32 bits in the data extender Ext
  arithmetic instructions, branches, memory references
  logical instructions
The second argument, selected using Mux_ALU, depends on the instruction
Destination register number (if used) comes from the Rt or Rd field
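
The extension of a short constant field to 32 bits can be sketched as below. The field width is a parameter, since the notes mention a 16-bit field while the RV32I I-format carries a 12-bit immediate. Both sign and zero extension are shown as possibilities; which one Ext applies to which instruction group is not spelled out here, so treat the pairing as an open assumption.

```python
def sign_extend(value, bits):
    """Extend a `bits`-wide two's-complement field to 32 bits."""
    value &= (1 << bits) - 1          # keep only the field
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & 0xFFFFFFFF


def zero_extend(value, bits):
    """Extend a `bits`-wide field to 32 bits by padding with zeros."""
    return value & ((1 << bits) - 1)


negative_one = sign_extend(0xFFFF, 16)   # all 32 bits set
small_mask = zero_extend(0xFFFF, 16)     # upper 16 bits clear
```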
Schema

University/WUT/ECOAR/pictures/Pasted image 20250413124833.png

ALU, branch

ALU - Arithmetic Logic Unit - performs the operation according to currently executing instruction
Adds the base register and displacement to get memory address
During branches, performs compare operation to obtain the logical value of relation - output on COND
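
A minimal ALU sketch along these lines, assuming 32-bit operands. The string operation names stand in for the binary operation code, and the set of operations and relations is an assumption for illustration only.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b):
    """Return (result, cond) for 32-bit unsigned operands a and b.

    Arithmetic/logic operations produce a result; compare
    operations produce the logical value on cond.
    """
    arithmetic = {
        "add": (a + b) & MASK32,   # also used for base + displacement
        "sub": (a - b) & MASK32,
        "and": a & b,
        "or": a | b,
        "xor": a ^ b,
    }
    relations = {
        "eq": a == b,
        "ne": a != b,
        "ltu": a < b,              # unsigned less-than
    }
    if op in arithmetic:
        return arithmetic[op], False
    if op in relations:
        return 0, relations[op]
    raise ValueError(f"unknown ALU operation: {op}")


address, _ = alu("add", 0x1000, 8)   # memory address from base and displacement
_, cond = alu("eq", 5, 5)            # branch condition
```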

Outputs 4-bit ALU operation code and 5 control signals

ALU argument selections

First source argument is always rs1 register content
Second source argument may be rs2 or immediate
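
The operand selection can be pictured as a two-input multiplexer in front of the ALU. The select signal name `alu_src_imm` is hypothetical.

```python
def mux_alu(alu_src_imm, rs2_value, immediate):
    """Two-input multiplexer: pick the second ALU operand."""
    return immediate if alu_src_imm else rs2_value


def alu_operands(rs1_value, rs2_value, immediate, alu_src_imm):
    """The first operand is always rs1; the second comes from the mux."""
    return rs1_value, mux_alu(alu_src_imm, rs2_value, immediate)


register_form = alu_operands(10, 20, 4, alu_src_imm=False)
immediate_form = alu_operands(10, 20, 4, alu_src_imm=True)
```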

ALU, data memory, branching

University/WUT/ECOAR/pictures/Pasted image 20250413123229.png

Data memory

Data memory module is idle during all instructions other than loads and stores
ALU output is used as memory address

Access

During stores, Rt register content is stored to memory
During loads, data from memory is stored to Rt
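
A sketch of the data memory behaviour described above. The dictionary-based memory and the `mem_read`/`mem_write` signal names are assumptions made for the example.

```python
def data_memory_access(memory, address, mem_read, mem_write, store_value=0):
    """Model one access to data memory.

    memory: dict mapping address -> 32-bit word (assumed representation)
    address: the ALU output
    Returns the loaded word for loads, None otherwise.
    """
    if mem_write:                       # store: register content goes to memory
        memory[address] = store_value & 0xFFFFFFFF
        return None
    if mem_read:                        # load: memory word goes to the register
        return memory.get(address, 0)
    return None                         # idle for all other instructions


dmem = {}
data_memory_access(dmem, 0x1008, mem_read=False, mem_write=True, store_value=42)
loaded = data_memory_access(dmem, 0x1008, mem_read=True, mem_write=False)
```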

Control path

The control path (at the top of the schematic) generates the PC value for the next instruction
PC adder produces the possible branch target address by adding the current PC value and branch displacement
Branch multiplexer selects between nextPC and branch target
branch target is used when current instruction is conditional branch and the condition is true
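
The next-PC selection can be sketched as follows. The function and parameter names are illustrative; the displacement is assumed to be already sign-extended and expressed in bytes.

```python
MASK32 = 0xFFFFFFFF


def next_pc_value(pc, branch_displacement, is_branch, cond):
    """Select the PC for the next instruction."""
    next_pc = (pc + 4) & MASK32                          # sequential path
    branch_target = (pc + branch_displacement) & MASK32  # PC adder output
    # Branch multiplexer: the target is taken only for a conditional
    # branch whose condition evaluated to true.
    return branch_target if (is_branch and cond) else next_pc


taken = next_pc_value(0x100, -16, is_branch=True, cond=True)
not_taken = next_pc_value(0x100, -16, is_branch=True, cond=False)
```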

Finalizing the instruction

Three possible results

RV32I instruction formats

University/WUT/ECOAR/pictures/Pasted image 20250413123814.png

Presented model limitations

For basic HLL support, JALR (I-format) is missing
JALR may be used for procedure calls and returns
Other application instructions missing (U- and J- format)

Single-cycle design limitations

Harvard architecture - instruction memory is read-only
Fixed order of actions within instructions

Problems

Separate memories - high cost, large size, lack of programmability
Multiple expensive units - three adders (PC, ALU, branch adder)
Low efficiency
Solutions:


Multicycle processor

Minimizes the number of functional blocks by using the same block multiple times during the execution of one instruction
Requires breaking the execution into several phases (cycles)
in every phase every block is used for a different task
the number of phases ranges from 2 to 10
Execution is controlled by a complex control unit implemented as a synchronous automaton
Number of multiplexers is increased due to the increased number of data paths and complexity of data flow

Unified program and data memory

Analysis

Execution of a single instruction requires several operations
General sequence of execution phases:

  1. Instruction fetch
  2. Instruction decoding
  3. Argument read
  4. Execution (ALU result)
  5. Result write

Multicycle processors usually implement the CISC model
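
The general sequence of phases can be modelled as a small automaton that steps from one phase to the next and wraps around after the result write. This is a simplified sketch: a real control unit may skip or repeat phases depending on the instruction, and the names below are chosen for the example.

```python
from enum import Enum


class Phase(Enum):
    FETCH = 1
    DECODE = 2
    ARGUMENT_READ = 3
    EXECUTE = 4
    RESULT_WRITE = 5


def next_phase(phase):
    """Advance the automaton; after the last phase a new fetch begins."""
    order = list(Phase)
    return order[(order.index(phase) + 1) % len(order)]


def run_phases(instruction_count):
    """Return the sequence of phase names for a number of instructions."""
    trace = []
    phase = Phase.FETCH
    for _ in range(instruction_count * len(Phase)):
        trace.append(phase.name)
        phase = next_phase(phase)
    return trace


trace = run_phases(2)
```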

Structure

Memory is placed outside the processor and is connected to it via a so-called bus
The processor is interfaced to memory via the bus interface unit placed inside the processor structure
Processor functional units:

Operation

During various phases of execution, only some of the functional blocks are active
Instruction fetch - bus
Instruction decoding - control unit
Argument read - register set or bus
Execution - ALU
Result store - register set or bus
Most units remain idle most of the time
Phases of instruction cycle are called machine cycles
A single machine cycle may require multiple clock cycles

Optimization

Number of machine cycles depends on internal data transfer capabilities of the processor
Number of machine cycles and clock cycles may be reduced by introducing extra data buses inside the processor

Instruction fetch optimization

We can improve overall performance by fetching the next instruction while still executing the current one

Instruction prefetch

Implementation - new elements needed, placed in the bus interface unit

Problems

In CISC, the instructions vary in length
After a branch, the scanPC value differs from nextPC, as it contains the word fetched from the address following the branch

Instruction queue

Solution to the problem caused by variable-length CISC instructions
FIFO buffer instead of a single prefetch
After every taken branch the queue contents must be invalidated and nextPC must be copied to scanPC
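
A toy model of such a queue is sketched below. For brevity it indexes memory by word rather than by byte and ignores instruction lengths; the class name, the capacity, and the method names are all assumptions made for the example.

```python
from collections import deque


class InstructionQueue:
    """FIFO prefetch buffer filled from memory at scan_pc."""

    def __init__(self, memory, capacity=4):
        self.memory = memory      # list of words (assumed representation)
        self.capacity = capacity  # hypothetical queue depth
        self.queue = deque()
        self.scan_pc = 0          # address of the next word to prefetch

    def prefetch(self):
        """Fetch one word ahead if there is room in the queue."""
        if len(self.queue) < self.capacity and self.scan_pc < len(self.memory):
            self.queue.append(self.memory[self.scan_pc])
            self.scan_pc += 1

    def next_word(self):
        """Hand the oldest prefetched word to the decoder, if any."""
        return self.queue.popleft() if self.queue else None

    def taken_branch(self, next_pc):
        """Invalidate the queue and restart prefetching at the target."""
        self.queue.clear()
        self.scan_pc = next_pc


iq = InstructionQueue(list(range(100, 110)))
for _ in range(3):
    iq.prefetch()
first = iq.next_word()
iq.taken_branch(7)
```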

Branch penalty problem

As the fetch phase cannot be skipped following a taken branch, the instruction at branch target executes slower than other instructions
This difference in execution time is called branch penalty
Statistically, branches constitute 7-14% of all instructions, making this a serious cause of efficiency reduction
University/WUT/ECOAR/pictures/Pasted image 20250413131811.png
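
The effect of the branch penalty on average execution time can be estimated with a simple weighted sum. The input values in the usage line (base cycle count, fraction of branches that are taken, penalty length) are hypothetical numbers chosen for illustration; only the 14% branch share comes from the range quoted above.

```python
def average_cycles_per_instruction(base_cycles, branch_fraction,
                                   taken_fraction, penalty_cycles):
    """Average cycles per instruction including the branch penalty.

    Only taken branches add the penalty, so the extra cost is
    branch_fraction * taken_fraction * penalty_cycles.
    """
    return base_cycles + branch_fraction * taken_fraction * penalty_cycles


# Assumed inputs: 4 cycles per instruction, 14% branches,
# half of them taken, 1 extra cycle per taken branch.
average = average_cycles_per_instruction(4, 0.14, 0.5, 1)
```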


Decomposition of single-cycle processor

Single-cycle to pipeline

D-type registers are placed on cut lines
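
The role of the registers on the cut lines can be illustrated with a very reduced model: on every clock edge each register latches what the preceding stage produced, so an instruction advances by one stage per cycle. The number of registers and the use of plain labels instead of real stage outputs are simplifications assumed for this sketch.

```python
def clock_edge(pipeline_registers, fetched):
    """Latch new values on a clock edge.

    The first register takes the newly fetched instruction; every
    other register takes the value held by the register before it.
    """
    return [fetched] + pipeline_registers[:-1]


registers = [None, None, None]  # three cut lines (assumed count)
for instruction in ["i1", "i2", "i3", "i4"]:
    registers = clock_edge(registers, instruction)
```

After the fourth clock edge in this sketch the three registers hold three different instructions at once, which is the point of the decomposition.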
Instruction execution

Example pipeline structure stages