9. Caches

Basics

Caches are a layer of memory hierarchy between registers and memory
They are not visible in application programming model
Buffer the references to main memory
Necessary in every general-purpose computer due to the growing performance gap between processor and memory
University/WUT/ECOAR/pictures/Pasted image 20250413153453.png

Cache operation

During every memory reference, the cache is checked for availability of data from a given address
Data not found - cache miss
data is fetched from memory and passed to processor
the data is also placed in the cache, together with its address
if cache is full - some item must be removed to make space for new data
during the next reference to the same address, the data will already be in the cache
Data found in cache - cache hit
data is fetched from the cache, no need to access memory
cache access is significantly faster than memory access
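A minimal sketch of this flow, assuming a toy cache of 4 whole-address entries and a stand-in memory routine (all names and sizes here are illustrative, not part of the lecture):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NLINES 4                        /* illustrative cache capacity */

struct line { bool valid; uint32_t addr; uint32_t data; };
static struct line cache[NLINES];

/* Stand-in for a slow main-memory read. */
static uint32_t memory_read(uint32_t addr) { return addr ^ 0xA5A5A5A5u; }

static uint32_t read_reference(uint32_t addr, bool *hit)
{
    for (int i = 0; i < NLINES; i++)
        if (cache[i].valid && cache[i].addr == addr) {   /* cache hit           */
            *hit = true;
            return cache[i].data;                        /* no memory access    */
        }

    *hit = false;                                        /* cache miss          */
    uint32_t data = memory_read(addr);                   /* fetch from memory   */
    int victim = -1;
    for (int i = 0; i < NLINES; i++)
        if (!cache[i].valid) { victim = i; break; }      /* use a free line if any   */
    if (victim < 0)
        victim = rand() % NLINES;                        /* cache full: evict an item */
    cache[victim] = (struct line){ true, addr, data };   /* store data with its address */
    return data;                                         /* pass to the processor     */
}

int main(void)
{
    uint32_t trace[] = { 0x100, 0x104, 0x100, 0x200, 0x104 };
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        bool hit;
        read_reference(trace[i], &hit);
        printf("0x%03X -> %s\n", (unsigned)trace[i], hit ? "hit" : "miss");
    }
    return 0;
}
```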

Locality of references

During a limited period of time, the processor references only a small range of memory addresses.
The plot shows the probability distribution of accesses to addresses over time
University/WUT/ECOAR/pictures/Pasted image 20250413153847.png|400x400
The range of references is limited
the set of addresses accessed by the processor in a limited period of time - working set
a relatively small buffer is capable of storing the content of a working set
The references are performed repetitively to the same addresses
References to consecutive addresses are very likely
good idea - while filling the buffer, fetch content of several consecutive addresses in advance
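A trivial illustration of both effects: in the loop below the same few instructions are fetched over and over, and the data is read from consecutive addresses, so fetching several neighbouring elements in advance pays off (function name and types are just an example):

```c
#include <stddef.h>

/* Illustration of locality of references (not tied to any particular cache). */
long sum(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)   /* code: the same instructions are executed repeatedly      */
        s += a[i];                   /* data: consecutive addresses, one line holds several a[i] */
    return s;
}
```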

Fully associative cache

Most intuitive cache
Ineffective, difficult to implement - low capacity
Built using associative memory (CAM)
no addressing needed for access
data is accessed based on association (matching) of data
memory response is data matched with input pattern
similar to a phone book

Characteristics

Every cell may store data from any address
all the cells must be searched during every access
Selection of victims (lines to replace upon miss) - LRU or random
Every CAM cell has its own tag comparator
With LRU algorithm, usually as soon as the size of working set exceeds cache capacity, all references result in misses
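A small sketch of that LRU effect, assuming a 4-line fully associative cache swept repeatedly over a working set: with 4 addresses every pass after the first hits, with 5 addresses every reference misses (sizes and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NLINES 4                                     /* illustrative capacity */

struct line { bool valid; uint32_t addr; unsigned age; };
static struct line c[NLINES];

/* One associative access: every line is examined; on a miss the
 * least recently used (oldest) line is the victim. */
static bool access_lru(uint32_t addr)
{
    int hit = -1, victim = 0;
    for (int i = 0; i < NLINES; i++) {
        if (c[i].valid && c[i].addr == addr) hit = i;
        if (!c[i].valid)                                      victim = i;  /* free line wins      */
        else if (c[victim].valid && c[i].age > c[victim].age) victim = i;  /* else the oldest one */
        c[i].age++;                                                        /* everyone gets older */
    }
    if (hit >= 0) { c[hit].age = 0; return true; }            /* refresh the hit line */
    c[victim] = (struct line){ true, addr, 0 };               /* replace the victim   */
    return false;
}

static void sweep(const uint32_t *ws, int n, int passes)
{
    memset(c, 0, sizeof c);                                   /* start with an empty cache */
    int hits = 0, total = 0;
    for (int p = 0; p < passes; p++)
        for (int i = 0; i < n; i++) { hits += access_lru(ws[i]); total++; }
    printf("working set of %d lines: %d/%d hits\n", n, hits, total);
}

int main(void)
{
    uint32_t ws[] = { 0x000, 0x040, 0x080, 0x0C0, 0x100 };
    sweep(ws, 4, 3);    /* fits: 8/12 hits (misses only in the first pass) */
    sweep(ws, 5, 3);    /* exceeds capacity by one: 0/15 hits with LRU     */
    return 0;
}
```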

Cache design details

Data is stored in big units instead of bytes or words
units are called lines
usually 4× the size of a memory word (32 or 64 bytes)
size-aligned
The LSBs of the address are used to select a part of the line
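For example, with an assumed 64-byte line the 6 least significant address bits select the byte within the line, and the remaining bits identify the line:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u                            /* assumed line size (power of two) */

int main(void)
{
    uint32_t addr   = 0x12345678u;               /* arbitrary example address    */
    uint32_t offset = addr & (LINE_SIZE - 1);    /* 6 LSBs: byte within the line */
    uint32_t line   = addr / LINE_SIZE;          /* remaining bits: line number  */
    printf("addr = 0x%08X  line = 0x%X  offset = %u\n",
           (unsigned)addr, (unsigned)line, (unsigned)offset);
    return 0;
}
```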

Direct mapped cache

Built using ordinary fast RAM and a single equality comparator
fast and very simple to implement
high capacity thanks to low complexity
Simple but non-intuitive operation
LSB of address select a part of line
'middle', less significant part serves as the address of cache RAM
Each line contains address tag and data
Address tag contains the most significant part of address of data

Operation

During every access cycle one line is selected based on middle address bits
Cache hit - address tag matches the most significant bits of address sent by the processor
data is transferred from cache to the processor
Cache miss - selected line is replaced
MSB of address are stored in address tag
data is fetched from memory
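A sketch of this access cycle under assumed parameters (64-byte lines, 256 lines, 32-bit addresses); field widths, names and the memory stub are illustrative only:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE   64u        /* bytes per line -> 6 offset bits     */
#define NUM_LINES   256u       /* lines in the cache -> 8 index bits  */
#define OFFSET_BITS 6
#define INDEX_BITS  8

struct dm_line {
    bool     valid;
    uint32_t tag;              /* most significant part of the address */
    uint8_t  data[LINE_SIZE];
};

static struct dm_line cache[NUM_LINES];

/* Stub standing in for a main-memory line fill. */
static void memory_read_line(uint32_t line_addr, uint8_t *buf)
{
    (void)line_addr;
    memset(buf, 0xAB, LINE_SIZE);
}

static uint8_t read_byte(uint32_t addr)
{
    uint32_t offset =  addr                 & (LINE_SIZE - 1);   /* LSBs        */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);   /* middle bits */
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);       /* MSBs        */

    struct dm_line *l = &cache[index];        /* exactly one candidate line      */
    if (!(l->valid && l->tag == tag)) {       /* miss: replace the selected line */
        memory_read_line(addr & ~(LINE_SIZE - 1), l->data);      /* fetch data   */
        l->tag   = tag;                                          /* store MSBs   */
        l->valid = true;
    }
    return l->data[offset];                   /* data goes to the processor      */
}

int main(void)
{
    read_byte(0x00001000u);                   /* miss: line is filled                 */
    read_byte(0x00001004u);                   /* same line: hit                       */
    read_byte(0x00101000u);                   /* same index, other tag: replaces line */
    printf("done\n");
    return 0;
}
```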

Characteristics

Low cost, high capacity, high efficiency
No choice of replacement algorithm
data from given address may be stored only in one cache line
the cache cannot simultaneously store data from two addresses with identical middle parts (this doesn't happen too often)
In case of a dense working set (program loop, data table), the direct-mapped cache speeds up the references as long as the working set size is smaller than 2× the cache capacity

Set-associative cache

Constructed from several direct-mapped caches (component caches called blocks)
Number of possible locations of data from a given address is equal to number of blocks
during every access cycle a single line is selected from every block
collection of selected lines is called a set
such set behaves like a tiny fully-associative cache
The number of blocks is called the cache associativity
the cache may be described as two-way or four-way
Set-associative cache may also be treated as a collection of many small fully-associative caches

Operation

Design must guarantee that the data from a given address is never stored in more than one cache block
During miss, a line selected from the set is replaced
Characteristics of a set-associative cache are similar to a direct-mapped one, but the problem of storing data from addresses with identical middle parts is solved
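A sketch of a two-way version under similarly assumed parameters (64-byte lines, 128 sets, LRU within the set); the usage example shows two addresses with identical middle parts coexisting:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS        2u         /* associativity = number of blocks */
#define NUM_SETS    128u       /* sets -> 7 index bits             */
#define OFFSET_BITS 6          /* 64-byte lines                    */
#define INDEX_BITS  7

struct sa_line { bool valid; uint32_t tag; bool lru; };   /* data omitted for brevity */
static struct sa_line sets[NUM_SETS][WAYS];

/* One access: the index selects a set, both tags are compared, and on a
 * miss the least recently used way of the set is replaced. */
static bool access_2way(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);
    struct sa_line *set = sets[index];

    for (unsigned w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == tag) {  /* hit in one of the ways */
            set[w].lru     = false;               /* most recently used     */
            set[1 - w].lru = true;
            return true;
        }

    unsigned victim;                              /* miss: pick a way       */
    if      (!set[0].valid) victim = 0;
    else if (!set[1].valid) victim = 1;
    else                    victim = set[0].lru ? 0 : 1;

    set[victim]         = (struct sa_line){ true, tag, false };
    set[1 - victim].lru = true;
    return false;
}

int main(void)
{
    /* 0x00001000 and 0x00101000 have identical middle (index) bits but
     * different tags: in a direct-mapped cache they would evict each
     * other, here they occupy the two ways of one set. */
    access_2way(0x00001000u);                                               /* miss */
    access_2way(0x00101000u);                                               /* miss */
    printf("%d %d\n", access_2way(0x00001000u), access_2way(0x00101000u));  /* 1 1  */
    return 0;
}
```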

Cache organizations - summary

Most common cache architecture is set-associative
better characteristics than direct-mapped at marginally higher cost
Fully-associative caches are no longer used to store code and data

Hit ratio

Proportion of cache hits to the total number of accesses in a given time period: $h = \frac{n_{cache}}{n_{total}}$
Depends on cache size, cache organization (associativity), and the replacement algorithm

Approximate relation between cache size and hit ratio
University/WUT/ECOAR/pictures/Pasted image 20250413160717.png
0.9 is achieved at about 8KiB
Above 0.9, architecture and replacement algorithm become significant factors

Average access time

Average access time of memory hierarchy composed of single cache and main memory: $t_{AVG} = h \cdot t_{CACHE} + (1 - h) \cdot t_{MEM}$
$h$ - cache hit ratio, $m = 1 - h$ - cache miss ratio
The cache connected to the processor must be designed in such a way that it can supply data as fast as the processor requires - without any delays (assume $t_{CACHE} = 1$)
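For example, with an assumed $h = 0.9$, $t_{CACHE} = 1$ cycle and $t_{MEM} = 10$ cycles (figures chosen only for illustration): $t_{AVG} = 0.9 \cdot 1 + 0.1 \cdot 10 = 1.9$ cycles. With the same hit ratio but $t_{MEM} = 100$ cycles, $t_{AVG} = 0.9 + 10 = 10.9$ cycles - the misses dominate, which is the reasoning behind the roughly 10× limit noted below.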

Conclusions

An intuitively "high" hit ratio does not necessarily translate into balanced cooperation between the processor and the memory hierarchy
A single cache may compensate the difference in access time not exceeding 10x
To reduce average access time - improve miss response time

Multilevel caches

The required speed of L1 cache is a limiting factor for its capacity and associativity
L2 may be slower than L1, so its associativity may be higher and it can be bigger
If L2 together with memory does not provide satisfactory average access time, L3 may be introduced
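A sketch of the corresponding relation for two levels, with local hit ratios $h_1$, $h_2$ (numbers below chosen only for illustration):
$$t_{AVG} = h_1 \cdot t_{L1} + (1 - h_1)\left(h_2 \cdot t_{L2} + (1 - h_2) \cdot t_{MEM}\right)$$
e.g. for $h_1 = h_2 = 0.9$, $t_{L1} = 1$, $t_{L2} = 4$, $t_{MEM} = 100$ cycles: $t_{AVG} = 0.9 + 0.1 \cdot (3.6 + 10) = 2.26$ cycles.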

Cache operation during write cycles

Two possible reactions of the cache during write cycles

Write policies

University/WUT/ECOAR/pictures/Pasted image 20250413162610.png
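Assuming the two classical policies, write-through and write-back, a minimal sketch of what each does on a write hit (structures and helper names are illustrative, not from the lecture):

```c
#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

/* Stub standing in for a (slow) main-memory write. */
static void memory_write(uint32_t addr, uint8_t value) { (void)addr; (void)value; }

/* Write-through: cache and main memory are updated together,
 * so memory always holds the current value. */
void write_through_hit(struct line *l, uint32_t addr, unsigned offset, uint8_t value)
{
    l->data[offset] = value;
    memory_write(addr, value);
}

/* Write-back: only the cache is updated and the line is marked dirty;
 * memory is updated later, when the dirty line is replaced. */
void write_back_hit(struct line *l, unsigned offset, uint8_t value)
{
    l->data[offset] = value;
    l->dirty = true;
}

int main(void)
{
    struct line l = { .valid = true };
    write_through_hit(&l, 0x2000u, 0, 0x11);
    write_back_hit(&l, 1, 0x22);        /* l.dirty is now true */
    return 0;
}
```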

Inclusive caches

used until 2000
Data flow: Memory <-> L2 <-> L1 <-> Processor
Every object contained in an upper layer is also present in lower layers
Effective capacity of caches is equal to the capacity of the biggest cache

Exclusive caches

after 2000
Typically L2 is filled with objects removed from L1
L2 - victim cache
Data flow: Memory <-> L1 <-> Processor, L1 <-> L2 <-> Memory
L2 contains objects not present in L1
During L1 miss causing L2 hit, lines are exchanged
Effective total capacity is a sum of capacities of all levels
L2 capacity may be equal to that of L1
L2 associativity should be higher than L1
University/WUT/ECOAR/pictures/Pasted image 20250413163124.png

L1 read miss handling

L2 hit: the requested line is moved from L2 to L1, and the line evicted from L1 is moved to L2 (the lines are exchanged)
L2 miss: data is fetched from memory directly into L1; the line evicted from L1 is moved to L2
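A rough sketch of the L2-hit exchange (ignoring the re-indexing between the two levels that a real design must also handle; structures are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64

struct line { bool valid; uint32_t tag; uint8_t data[LINE_SIZE]; };

/* Exclusive hierarchy, L1 read miss that hits in L2: the requested line
 * moves from L2 into L1 and the line evicted from L1 takes its place in
 * L2 (victim cache), so the same data is never kept in both levels. */
void exchange_on_l1_miss(struct line *l1_victim, struct line *l2_hit)
{
    struct line tmp = *l1_victim;   /* line pushed out of L1          */
    *l1_victim = *l2_hit;           /* requested line now lives in L1 */
    *l2_hit    = tmp;               /* evicted line now lives in L2   */
}

int main(void)
{
    struct line in_l1 = { true, 0x11, {0} };   /* line currently in L1 (to be evicted) */
    struct line in_l2 = { true, 0x22, {0} };   /* requested line, currently in L2      */
    exchange_on_l1_miss(&in_l1, &in_l2);       /* now in_l1.tag==0x22, in_l2.tag==0x11 */
    return 0;
}
```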

Coherency of memory hierarchy

Memory hierarchy must be coherent
Every access to a given address must yield the same data value, regardless of the layer being accessed
Coherency poses a problem when there is more than one path to the memory hierarchy (e.g. DMA transfers or another processor accessing memory)

Methods of achieving coherency

Invalidation of whole cache content upon detection of an external access to memory - used when caches were small
Selective invalidation of a line potentially storing data from address being accessed (until ~1990)
Software invalidation of a whole cache after external access
Selective change in line state after detecting the access to the address range stored in a line

Cache coherency protocols

State machine implemented for each line separately
Line states:
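The state set depends on the protocol; as one common example (an assumption here, since no protocol is named above), a MESI-style machine keeps one of four states per line and reacts to local and snooped accesses roughly like this:

```c
#include <stdio.h>

/* MESI-style line states - shown only as a common example of a coherency
 * protocol; the notes above do not name a specific protocol. */
enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Transition on a write performed by the local processor. */
enum state local_write(enum state s)
{
    (void)s;
    return MODIFIED;            /* a completed local write leaves the line Modified */
}

/* Transition on observing (snooping) a read by another bus master. */
enum state snoop_read(enum state s)
{
    switch (s) {
    case MODIFIED:              /* dirty data must be supplied / written back first */
    case EXCLUSIVE:
        return SHARED;          /* someone else now holds a copy too                */
    default:
        return s;               /* SHARED and INVALID are unaffected                */
    }
}

/* Transition on snooping a write by another bus master. */
enum state snoop_write(enum state s)
{
    (void)s;
    return INVALID;             /* our copy would become stale, so invalidate it    */
}

int main(void)
{
    enum state s = EXCLUSIVE;
    s = local_write(s);         /* MODIFIED */
    s = snoop_read(s);          /* SHARED   */
    s = snoop_write(s);         /* INVALID  */
    printf("final state: %d\n", s);
    return 0;
}
```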