Sachin Tolay

Understanding CPU Cache Organization and Structure

Software performance is deeply influenced by how efficiently memory is accessed. The story behind memory access latency is layered: it begins with CPU caches, traverses through virtual memory translation, and ultimately reaches physical DRAM. Each layer introduces its own overhead and optimization challenges.

If you’re interested in learning more about how virtual memory and DRAM work, you may want to explore the related articles on those topics.

This article focuses solely on CPU caches, the second fastest layer in the memory hierarchy after CPU registers. We’ll dive into the structural design of CPU caches, how they manage data placement and lookup, and how this affects the speed of your code.

Why Do We Need CPU Caches?

CPU caches were introduced to bridge the vast speed gap between fast processors and much slower main memory. Without addressing this gap, CPUs would frequently stall, waiting hundreds of CPU cycles for data.

By the late 1960s, this mismatch had become a major bottleneck. Computer architects considered three possible solutions:

  1. Make main memory faster
    • Technologies like magnetic core and early DRAM were too slow to match CPU speeds.
    • Faster memory (SRAM) was either too expensive or not scalable to large sizes.
  2. Use the CPU more efficiently
    • Techniques like pipelining and instruction reordering were emerging.
    • While they helped hide latency, they didn’t eliminate it, and the memory bottleneck still remained.
  3. Introduce a small, fast buffer (cache)
    • SRAM wasn’t affordable or scalable for large memory sizes, but it was ideal for implementing small, fast memory layers close to the CPU.
    • Studies of real-world workloads, from operating systems and web servers to machine learning and databases, show that a small portion of memory handles the majority of accesses. This concentration of activity enables caches to be highly effective despite their limited size.

Though all three directions have seen continued development, caches offered the best trade-off between performance, cost, and complexity at the time and became a fundamental part of CPU architecture from the 1970s onward.

Memory Hierarchy: Where Do Caches Fit?

Modern computer systems organize memory into a hierarchy to balance speed, capacity, and cost:

  1. At the top are CPU registers → tiny, ultra-fast storage tightly coupled with the processor.
  2. Just below the registers are CPU caches → small, fast buffers that store recently or frequently used data. They are organized into levels:
    • L1 and L2 caches are private to each CPU core.
    • L3 cache is larger and shared across all cores.
  3. When data is needed, the CPU checks these caches in order: L1 → L2 → L3. If the data is not found (a cache miss), it falls back to main memory (DRAM), which is significantly slower.
  4. Each level down the hierarchy offers more capacity and lower cost per bit, but at the cost of higher latency.


Cache Organization And Structure

Caches store data in fixed-size chunks called cache lines, usually 64 bytes long. These lines are grouped into sets, and how many cache lines each set can hold depends on the cache’s associativity (also called the number of ways). This design is similar to a hashmap: each set acts like a hash bucket, and the multiple ways within a set are like chained entries for resolving collisions.

  • Direct-mapped (1-way associative): Each memory address maps to exactly one set and one specific cache line → so during a lookup, only that single line needs to be checked to determine if the memory address’s data is present.
  • N-way set-associative (e.g., 2-way, 4-way): Each set holds multiple cache lines → for example, 4 lines in a 4-way associative cache. A memory address can be stored in any of these lines, so during a lookup, all lines in the set are checked to determine if the data is present.
  • Fully associative: There are no set divisions → any memory address can be stored in any cache line. During a lookup, all cache lines must be checked to determine if the data is present.
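
To make the geometry concrete, here is a minimal sketch of how the number of sets falls out of the cache size, line size, and associativity. The 32 KiB / 8-way / 64-byte numbers are illustrative assumptions, not tied to any particular CPU:

#include <stdio.h>

int main(void) {
    // Illustrative parameters: a 32 KiB, 8-way cache with 64-byte lines.
    unsigned cache_size_bytes = 32 * 1024;
    unsigned line_size_bytes  = 64;
    unsigned ways             = 8;

    unsigned total_lines = cache_size_bytes / line_size_bytes; // 512 lines
    unsigned sets        = total_lines / ways;                 // 64 sets of 8 lines each

    printf("lines = %u, sets = %u, lines per set = %u\n", total_lines, sets, ways);
    return 0;
}

A direct-mapped cache is simply the ways = 1 case; a fully associative cache is the opposite extreme, with a single set whose number of ways equals the total line count.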

How a Memory Address Maps to Cache: A Hashmap Analogy

CPU caches work a lot like hashmaps. Just as hashmaps use keys to store and retrieve values efficiently, caches use memory addresses to store and find data quickly. Let’s break it down.

Cache Lookup: Using Index, Tag, and Offset from a Physical Address

A physical address is split into three fields, from most significant to least significant bits: [ tag | index | offset ]

Index → Finding the Right Set

Hashmap analogy: this is like using a hash function to find the right bucket.

As mentioned earlier, the cache is divided into sets, and each set holds multiple cache lines (depending on the associativity). The index bits of the memory address are used to select the corresponding set.

For example → if your cache has 256 sets, 8 bits of the address would serve as the index (since 2⁸ = 256). This tells the CPU which set to search in → just like a hashmap uses a hash of the key to find a bucket.
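
As a tiny sketch of that calculation, assuming 64-byte lines (so 6 offset bits sit below the index bits) and 256 sets:

// 64-byte lines -> 6 offset bits; 256 sets -> 8 index bits above them.
unsigned set_index(unsigned long addr) {
    return (addr >> 6) & (256 - 1); // drop the offset bits, keep the 8 index bits
}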

Tag → Matching the Entry Inside the Set (Handling Collisions)

Hashmap analogy: once you’re in the right bucket, compare the stored keys to resolve the collision.

Because many memory addresses can share the same index (i.e., they map to the same set), the cache must resolve these collisions. This is where the tag comes in.

As mentioned earlier, each set in an N-way set-associative cache contains N cache lines. When the CPU accesses a memory address, it uses the index to locate the set, and then compares the tag from the incoming address against the tags stored in all N cache lines of that set.

  • If a match is found → cache hit
  • If no match → cache miss, and one of the existing lines may be evicted to make room

Offset → Extracting the Right Byte

Hashmap analogy: retrieving a specific part of the value associated with the key.

As mentioned earlier, each cache line typically holds a block of contiguous memory, such as 64 bytes. Once the correct cache line is identified through a cache hit (via index and tag comparison), the CPU uses the offset bits from the address to select the exact byte within that cache line. For a 64-byte cache line, we need 6 bits for the offset → since 2⁶ = 64.
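
Putting the three fields together, here is a sketch of the full address breakdown for the illustrative geometry used above (64-byte lines and 256 sets); real caches differ only in the field widths:

#include <stdio.h>

#define OFFSET_BITS 6   // 2^6 = 64-byte cache line
#define INDEX_BITS  8   // 2^8 = 256 sets

int main(void) {
    unsigned long addr = 0x1A3F;

    unsigned long offset = addr & ((1UL << OFFSET_BITS) - 1);                  // byte within the line
    unsigned long index  = (addr >> OFFSET_BITS) & ((1UL << INDEX_BITS) - 1);  // which set
    unsigned long tag    = addr >> (OFFSET_BITS + INDEX_BITS);                 // identifies the block

    // For 0x1A3F this prints: offset = 0x3f, index = 0x68, tag = 0x0
    printf("offset = 0x%lx, index = 0x%lx, tag = 0x%lx\n", offset, index, tag);
    return 0;
}

On a lookup, the index selects the set, the tag is compared against every line in that set, and the offset picks the byte on a hit.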

So far we have seen how the CPU uses the index, tag, and offset to find data in the cache on a cache hit. But what if none of the tags in the set match? This is a cache miss, and the CPU must fetch the entire data block (one cache line’s worth) from slower main memory. With a 64-bit memory bus moving 8 bytes per transfer, filling a 64-byte cache line takes a burst of 8 transfers, on top of the DRAM access latency that makes a miss so much more expensive than a hit.

Cache Update: What Happens on a Cache Miss

When a cache miss occurs, the cache must be updated with the data block containing the requested memory address. Here’s how this process works step-by-step, with a simplified code sketch after the list:

  1. Fetching Data from Main Memory → The CPU reads the entire data block (e.g., 64 bytes) starting at the aligned memory address from main memory. For example, if the cache line size is 64 bytes and the CPU requests address 0x1A3F, the cache loads the whole 64-byte block starting at the aligned base address 0x1A00 (since 64 bytes require 6 offset bits, the lower 6 bits are cleared).
  2. Locating a Slot in the Set → The block is loaded into the cache set identified by the index bits. If the cache is N-way set associative, the data can be placed in any of the N cache lines within that set. If there’s a free line available, the new block is stored there directly. Otherwise, the cache must evict an existing line based on its replacement policy.
  3. Evicting an Existing Line if Needed → If the set is full (all lines are occupied), the cache uses a replacement policy, like Least Recently Used (LRU), to decide which existing cache line to evict to make room for the new block.
  4. Updating the Tag → The tag field of the selected cache line is updated to reflect the new block’s address, enabling future hits.
  5. Populating the Cache Hierarchy → A cache miss in L1 doesn't just populate L1; in a typical inclusive hierarchy, the fetched block is inserted into all relevant levels (L1, L2, and L3). Exclusive and non-inclusive designs handle this differently.
  6. Resuming Access → After the update, the CPU can access the requested byte in the cache line using the offset bits, just as it would on a cache hit.
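
Here is the simplified sketch promised above. It models a single set of an N-way cache in C; the struct line, choose_victim, and fetch_block names are made up for illustration, and step 5 (filling the other cache levels) is left out:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WAYS      8
#define LINE_SIZE 64

// One cache line: a valid bit, the stored tag, and the 64-byte data block.
struct line { bool valid; uint64_t tag; uint8_t data[LINE_SIZE]; };

// Placeholder replacement policy: always evict way 0.
// A real cache would use LRU/PLRU here (see the next sections).
int choose_victim(struct line set[WAYS]) { (void)set; return 0; }

// Placeholder for reading one aligned 64-byte block from main memory.
void fetch_block(uint64_t block_addr, uint8_t out[LINE_SIZE]) {
    (void)block_addr;
    memset(out, 0, LINE_SIZE); // pretend this copied the block from DRAM
}

// Handle a miss within one set of an N-way set-associative cache.
uint8_t handle_miss(struct line set[WAYS], uint64_t tag,
                    uint64_t block_addr, unsigned offset) {
    int way = -1;
    for (int i = 0; i < WAYS; ++i)             // step 2: look for a free line in the set
        if (!set[i].valid) { way = i; break; }
    if (way < 0)                               // step 3: set is full -> evict per policy
        way = choose_victim(set);

    fetch_block(block_addr, set[way].data);    // step 1: fetch the aligned block
    set[way].valid = true;
    set[way].tag   = tag;                      // step 4: record the new tag
    return set[way].data[offset];              // step 6: serve the requested byte
}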

Comparing Different Set Associative Types

  • Direct-mapped → simplest hardware and fastest lookup (only one line to check), but addresses that share a set constantly evict each other, causing conflict misses.
  • N-way set-associative → checks N lines per lookup; far fewer conflict misses at a modest hardware cost, which is why most real CPU caches use this design.
  • Fully associative → the fewest conflict misses, but every line must be searched on each access, so it is only practical for very small structures.

Cache Replacement Policies

During a cache miss, when a new memory block needs to be brought into the cache but the set it maps to is already full, the cache must decide which existing block to evict. This decision is made using a replacement policy. The most common policies are listed below, followed by a small LRU sketch:

  • LRU (Least Recently Used) → Evicts the block that hasn’t been used for the longest time.
  • PLRU (Pseudo-LRU) → An approximation of LRU designed for efficient hardware implementation, often used in modern CPUs to balance complexity and performance.
  • FIFO (First-In, First-Out) → Evicts the block that has been in the cache the longest, regardless of usage.
  • Random → Chooses a block to evict at random; simple to implement in hardware, though less predictable.
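
As an illustration of the LRU idea (not how hardware implements it; tracking exact recency per line is too expensive, which is exactly why PLRU exists), a software model can simply timestamp each way on every access:

#define WAYS 8

static unsigned long now;             // global access counter ("time")
static unsigned long last_used[WAYS]; // when each way in the set was last touched

// Call on every hit, and after every fill, for the way that was accessed.
void touch(int way) {
    last_used[way] = ++now;
}

// Call on a miss when the set is full: evict the least recently used way.
int choose_victim_lru(void) {
    int victim = 0;
    for (int i = 1; i < WAYS; ++i)
        if (last_used[i] < last_used[victim]) // older timestamp = less recently used
            victim = i;
    return victim;
}

FIFO would instead record when each way was filled (ignoring later hits), and Random would simply pick a way using a cheap pseudo-random number.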

How Cache Structure Affects Performance

The way caches are organized → including their line size, associativity, and replacement policies → directly influences how well your code performs. Caches are designed to exploit two key principles of memory access: Temporal Locality and Spatial Locality.

Temporal Locality

Programs tend to reuse data they accessed recently. If a value is used in quick succession, keeping it in the cache can significantly reduce memory latency.

Mechanisms Targeting Temporal Locality

  1. Retaining recently accessed data: Frequently used values → like loop counters, accumulator variables, or function arguments → are kept in the L1 cache, where access is fastest. As long as these values are reused quickly, they remain cached.
  2. Smart replacement policies: When a cache needs to evict data, it uses policies like LRU or PLRU to decide which data is least likely to be reused. This helps preserve recently accessed (and likely-to-be-reused) blocks.
  3. Multi-level cache hierarchy: If a block is evicted from L1 due to pressure, it may still exist in L2 or L3 → larger, slightly slower caches that act as a backup. This layered design improves hit rates across varying temporal reuse patterns.
  4. Prefetching with reuse prediction: Modern CPUs can predict reuse based on past access patterns. If the processor or compiler detects that certain memory is accessed repeatedly, it can preemptively fetch or retain that data to improve hit rates.

Example → Local variables in a loop:

int sum = 0;
for (int i = 0; i < 1000; ++i) {
   sum += i;
}

The variables sum and i are updated on every iteration. They remain in registers or the L1 cache because of temporal locality.

Example → Global counters or state flags:

int error_count = 0;
void log_error() {
  error_count++;
}

If log_error() is called frequently, the global variable error_count is repeatedly accessed and benefits from temporal locality.

Pitfalls That Break Temporal Locality

  • Large working sets: If your program uses more data than can fit in the cache, older data gets evicted before it can be reused.
  • Irregular access patterns: Frequently jumping between unrelated memory regions (e.g., random linked list traversal) prevents the cache from predicting reuse, as the sketch below illustrates.
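
To make the second pitfall concrete, here is a sketch contrasting pointer chasing through heap-allocated list nodes with a plain array walk; the node layout is illustrative:

#include <stddef.h>

struct node { int value; struct node *next; };

// Pointer chasing: each 'next' can point anywhere in memory, so the cache
// and the prefetcher cannot predict which line will be needed next.
long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

// The same reduction over a contiguous array: sequential, prefetch-friendly.
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

If the nodes happen to be allocated contiguously and traversed in order, the two behave similarly; it is the scattered, unpredictable placement that hurts.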

Spatial Locality

Programs also tend to access memory locations that are close together. That’s why caches load entire cache lines (e.g., 64 bytes). If you read one byte, there’s a good chance nearby bytes will be accessed soon → so pre-loading them pays off.

Mechanisms Targeting Spatial Locality

  1. Cache lines store blocks of data: When a CPU accesses one memory address, the cache loads an entire cache line from main memory. This block includes the requested byte and adjacent ones, anticipating that they’ll be accessed soon.
  2. Block-based memory transfer: Since a single cache line holds 64 bytes, accessing nearby addresses (like a[i+1], a[i+2], etc.) usually needs no extra trips to main memory → they’re already in the same cache line, or in a neighboring line the prefetcher has pulled in.
  3. Hardware prefetching: Modern CPUs automatically detect sequential access patterns (like array iteration) and start fetching the next cache lines ahead of time, reducing wait time for upcoming memory accesses.

Example → Array iteration

for (int i = 0; i < N; ++i) {
  a[i] = a[i] * 2;
}

Elements a[0], a[1], a[2], … are stored contiguously in memory. Accessing them in order leverages spatial locality → once a cache line is loaded, the next few elements are likely already there.

Example → Struct field access

struct Point { int x; int y; } p;
int sum = p.x + p.y;

Both x and y are stored next to each other in memory. Accessing them together benefits from spatial locality.

Example → Matrix access (row-wise)

for (int i = 0; i < rows; ++i)
  for (int j = 0; j < cols; ++j)
    matrix[i][j] += 1;

If the matrix is stored in row-major order (as in C/C++), this access pattern walks through memory linearly, maximizing spatial locality.

Pitfalls That Break Spatial Locality

  1. Strided accesses with large gaps: Accessing data with large steps skips over useful cache lines.

    for (int i = 0; i < N; i += 64) {
      process(a[i]);
    }
    

    With 4-byte ints, a stride of 64 elements is 256 bytes → each access lands on a new cache line, and the rest of every preloaded line goes unused.

  2. Accessing only one field in a wide struct repeatedly:

    struct Big { int a; char padding[60]; int b; };
    Big arr[100];
    for (int i = 0; i < 100; ++i) {
      process(arr[i].a);
    }
    

    Here, each cache line holds mostly unused data, leading to cache pollution.

  3. Unaligned data structures: If a structure (or even a single field) straddles a cache-line boundary due to misalignment, touching it can pull in two cache lines instead of one. A sketch of the usual fix follows below.
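
A common mitigation, sketched below, is to pad and align hot structures to the cache-line size so that no object straddles a line boundary. alignas is standard C11/C++11; the 64-byte line size is an assumption:

#include <stdalign.h>
#include <stdio.h>

// 4 + 60 + 4 = 68 bytes (with 4-byte ints): array elements placed at arbitrary
// offsets will regularly straddle two 64-byte cache lines.
struct Big { int a; char padding[60]; int b; };

// Aligning to the line size makes every object start on a line boundary
// (the compiler pads it out to a multiple of 64 bytes in exchange).
struct AlignedBig { alignas(64) int a; char padding[60]; int b; };

int main(void) {
    printf("Big:        size %zu, alignment %zu\n",
           sizeof(struct Big), alignof(struct Big));
    printf("AlignedBig: size %zu, alignment %zu\n",
           sizeof(struct AlignedBig), alignof(struct AlignedBig));
    return 0;
}

The trade-off is space: the aligned version is larger, so this only pays off for structures that are actually hot.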


If you have any feedback on the content, suggestions for improving the organization, or topics you’d like to see covered next, feel free to share → I’d love to hear your thoughts!
