Hardware Prefetcher

The Sapphire RV64 CPU includes an optional hardware prefetcher that anticipates future memory accesses and fetches data or instructions into the cache before they are explicitly requested by the pipeline. This reduces the likelihood of cache misses, minimizes memory access latency, and improves overall pipeline throughput. The prefetcher operates differently for instructions and data, using separate algorithms optimized for their access patterns.

For instruction prefetch, the prefetcher employs a next-line algorithm. When an instruction fetch occurs, the prefetcher predicts that the cache line sequentially following the current one will be accessed soon. When the current instruction line is fetched from memory, the prefetcher also checks whether the next consecutive cache line is missing from the instruction cache and, if so, initiates a memory request to refill it. This allows the next instruction block to be loaded into the instruction cache while the pipeline continues executing instructions from the current line. If the sequential access pattern continues, the next instructions are already in the instruction cache when the pipeline reaches them, effectively hiding the memory latency for sequential instruction streams. If the next-line prediction is incorrect, for example because a branch is taken and execution jumps elsewhere, the prefetched line goes unused and costs some extra memory traffic, but correctness is unaffected because the pipeline always relies on the actual instruction fetch to supply the instructions it executes.
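
The next-line behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the Sapphire RV64 implementation: the 64-byte line size, the unbounded set standing in for the instruction cache, and the names LINE_BYTES, line_base, and fetch are all hypothetical. The sketch checks the next line on every fetch, which is one possible reading of the description; the trigger condition used by the hardware may differ.

```python
# Behavioral model of next-line instruction prefetch (illustrative only).
LINE_BYTES = 64  # assumed cache-line size, not a documented Sapphire parameter


def line_base(addr: int) -> int:
    """Return the base address of the cache line that contains addr."""
    return addr & ~(LINE_BYTES - 1)


def fetch(pc: int, icache: set) -> tuple:
    """Model one instruction fetch at address pc.

    icache is a set of line base addresses currently held in the cache.
    Returns (demand_miss, prefetched_line), where prefetched_line is the
    base address requested by the prefetcher, or None if no prefetch
    was needed.
    """
    current = line_base(pc)

    # Demand access: the pipeline needs this line regardless of prefetching.
    demand_miss = current not in icache
    if demand_miss:
        icache.add(current)

    # Next-line prediction: request the following line only if it is missing.
    next_line = current + LINE_BYTES
    prefetched = None
    if next_line not in icache:
        icache.add(next_line)
        prefetched = next_line

    return demand_miss, prefetched
```

In this model, a sequential stream takes a demand miss only on its first line; each later line is already present because it was requested while the previous line was in use. A taken branch to a distant address produces a fresh demand miss, and the line prefetched before the branch simply stays unused.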

For data prefetch, the Sapphire RV64 prefetcher uses the reference prediction table (RPT) algorithm, which handles regular memory access patterns, including sequential and constant-stride accesses. The RPT tracks the most recent memory address accessed by each load and store instruction, indexed by instruction address, and records the stride between that instruction's consecutive accesses. When a particular load or store repeatedly accesses memory with a constant stride, the prefetcher predicts that its next memory access will occur at the address obtained by adding the observed stride to the last accessed address. A memory request for this predicted address is then issued in advance, allowing the data to be loaded into the data cache before the processor issues the next access. This mechanism is particularly effective for array traversal, loop-based computations, and other predictable access patterns. If the stride changes or the prediction is incorrect, the RPT entry is updated with the new stride, allowing the prefetcher to adapt dynamically to changing data access behavior.
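
The stride prediction described above can be sketched as a simplified table keyed by instruction address. This is a minimal model under stated assumptions, not the Sapphire RV64 design: the class name RPT, the method access, the unbounded dictionary, and the rule that a stride must be observed twice in a row before a prefetch is issued are all hypothetical. A hardware table is finite, and published RPT designs typically use a per-entry state machine, which this sketch omits.

```python
# Simplified reference prediction table (illustrative only).
class RPT:
    def __init__(self):
        # Maps instruction address (pc) -> [last_addr, stride].
        # Assumption: unbounded table; real hardware has a fixed entry count.
        self.table = {}

    def access(self, pc: int, addr: int):
        """Record a load or store at instruction address pc touching addr.

        Returns the predicted address to prefetch, or None when the
        stride for this instruction has not yet been confirmed.
        """
        entry = self.table.get(pc)
        if entry is None:
            # First access by this instruction: no stride is known yet.
            self.table[pc] = [addr, 0]
            return None

        last_addr, old_stride = entry
        stride = addr - last_addr

        # Always record the latest address and stride so the entry adapts
        # when the access pattern changes.
        entry[0] = addr
        entry[1] = stride

        # Predict only when the same non-zero stride was seen twice in a row.
        if stride != 0 and stride == old_stride:
            return addr + stride
        return None
```

For a loop that reads 8-byte elements from one load instruction, this model returns None for the first two accesses and then predicts each following element. When the stride changes, one prediction is skipped while the entry records the new stride, after which prefetching resumes.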

In operation, the prefetcher continuously monitors both instruction and data accesses. For instructions, it tracks the program counter to determine the current cache line and triggers the next-line fetch accordingly. For data, it records the memory address and stride of each load or store, updating the RPT to refine future predictions. These prefetch operations are performed transparently to the CPU pipeline and do not change the architectural state; the processor always reads from the cache or memory as usual, and correctness is guaranteed even if a prefetch is incorrect. Prefetched lines that are not used may be evicted based on the cache replacement policy, while correctly predicted prefetches reduce stall cycles and improve effective memory bandwidth.
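
The transparency property described above can be illustrated with a toy cache model. Everything here is an assumption made for illustration: the class name LRUCache, the fully associative organization, the least-recently-used replacement policy, and the two-line capacity in the test are hypothetical, since the text does not state the cache replacement policy. The point of the sketch is that a load always returns the value held in memory, while a prefetch only changes which lines are resident.

```python
# Toy cache showing that prefetches change residency, not results
# (illustrative only).
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        # Ordered from least recently used to most recently used.
        self.lines = OrderedDict()

    def _fill(self, line: int) -> None:
        """Make a line resident, evicting the least recently used if full."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the oldest line
        self.lines[line] = True

    def prefetch(self, line: int) -> None:
        """Speculatively bring a line in; no value reaches the pipeline."""
        self._fill(line)

    def load(self, line: int, memory: dict) -> tuple:
        """Demand load: returns (value, hit). The value always equals memory's."""
        hit = line in self.lines
        self._fill(line)
        return memory[line], hit
```

In this model, a line that was prefetched but never used is the first candidate for eviction once newer lines arrive, whereas a correctly predicted prefetch turns a later demand access into a hit.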

By combining a simple, sequential next-line algorithm for instructions with the adaptive, stride-aware RPT algorithm for data, the Sapphire RV64 SoC prefetcher provides targeted optimization for the different access characteristics of code and data. This design leverages the predictable nature of instruction streams and common data access patterns to reduce latency and improve overall CPU performance without requiring changes to the software.