Report Number: CSL-TR-94-628
Institution: Stanford University, Computer Systems Laboratory
Title: Tolerating Latency Through Software-Controlled Data
Prefetching
Author: Mowry, Todd C.
Date: June 1994
Abstract: The large latency of memory accesses in modern computer
systems is a key obstacle to achieving high processor
utilization. Furthermore, the technology trends indicate that
this gap between processor and memory speeds is likely to
increase in the future. While increased latency affects all
computer systems, the problem is magnified in large-scale
shared-memory multiprocessors, where physical dimensions
cause latency to be an inherent problem. To cope with the
memory latency problem, the basic solution that nearly all
computer systems rely on is their cache hierarchy. While
caches are useful, they are not a panacea.
Software-controlled prefetching is a technique for tolerating
memory latency by explicitly executing prefetch instructions
to move data close to the processor before it is actually
needed. This technique is attractive because it can hide both
read and write latency within a single thread of execution
while requiring relatively little hardware support.
Software-controlled prefetching, however, presents two major
challenges. First, some sophistication is required on the
part of either the programmer, runtime system, or
(preferably) the compiler to insert prefetches into the code.
Second, care must be taken that the overheads of prefetching,
which include additional instructions and increased memory
queueing delays, do not outweigh the benefits.
This dissertation proposes and evaluates a new compiler
algorithm for inserting prefetches into code. The proposed
algorithm attempts to minimize overheads by only issuing
prefetches for references that are predicted to suffer cache
misses. The algorithm can prefetch both dense-matrix and
sparse-matrix codes, thus covering a large fraction of
scientific applications. It also works for both uniprocessor
and large-scale shared-memory multiprocessor architectures.
We have implemented our algorithm in the SUIF (Stanford
University Intermediate Form) optimizing compiler. The
results of our detailed architectural simulations demonstrate
that the speed of some applications can be improved by as
much as a factor of two, both on uniprocessor and
multiprocessor systems. This dissertation also compares
software-controlled prefetching with other latency-hiding
techniques (e.g., locality optimizations, relaxed consistency
models, and multithreading), and investigates the
architectural support necessary to make prefetching
effective.
http://i.stanford.edu/pub/cstr/reports/csl/tr/94/628/CSL-TR-94-628.pdf