Report Number: CSL-TR-96-695
Institution: Stanford University, Computer Systems Laboratory
Title: Producer-Oriented versus Consumer-Oriented Prefetching: a
Comparison and Analysis of Parallel Application Programs
Author: Ohara, Moriyoshi
Date: June 1996
Abstract: Due to large remote-memory latencies, reducing the impact of
cache misses is critical for large-scale shared-memory
multiprocessors. This thesis quantitatively compares two
classes of software-controlled prefetch schemes for reducing
this impact: consumer-oriented and producer-oriented schemes.
Examining the behavior of these schemes leads us to
characterize the communication behavior of parallel
application programs.
Consumer-oriented prefetch has been shown to be effective for
hiding large memory latencies. Producer-oriented prefetch
(called deliver), on the other hand, has not been extensively
studied. Our implementation of deliver uses a hardware
mechanism that tracks the set of potential consumers based on
past sharing patterns. Qualitatively, deliver has an
advantage since the producer sends the datum as soon as, but
not before, it is ready for use. In contrast, prefetch may
fetch the datum too early so that it is invalidated before
use, or may fetch it too late so that the datum is not yet
available when it is needed by the consumer. Our simulation
results indeed show that the qualitative advantage of deliver
can yield a slight performance advantage when the cache size
and the memory latency are very large. Overall, however,
deliver turns out to be less effective than prefetch for two
reasons. First, prefetch benefits from a "filtering effect,"
and thus generates less traffic than deliver. Second, deliver
suffers more from cache interference than prefetch. The
sharing and temporal characteristics of a set of parallel
applications are shown to account for the different behavior
of the two prefetch schemes. This analysis shows the inherent
difficulties in predicting future communication behavior of
parallel applications from the recent history of that
behavior. This suggests that coherence-related cache accesses
are, in general, much less predictable from past behavior than
other types of cache behavior.
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/695/CSL-TR-96-695.pdf