Report Number: CSL-TR-96-695
Institution: Stanford University, Computer Systems Laboratory
Title: Producer-Oriented versus Consumer-Oriented Prefetching: a Comparison and Analysis of Parallel Application Programs
Author: Ohara, Moriyoshi
Date: June 1996
Abstract: Due to large remote-memory latencies, reducing the impact of cache misses is critical for large-scale shared-memory multiprocessors. This thesis quantitatively compares two classes of software-controlled prefetch schemes for reducing this impact: consumer-oriented and producer-oriented schemes. Examining the behavior of these schemes leads us to characterize the communication behavior of parallel application programs. Consumer-oriented prefetch has been shown to be effective for hiding large memory latencies. Producer-oriented prefetch (called deliver), on the other hand, has not been extensively studied. Our implementation of deliver uses a hardware mechanism that tracks the set of potential consumers based on past sharing patterns. Qualitatively, deliver has an advantage since the producer sends the datum as soon as, but not before, it is ready for use. In contrast, prefetch may fetch the datum too early, so that it is invalidated before use, or too late, so that the datum is not yet available when the consumer needs it. Our simulation results indeed show that the qualitative advantage of deliver can yield a slight performance advantage when the cache size and the memory latency are very large. Overall, however, deliver turns out to be less effective than prefetch for two reasons. First, prefetch benefits from a "filtering effect," and thus generates less traffic than deliver. Second, deliver suffers more from cache interference than prefetch. The sharing and temporal characteristics of a set of parallel applications are shown to account for the different behavior of the two prefetch schemes. This analysis shows the inherent difficulties in predicting the future communication behavior of parallel applications from the recent history of application behavior. This suggests that cache accesses involved with coherency are, in general, much less predictable from past behavior than other types of cache behavior.
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/695/CSL-TR-96-695.pdf