Report Number: CSL-TR-94-602
Institution: Stanford University, Computer Systems Laboratory
Title: Analyzing and Tuning Memory Performance in Sequential and Parallel Programs
Author: Martonosi, Margaret Rose
Date: January 1994
Abstract: Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. When cache misses are common, memory stalls can significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and effectively organize the data collected. These techniques help guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system. This thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data oriented statistics, a novel approach that associates program performance information with application data structures. Data oriented statistics, viewed alone or paired with traditional code oriented statistics, offer a powerful, new dimension for performance analysis. I develop techniques for aggregating statistics on similarly-used data structures and for extracting intuitive source-code names for statistics. The thesis also argues that MemSpy's detailed statistics on the frequency and causes of cache misses are crucial in understanding memory bottlenecks. Common memory performance bugs are often most easily distinguished by noting the causes of their resulting cache misses. Since collecting such detailed information seems, at first glance, to require large execution time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions out of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, having used MemSpy to tune several applications, these experiences demonstrate that MemSpy generates effective memory performance profiles, at speeds competitive with previous, less detailed approaches.
http://i.stanford.edu/pub/cstr/reports/csl/tr/94/602/CSL-TR-94-602.pdf