Report Number: CSL-TR-94-602
Institution: Stanford University, Computer Systems Laboratory
Title: Analyzing and Tuning Memory Performance in Sequential and
Parallel Programs
Author: Martonosi, Margaret Rose
Date: January 1994
Abstract: Recent architecture and technology trends have led to a
significant gap between processor and main memory speeds.
When cache misses are common, memory stalls can significantly
degrade execution time. To help identify and fix such memory
bottlenecks, this work presents techniques to efficiently
collect detailed information about program memory performance
and effectively organize the data collected. These techniques
help guide programmers or compilers to memory bottlenecks.
They apply to both sequential and parallel applications and
are embodied in the MemSpy performance monitoring system.
This thesis contends that the natural interrelationship
between program memory bottlenecks and program data
structures mandates the use of data-oriented statistics, a
novel approach that associates program performance
information with application data structures. Data-oriented
statistics, viewed alone or paired with traditional
code-oriented statistics, offer a powerful new dimension for
performance analysis. I develop techniques for aggregating
statistics on similarly-used data structures and for
extracting intuitive source-code names for statistics. The
thesis also argues that MemSpy's detailed statistics on the
frequency and causes of cache misses are crucial in
understanding memory bottlenecks. Common memory performance
bugs are often most easily distinguished by noting the causes
of their resulting cache misses.
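
As a rough illustration of statistics keyed by data structure, the
sketch below runs an address trace through a toy direct-mapped cache
and attributes each miss, with a cause, to the named region that
contains the address. It is not MemSpy's implementation: the cache
geometry, the function name profile, the region table, and the two
miss-cause labels are all assumptions made for this example.

```python
from collections import defaultdict

LINE_SIZE = 32   # bytes per cache line (assumed for illustration)
NUM_LINES = 4    # lines in the toy direct-mapped cache (assumed)


def profile(trace, regions):
    """Attribute cache misses to named data regions.

    trace:   iterable of byte addresses, in reference order
    regions: {name: (start, end)} address ranges, end exclusive
    """
    cache = [None] * NUM_LINES   # line number currently held in each slot
    seen = set()                 # line numbers referenced at least once
    stats = defaultdict(lambda: {"refs": 0, "first_ref": 0, "replacement": 0})

    def owner(addr):
        # Map an address back to the data structure that contains it.
        for name, (start, end) in regions.items():
            if start <= addr < end:
                return name
        return "<unknown>"

    for addr in trace:
        s = stats[owner(addr)]
        s["refs"] += 1
        line = addr // LINE_SIZE
        slot = line % NUM_LINES
        if cache[slot] == line:
            continue                 # cache hit: nothing more to record
        if line in seen:
            s["replacement"] += 1    # line was cached before, then evicted
        else:
            s["first_ref"] += 1      # first touch of this line
            seen.add(line)
        cache[slot] = line
    return dict(stats)
```

With two hypothetical arrays a at [0, 128) and b at [128, 256), the
trace [0, 128, 0, 128] makes the two arrays evict each other from the
same cache slot, so each is charged one first-reference miss and one
replacement miss. Seeing the cause per data structure is what points
to the conflict.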
Since collecting such detailed information seems, at first
glance, to require large execution time slowdowns, this
dissertation also evaluates techniques to improve the
performance of MemSpy's simulation-based monitoring. The
first optimization, hit bypassing, improves simulation
performance by specializing processing of cache hits. The
second optimization, reference trace sampling, improves
performance by simulating only sampled portions out of the
full reference trace. Together, these optimizations reduce
simulation time by nearly an order of magnitude. Overall,
experience using MemSpy to tune several applications
demonstrates that MemSpy generates effective memory
performance profiles at speeds competitive with previous,
less detailed approaches.
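
The idea behind reference trace sampling can be sketched as follows:
only a window of references out of every period is fed to the cache
simulator, and the miss rate is estimated from those windows alone.
This is a minimal sketch under stated assumptions, not the method
evaluated in the thesis. The function name sampled_miss_rate, its
parameters, the toy direct-mapped cache, and the choice to carry cache
contents unchanged across skipped references are all hypothetical
simplifications.

```python
def sampled_miss_rate(trace, sample_len, period,
                      line_size=32, num_lines=256):
    """Estimate the miss rate from sampled windows of a trace.

    Simulates the first `sample_len` references of every `period`
    references and skips the rest. Cache contents are kept as-is across
    skipped references (an assumption; stale state can bias the
    estimate at the start of each window).
    """
    cache = [None] * num_lines   # line number held in each slot
    refs = 0
    misses = 0
    for i, addr in enumerate(trace):
        if i % period >= sample_len:
            continue             # outside the sample window: not simulated
        line = addr // line_size
        slot = line % num_lines
        refs += 1
        if cache[slot] != line:
            misses += 1
            cache[slot] = line
    return misses / refs if refs else 0.0
```

Setting sample_len equal to period simulates the whole trace, which
gives a baseline against which a sampled estimate can be compared.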
http://i.stanford.edu/pub/cstr/reports/csl/tr/94/602/CSL-TR-94-602.pdf