Report Number: CSL-TR-98-759
Institution: Stanford University, Computer Systems Laboratory
Title: Optimized Multiprocessor Communication and Synchronization
Using a Programmable Protocol Engine
Author: Heinlein, John
Date: March 1998
Abstract: In recent years, multiprocessor designs have converged
towards a unified hardware architecture despite supporting
different communication abstractions. The implementation of
these communication abstractions and the associated protocols
in hardware is complex, inflexible, and error-prone. For
these reasons, some recent designs have employed a
programmable controller to manage system communication. One
particular focus of these designs is implementing cache
coherence protocols in software. This dissertation argues
that a programmable communication controller that provides
cache coherence can also effectively support block transfer
and synchronization protocols. This research is part of the
FLASH project, a major focus of which is exploring the
integration of multiple communication protocols in a single
multiprocessor architecture.
In our analysis, we examine the needs of protocols other than
cache coherence to identify the requirements they share. The
interface between the processor and controller is one
critical issue in these protocols, so we propose techniques
to export such protocols reliably, at low overhead, and
without system calls. Unlike most prior studies, our approach
supports a modern operating system with features like
multiprogramming, protection, and virtual memory.
Our study focuses in detail on two classes of communication
that are important for large-scale multiprocessors: block
transfer and synchronization using locks and barriers. In
particular, we attempt to improve the performance of these
classes of communication as compared to implementations using
only software on top of shared memory. For each protocol, we
identify the critical performance metrics, explore the
limitations of existing techniques, and then present our
implementation, which is tailored to leverage the
programmable communication controller. We evaluate each
protocol in isolation, in the context of microbenchmarks, and
within a variety of applications.
We find that embedding advanced communication and
synchronization features in a programmable controller has a
number of advantages. For example, the block transfer
protocol improves transfer performance in some cases, enables
the processor to perform other work in parallel, and reduces
processor cache pollution caused by the transfer. The
synchronization protocols reduce overhead and eliminate
bottlenecks associated with synchronization primitives
implemented using software on top of shared memory.
Simulations of scientific applications running on FLASH show
that, in many cases, synchronization support improves
performance and increases the range of machine sizes over
which the applications scale. Our study shows that embedded
programmability is a convenient approach for supporting block
transfer and synchronization, and that the FLASH system
design effectively supports this approach.
http://i.stanford.edu/pub/cstr/reports/csl/tr/98/759/CSL-TR-98-759.pdf