Report Number: CSL-TR-97-744
Institution: Stanford University, Computer Systems Laboratory
Title: The FLASH Multiprocessor: Designing a Flexible and Scalable System
Author: Kuskin, Jeffrey Scott
Date: November 1997
Abstract: The choice of a communication paradigm, or protocol, is central to the design of a large-scale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all protocol processing. The architecture of the MAGIC chip allows FLASH to support multiple communication paradigms - in particular, cache-coherent shared memory and high-performance message passing - while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and MAGIC, the custom node controller. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The result is a system that is flexible and scalable, yet competitive in performance with a traditional multiprocessor that implements a single communication paradigm completely in hardware. The focus of this dissertation is the architecture, design, and performance of FLASH. Much of the motivation behind the FLASH system and the MAGIC node controller design stems from an examination of the characteristics of protocol code and the architecture of the DASH system, the predecessor to FLASH. This examination led to two major design goals: developing a node controller architecture that attains high protocol processing performance while remaining flexible, and reducing the logic and memory overheads associated with cache coherence.
The MAGIC design achieves these goals by implementing on a single chip a programmable protocol engine with an instruction set optimized for the characteristics of protocol code, along with dedicated support logic to alleviate the most serious protocol processing performance bottlenecks - data movement, message dispatch, and lack of close coupling to the node board components. The design of the FLASH node complements the MAGIC design, matching the close coupling and high bandwidth support in MAGIC to provide a balanced node architecture. Next, the dissertation investigates the performance of cache coherence on FLASH. Performance results are presented from microbenchmarks run on the Verilog RTL of the MAGIC chip and from complete applications run on FlashLite, the FLASH system-level simulator. The microbenchmarks demonstrate that the architectural extensions added to the MAGIC design - particularly the instruction set optimizations to the programmable protocol processor - yield significantly lower latencies and protocol processor occupancies when servicing the most common types of memory operations. The application results are used to evaluate the performance costs of flexibility by comparing the performance of FLASH to that of a hardwired machine on representative parallel applications and multiprogramming workloads. These results show that poor application memory reference or load balancing characteristics cause the performance of the FLASH system to degrade more rapidly than the performance of the hardwired system; that is, FLASH's performance is less robust. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. Overall, however, the performance of FLASH can be competitive with the performance of the hardwired machine.
Specifically, for a range of optimized parallel applications, the performance differences between the hardwired machine and FLASH are small, typically less than 10% at 32 processors and less than 15% at 64 processors. For these programs, either the processor cache miss rates are small or the latency of the programmable protocol processing can be hidden behind the memory access time.
URL: http://i.stanford.edu/pub/cstr/reports/csl/tr/97/744/CSL-TR-97-744.pdf