Report Number: CSL-TR-96-711
Institution: Stanford University, Computer Systems Laboratory
Title: Design Issues in High Performance Floating Point Arithmetic Units
Author: Oberman, Stuart Franklin
Date: December 1996
Abstract: In recent years, computer applications have increased in
computational complexity. The industry-wide use of performance
benchmarks, such as SPECmarks, forces processor designers to pay
particular attention to the implementation of the floating point
unit, or FPU. Special purpose
applications, such as high performance graphics rendering
systems, have placed further demands on processors. High
speed floating point hardware is a requirement to meet these
increasing demands. This work examines the state-of-the-art
in FPU design and proposes techniques for improving the
performance and the performance/area ratio of future FPUs.
In recent FPUs, emphasis has been placed on designing
ever-faster adders and multipliers, with division receiving
less attention. The design space of FP dividers is large,
comprising five different classes of division algorithms:
digit recurrence, functional iteration, very high radix,
table look-up, and variable latency. While division is an
infrequent operation even in floating point intensive
applications, it is shown that ignoring its implementation
can result in system performance degradation. A high
performance FPU requires a fast and efficient adder,
multiplier, and divider.
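
As a rough illustration of the difference between two of the classes
named above, the sketch below contrasts a radix-2 digit recurrence
(linear convergence, one quotient bit per iteration) with
Newton-Raphson functional iteration (quadratic convergence, roughly
doubling the correct bits per step). This is a hypothetical Python
illustration assuming normalized operands in [1, 2); it is not the
hardware design studied in the report.

```python
# Hypothetical software sketch of two division-algorithm classes;
# the function names and operand assumptions are illustrative only.

def divide_digit_recurrence(a, b, bits=26):
    """Radix-2 restoring digit recurrence: retires one quotient bit per
    iteration (linear convergence).  Assumes 1 <= a, b < 2."""
    w = a / 2.0                  # pre-scale so the partial remainder stays in [0, b)
    q = 0.0
    for j in range(1, bits + 1):
        w *= 2.0
        if w >= b:               # quotient-digit selection
            w -= b
            q += 2.0 ** (1 - j)  # weight accounts for the 1/2 pre-scaling
    return q

def divide_newton_raphson(a, b, iterations=4):
    """Functional iteration: refine a reciprocal estimate of b, roughly
    doubling the number of correct bits each step (quadratic convergence)."""
    x = 2.0 / 3.0                # crude seed for 1/b; a real FPU would use a small table
    for _ in range(iterations):
        x = x * (2.0 - b * x)    # Newton-Raphson step for f(x) = 1/x - b
    return a * x

if __name__ == "__main__":
    print(divide_digit_recurrence(1.5, 1.2))   # ~1.25
    print(divide_newton_raphson(1.5, 1.2))     # ~1.25
```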
The design question becomes how to best implement the FPU in
order to maximize performance given the constraints of
silicon die area. The system performance and area impact of
functional unit latency is examined for varying instruction
issue rates in the context of the SPECfp92 application suite.
Performance implications are investigated for shared
multiplication hardware, shared square root, on-the-fly
rounding and conversion, and fused functional units. Due to
the importance of low latency FP addition, a variable latency
FP addition algorithm has been developed which improves
average addition latency by 33% while maintaining
single-cycle throughput. To improve the performance and area
of linearly converging division algorithms, an automated
process is proposed for minimizing the complexity of SRT
tables. To reduce the average latency of
quadratically converging division algorithms, the technique
of reciprocal caching is proposed, along with a method to
reduce the latency penalty for exact rounding. A combination
of the proposed techniques provides a basis for future high
performance floating point units.
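
A minimal sketch of the reciprocal-caching idea, under the assumption
that divisors often repeat across nearby divisions: a cache hit
supplies the reciprocal directly, so the division collapses to a
single multiplication, while a miss pays the full iterative latency.
The class name, cache capacity, and fill policy below are hypothetical
and are not the mechanism specified in the report.

```python
# Hypothetical illustration of reciprocal caching for a quadratically
# converging divider; structure and parameters are assumptions.

class ReciprocalCachingDivider:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = {}                      # divisor -> previously computed reciprocal

    def _reciprocal(self, b, iterations=4):
        # Stands in for the full-latency Newton-Raphson path (b assumed in [1, 2)).
        x = 2.0 / 3.0
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return x

    def divide(self, a, b):
        recip = self.cache.get(b)
        if recip is None:                    # miss: pay the iterative latency
            recip = self._reciprocal(b)
            if len(self.cache) < self.capacity:
                self.cache[b] = recip        # simple fill-on-miss policy
        return a * recip                     # hit: a single multiplication
```

For example, repeated divisions by the same divisor, as when
normalizing many vector components by one length, would miss only on
the first division and hit on every subsequent one.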
http://i.stanford.edu/pub/cstr/reports/csl/tr/96/711/CSL-TR-96-711.pdf