Report Number: CSL-TR-94-617
Institution: Stanford University, Computer Systems Laboratory
Title: Fast Multiplication: Algorithms and Implementations
Author: Bewick, Gary W.
Date: April 1994
Abstract: This thesis investigates methods of implementing binary
multiplication with the smallest possible latency. The
principle area of concentration is on multipliers with
lengths of 53 bits, which makes the results suitable for
IEEE-754 double precision multiplication.
Low latency demands high performance circuitry, and small
physical size to limit propagation delays. VLSI
implementations are the only available means for meeting
these two requirements, but efficient algorithms are also
crucial. An extension to Booth's algorithm for multiplication
(redundant Booth) has been developed, which represents
partial products in a partially redundant form. This
redundant representation can reduce or eliminate the time
required to produce "hard" multiples (multiples that require
a carry propagate addition) required by the traditional
higher order Booth algorithms. This extension reduces the
area and power requirements of fully parallel
implementations, but is also as fast as any multiplication
method yet reported.
In order to evaluate various multiplication algorithms, a
software tool has been developed which automates the layout
and optimization of parallel multiplier trees. The tool takes
into consideration wire and asymmetric input delays, as well
as gate delays, as the tree is built. The tool is used to
design multipliers based upon various algorithms, using both
Booth encoded, non-Booth encoded and the new extended Booth
algorithms. The designs are then compared on the basis of
delay, power, and area.
For maximum speed, the designs are based upon a 0.6mu BiCMOS
process using emitter coupled logic (ECL). The algorithms
developed in this thesis make possible 53x53 multipliers with
a latency of less than 2.6 nanoseconds @ 10.5 Watts and a
layout area of 13 mm@+[2]. Smaller and lower power designs
are also possible, as illustrated by an example with a
latency of 3.6 nanoseconds @ 5.8 W, and an area of 8.9
mm@+[2]. The conclusions based upon ECL designs are extended
where possible to other technologies (CMOS).
Crucial to the performance of multipliers are high speed
carry propagate adders. A number of high speed adder designs
have been developed, and the algorithms and design of these
adders are discussed.
The implementations developed for this study indicate that
traditional Booth encoded multipliers are superior in layout
area, power, and delay to non-Booth encoded multipliers.
Redundant Booth encoding further reduces the area and power
requirements. Finally, only half of the total multiplier
delay was found to be due to the summation of the partial
products. The remaining delay was due to wires and carry
propagate adder delays.
http://i.stanford.edu/pub/cstr/reports/csl/tr/94/617/CSL-TR-94-617.pdf