# Implementation of a FFT Radix 2 Butterfly Using Serial RSFQ Multiplier-Adders

Oleg A. Mukhanov

Hypres, Inc., 175 Clearbrook Road, Elmsford, NY 10523, USA

Alexander F. Kirichenko

Nuclear Physics Institute, Moscow State University, Moscow, GSP 119899, Russia

Abstract—We have designed a Decimation-in-Time (DIT) Radix 2 Butterfly integrated circuit. This circuit will be used to implement the 32-point Fast Fourier Transform (FFT) in a parallel data flow architecture. The radix 2 butterfly circuit uses serial RSFQ math and consists of four single bit-wide serial multipliers and eight carry-save serial adders. The circuit with 16-bit word-length employs only 3400 junctions, occupies an area of 3.8 x 2.0 mm², and dissipates less than 1.1 mW power. The multiplier is implemented using the unique RSFQ bit-clock-pipelined schema. We have successfully tested a library of serial multiply-add elements: the 8-bit multiplier at 6.3 GHz and adders with dc bias margin ±20%. Finally, we have demonstrated full operation of the radix 2 butterfly chip with 5-bit word length.

## I. INTRODUCTION

An area where superconductive digital technology presents an advantage over its semiconductor counterpart is the implementation of computation-intensive digital processors. These processors are employed in applications such as all-digital RF memories, radars, digital HDTV, and computed tomography. In these applications, 1) parallelism is difficult to realize because the number of parallel processors would be large; 2) processing throughput requirements are very high.

The Fast Fourier Transform (FFT) is one of these processing-intensive operations [1]. Superconductive single-flux-quantum technology, especially Rapid Single Flux Quantum (RSFQ) [2] digital technology, possesses a number of features that make RSFQ-based FFT designs extremely attractive even at the present maturity level of superconductive circuit processing [3]. The high throughput capability enables the use of a single bit-wide serial processing architecture for many complex arithmetic/logic functions. The internal memory of the RSFQ gates allows the implementation of pipelined arithmetic modules using fewer gates. Both features result in a significant reduction of the circuit complexity. RSFQ circuits do not require a high-power clock since an SFQ clock is generated on-chip. This simplifies high-speed clock delivery and distribution, and makes high-speed operation of an entire system viable. Finally, extremely low power dissipation enables high-density, compact packaging at the chip level. In this paper, we show how these features are being realized in the first RSFQ radix 2 butterfly circuit, the key component for the FFT implementation.

Manuscript received 19 September, 1994.

This work was supported in part by the U.S. Air Force under Grant No. F33615-93-C-1232.

## II. DESIGN AND LAYOUT

### A. Serial Radix 2 Butterfly

Fig. 1 shows a block diagram of a DIT radix 2 butterfly which requires a complex multiply and two complex additions [1]. The real implementation requires four real multipliers and six real adders.

$$X_r' = X_r + (Y_r W_r - Y_i W_i), \quad X_i' = X_i + (Y_i W_r + Y_r W_i),$$

$$Y_r' = X_r - (Y_r W_r - Y_i W_i), \quad Y_i' = X_i - (Y_i W_r + Y_r W_i).$$
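As an illustration only, the standard DIT radix 2 butterfly ($X' = X + YW$, $Y' = X - YW$) can be sketched in Python with real arithmetic. The function name `butterfly_real` and the sample values are assumptions made for this sketch and do not come from the chip design; the point is to show the four real multiplications and six real additions/subtractions.

```python
def butterfly_real(xr, xi, yr, yi, wr, wi):
    """Radix-2 DIT butterfly written with real arithmetic only (illustrative)."""
    # Complex product Y*W: four real multiplies, one subtract, one add.
    pr = yr * wr - yi * wi
    pi = yi * wr + yr * wi
    # X' = X + YW and Y' = X - YW: four more real adds/subtracts.
    return xr + pr, xi + pi, xr - pr, xi - pi


# Cross-check against Python's built-in complex arithmetic.
x, y, w = complex(3, -2), complex(1, 4), complex(0, -1)
out = butterfly_real(x.real, x.imag, y.real, y.imag, w.real, w.imag)
x_new, y_new = x + y * w, x - y * w
assert out == (x_new.real, x_new.imag, y_new.real, y_new.imag)
```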

The serial approach yields a very compact radix 2 butterfly chip for an N-bit word length: the gate count is 1/N that of the parallel implementation. Multiplication uses N x 1 bit serial multipliers (SM), and addition and subtraction use 1-bit carry-save serial adders (CSSA). For design uniformity, the first-stage adders are duplicated (Fig. 1), which raises the adder count from six to eight.



Fig. 1. Block diagram of the DIT radix 2 butterfly.

Since no high-speed memory is currently available, the simplest (and fastest) parallel FFT architecture is considered. Another advantage of this FFT architecture is that the coefficients are fixed for each multiplier. Thus, an off-line serial loading scheme can be used to load the coefficients. For a 32-point FFT, eighty radix 2 butterfly chips are required.
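The count of eighty follows from the usual radix 2 FFT structure, in which an N-point transform has $\log_2 N$ stages of $N/2$ butterflies each. A minimal sketch, assuming one butterfly chip per butterfly in a fully parallel data flow (the helper name `butterflies_needed` is hypothetical):

```python
def butterflies_needed(n_points):
    """Number of radix-2 butterflies in a fully parallel N-point FFT."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1   # log2(N) for a power of two
    return (n_points // 2) * stages


assert butterflies_needed(32) == 80      # 16 butterflies x 5 stages
```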

### B. Serial Multiply-Add

The logic diagram of a bit-pipelined RSFQ serial multiplier is shown in Fig. 2. It is a modified version of the SM proposed in [4]. The SM consists of identical modules, each comprising three types of RSFQ elementary cells: a latch with non-destructive read-out (NR), a latch (DR), and a carry-save serial adder (CSSA). The NR performs the AND function with latching of the coefficient input. The CSSA performs summation, latches the output data and the carry, and feeds the carry, delayed by one clock cycle, back to its own input. The same CSSA performs the add function in the butterfly. To perform a subtract function, one can use a CSSA with a pre-loaded "1" and an inverter at one of its two inputs.

1051-8223/95\$04.00 © 1995 IEEE

The cells are interconnected with active Josephson transmission lines (JTLs), which provide SFQ pulse amplification and set the necessary delays within the SM stage. Since the SM design is buffered, no clock skew build-up will occur along the SM length. The SM takes 32 clock cycles to form the 32-bit product YW of a 16-bit data word Y and a 16-bit coefficient W. It operates by first serially loading W into the NR registers. This can be done during the last 16 periods of the previous multiplication cycle. Then, the data word Y is applied serially with the least significant bit (LSB) first. In contrast to similar designs for semiconductor logic [5] or superconductive latching logic [6], the SFQ clock pulses are also pipelined, i.e. several SFQ clock pulses can propagate along the SM simultaneously. The designed clock rate (15.4 GHz) is defined by $\tau_3 + \tau_4$ (about 65 ps; see Fig. 2), the sum of the propagation delay between adjacent modules and the switching time of a 1-bit CSSA.
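The arithmetic performed by such a multiplier can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions, not the authors' circuit: the data bit is applied to all modules in the same clock cycle (the real design pipelines data and clock along the modules), and the names `serial_multiply` and `bits_to_int` are hypothetical. It does reproduce the two properties stated above: the data word enters LSB first, and the 2N-bit product emerges over 2N clock cycles.

```python
def serial_multiply(y, w, n):
    """Behavioral model of an n-module bit-serial multiplier.

    Returns the 2n product bits in the order they leave module 0 (LSB first).
    """
    w_bits = [(w >> k) & 1 for k in range(n)]  # coefficient bit held in each module
    sums = [0] * (n + 1)   # sums[k]: latched sum output of module k; sums[n] stays 0
    carries = [0] * n      # carry saved inside each carry-save serial adder
    product_bits = []
    for t in range(2 * n):                     # 2n clock cycles in total
        y_bit = (y >> t) & 1 if t < n else 0   # data word LSB first, then zeros
        new_sums = [0] * (n + 1)
        for k in range(n):
            # AND of the data bit with the stored coefficient bit, plus the
            # one-clock-old sum of the next module and this module's saved carry.
            total = (w_bits[k] & y_bit) + sums[k + 1] + carries[k]
            new_sums[k] = total & 1
            carries[k] = total >> 1
        sums = new_sums
        product_bits.append(sums[0])
    return product_bits


def bits_to_int(bits):
    """Convert an LSB-first bit list to an integer."""
    return sum(bit << k for k, bit in enumerate(bits))


assert bits_to_int(serial_multiply(3, 255, 8)) == 3 * 255
```

With `n = 16` the model returns 32 bits, matching the 32 clock cycles quoted for a 16-bit data word and a 16-bit coefficient.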



Fig. 2. Block diagram of RSFQ serial multiplier.

Fig. 3 shows a schematic of a single-bit SM module. The schematic designs of all module components demonstrate the ability of RSFQ logic to provide quite complex logic functions with a minimum number of Josephson junctions (JJs) by fully exploiting the internal gate memory. The best example is the design of the CSSA proposed in [7] and implemented in [8]. All CSSA functions described above are completed without the physical presence of the otherwise required feedback loop. An entire SM module comprises 48 JJs and dissipates only 13 $\mu\mathrm{W}$ of power. The simulated margins for all circuit parameters, including critical currents, inductances, resistors, and common dc bias, exceed $\pm 32\%$ for the DR/NR part and $\pm 25\%$ for the CSSA.
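The logical behavior of the CSSA described in this section (bit-serial sum, a stored carry applied one clock later, and subtraction through a pre-loaded "1" with one inverted input) can be sketched as follows. The class and helper names are hypothetical, and the model captures only the logic function, not the RSFQ circuit of Fig. 3; the subtraction is ordinary two's complement arithmetic, so the result is modulo $2^n$.

```python
class CarrySaveSerialAdder:
    """One-bit serial adder that keeps its carry for the next clock cycle."""

    def __init__(self, carry=0):
        self.carry = carry            # carry=1 models the pre-loaded "1"

    def clock(self, a_bit, b_bit):
        total = a_bit + b_bit + self.carry
        self.carry = total >> 1       # saved and applied on the next clock
        return total & 1              # sum bit output on this clock


def to_bits(value, n):
    """LSB-first list of the n low bits of value."""
    return [(value >> k) & 1 for k in range(n)]


def from_bits(bits):
    return sum(bit << k for k, bit in enumerate(bits))


def serial_add(a, b, n):
    adder = CarrySaveSerialAdder()
    return from_bits([adder.clock(x, y) for x, y in zip(to_bits(a, n), to_bits(b, n))])


def serial_sub(a, b, n):
    # a - b = a + NOT(b) + 1 in two's complement: invert one input, pre-load carry.
    adder = CarrySaveSerialAdder(carry=1)
    return from_bits([adder.clock(x, 1 - y) for x, y in zip(to_bits(a, n), to_bits(b, n))])


assert serial_add(5, 9, 8) == 14
assert serial_sub(9, 5, 8) == 4
```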



Fig. 3. Schematic of a single-bit module of the serial multiplier. Total number of JJs is 48. Underlined inductors designate a storage.

### C. Layout & Chip Design

The circuits are implemented using HYPRES' standard 10-level Nb process with a minimum JJ size of $3.5 \times 3.5 \ \mu m^2$ [9]. The JJs are externally shunted with resistors to obtain a non-hysteretic IV-curve at a critical current density of 1 kA/cm².

In order to evaluate the performance of the SM components, we use two different on-chip diagnostic systems. The Logic Tester/SFQ Sampler [3] is used to evaluate the NR-cell and CSSA. Related design/experimental details are described in [8]. The shift register-based test system [3] is used to evaluate the 2-, 4-, and 8-bit serial multipliers.

Fig. 4a shows a fragment of the chip layout of the SM with the diagnostic system based on shift registers (SR). The area of the 8-bit SM is 1.44 x 0.38 mm². Input data word Y and coefficient W are pre-loaded into the 9-bit input shift registers. The output product YW is off-loaded into a 32- or 24-bit shift register with an SFQ/dc converter based on either a T flip-flop (TFF) or an RS flip-flop (RSFF). Two identical clock generators provide circuit timing. The SM dc power supply is divided into three groups: the NR/DR registers, the CSSAs, and the middle clock JTL setting delay $\tau_3$ (see Fig. 2).



Fig. 4. Layout of the RSFQ serial multiplier. (a) fragment of 8-bit multiplier with shift register-based diagnostic system. (b) zoom-in of single-bit multiplier module. The module size is  $180 \times 360 \ \mu m^2$ .

Fig. 5 shows the layout of a complete radix 2 butterfly circuit with 16-bit word length occupying an area of $3.8 \times 2.0 \, \mathrm{mm^2}$. The area of the complete circuit is only twice that of a 4-bit serial multiplier (1/16 of the radix 2 butterfly) designed in latching logic from 97 gates [6]. Total power dissipation is $1.1 \, \mathrm{mW}$. Since the bias current for the butterfly chip is quite significant (410 mA), a serial current supply scheme will be advantageous for the multichip FFT implementation. Each chip can be inductively coupled and serially dc biased.



Fig. 5. Chip layout of the 16 bit word radix 2 butterfly. Circuit size 3.8 x 2.0 mm²; chip size 5 x 5 mm².

For the initial butterfly test, no hardware-implemented subtraction is used; negative copies of Xr and Xi are supplied externally. For the first design iteration, instead of a single common dc current bias, we use several separate biases applied to different sections of the circuit (NR/DRs, CSSAs, clock generators/distribution network). Because of the limited number of remaining pads, we use the same wires to apply Xr and -Xi, as well as -Xr and Xi. We apply only positive numbers to the multiplier inputs. We have also designed a smaller (5-bit coefficient-length) radix 2 butterfly chip employing 1,200 JJs. Apart from its shorter (5-bit) serial multipliers, this circuit is identical to that shown in Fig. 5.



Fig. 6. Low-speed operation of the 8-bit serial multiplier within the shift register diagnostic system. Output is read by RSFF sensor (LSB first). Each set of traces shows, from top to bottom, W, Y, WY. Loaded W is 11111111, Ys are 10000000, 11000000, ..., 11111111.

## III. TEST RESULTS AND METHODOLOGY

### A. Functionality Test

Testing is carried out at low frequency (10 - 100 kHz) using conventional pattern generators and oscilloscopes. The NR/DR and CSSA circuits are tested using an on-chip Logic Tester. The measured common NR/DR dc bias margin is $\pm 30\%$, in very good agreement with the simulated margin of $\pm 32\%$. The CSSA operates correctly within $\pm 20\%$ [8].

Correct and full operation of the 2-, 4-, and 8-bit SMs with the shift register test system has been successfully demonstrated. Fig. 6 shows an example of correct operation of an 8-bit SM. The measured margins are $\pm 31\%$ for the NR/DR part, $\pm 9\%$ for the CSSA part, and $\pm 18\%$ for the bias of the middle JTL. The smaller margin of the CSSA part can be explained by the non-optimal adjustment of the delays $\tau_1$ to $\tau_4$.

We have tested the entire radix 2 butterfly with 5-bit word length. An example of correct operation of the circuit is shown in Fig. 7. The measured margins range from $\pm 23\%$ for the NR/DR-part biasing sections to $\pm 1\%$ for the CSSA-part biasing sections. It is worth noting that these results were obtained after the first design iteration of the butterfly chip. A complete evaluation of the 16-bit-word version is currently in progress.



Fig. 7. Full operation of the radix 2 butterfly with 5-bit word length. Outputs are read by RSFF sensors (LSBs first). Inputs are Wi - 11111, Yi - 01110, Wr - 11011, Yr - 10101, Xr=-Xi - 1011000000, Xi=-Xr - 0101000000. Note, no subtractors are used in the circuit.

### B. High-Speed Test

We have tested the 8-bit serial multiplier at high speed using an on-chip shift register-based test system, which operates in three phases. Loading: the clock generators produce nine low-speed (MHz) SFQ clock pulses to load both 9-bit input shift registers (for W and Y). High-speed execution: the high-speed (GHz) clock is enabled by a positive control envelope. The width D of this envelope sets the number $N_h$ of high-speed clock pulses as $N_h = DF_h$, where $F_h$ is the frequency of the high-speed clock. A mechanical delay line is used for fine adjustment of the envelope edges. Off-loading: the clock generator produces 32 low-frequency (MHz) clock pulses to read out the captured data (YW) from the 32-bit output shift register. The output shift register is deliberately longer than the 16-bit YW word to make the choice of the width D less critical.
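To give a feel for the time scales involved, the relation $N_h = DF_h$ can be evaluated numerically. The sketch below is illustrative only: the helper names are hypothetical, and the choice of 16 and 32 pulses as lower and upper bounds (the product length and the output register length) is an assumption about how the slack in D might be estimated, not a procedure stated by the authors.

```python
def envelope_width(n_pulses, f_clock_hz):
    """Envelope width D (seconds) that admits n_pulses at clock frequency F_h."""
    return n_pulses / f_clock_hz


def pulses_in_envelope(width_s, f_clock_hz):
    """Number of high-speed clock pulses N_h = D * F_h, rounded to an integer."""
    return round(width_s * f_clock_hz)


f_h = 6.3e9                                  # 6.3 GHz high-speed clock
d_min = envelope_width(16, f_h)              # about 2.54 ns for 16 pulses
d_max = envelope_width(32, f_h)              # about 5.08 ns for 32 pulses
assert pulses_in_envelope(d_min, f_h) == 16
```

Under these assumptions the envelope must be controlled to within a few nanoseconds, which is consistent with the need for a fine delay adjustment.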



Fig. 8. High-speed operation of the 8-bit serial multiplier with shift register test system at 6.3 GHz and 33 MHz load/off-load clock. Inputs: W - 11111111, Y - 11000000; output YW - 1011111101000000. Output is read by TFF sensor.

The 8-bit SM operates correctly up to 6.3 GHz. Fig. 8 shows an example of proper operation. At higher frequencies, we observe a significant error rate associated with data words Y having a pattern with successive 1's. For simple patterns, we find correct operation up to 14.1 GHz. The errors may be caused in part by a delay mismatch at the interface between the SM and the I/O shift registers.
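The bit patterns quoted for Fig. 8 can be cross-checked with a few lines of Python. The sketch assumes that all words are written LSB first, as stated for the sensor read-out, and that W is the 8-bit all-ones word; the helper names are hypothetical.

```python
def lsb_first_to_int(bits):
    """Interpret a string of bits written LSB first as an unsigned integer."""
    return int(bits[::-1], 2)


def int_to_lsb_first(value, width):
    """Write an unsigned integer as a width-bit string, LSB first."""
    return format(value, f"0{width}b")[::-1]


w = lsb_first_to_int("11111111")     # coefficient W = 255
y = lsb_first_to_int("11000000")     # data word Y = 3
assert int_to_lsb_first(y * w, 16) == "1011111101000000"
```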

## IV. CONCLUSION

We have designed and fabricated a very compact (3.8 x 2.0 mm²) RSFQ serial DIT radix 2 butterfly circuit consisting of four 16-bit multipliers and eight adders using 200 gates of three different types. Total dissipating power is about 1.1 mW at 2.6 mV across the power bus.
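As a rough consistency check, the quoted power can be compared with the product of bias current and bus voltage. This assumes that the full 410 mA bias current quoted in Section II-C flows at the 2.6 mV bus voltage:

$$P = I_b V_b \approx 410 \ \mathrm{mA} \times 2.6 \ \mathrm{mV} \approx 1.07 \ \mathrm{mW},$$

which agrees with the stated value of about 1.1 mW.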

We have successfully demonstrated correct operation of the radix 2 butterfly chip with 5-bit coefficient length. To our knowledge, this is the first demonstration of the superconductive radix 2 butterfly.

Along with the radix 2 butterfly development, we have designed, fabricated, and evaluated at high speed an RSFQ library of serial multiply-add elements with high throughput, suitable for implementation of a variety of DSP processors. To our knowledge, our 6.3 GHz 8-bit multiplier is the fastest multiplier to date.

## ACKNOWLEDGMENT

The authors would like to thank H. Adler, J. C. Lin, S. Polonsky, S. Rylov, E. Stebbins, and E. Track for useful discussions. Circuits were simulated using PSCAN software provided by SUNY@Stony Brook [10].

## REFERENCES

- [1] L. R. Rabiner, B. Gold, "Theory and Application of Digital Signal Processing," Englewood Cliffs, NJ: Prentice-Hall, 1975.
- [2] K. Likharev and V. Semenov, "RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," *IEEE Trans. Appl. Superconductivity*, vol. 1, pp. 3-28, March 1991.
- [3] O. A. Mukhanov, "Superconductive single-flux-quantum technology," in *ISSCC Digest of Technical Papers*, San Francisco, CA, USA, pp. 126-127, 321, February 1994.
- [4] O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii, "RSFQ logic arithmetic," *IEEE Trans. Magn.*, vol. 25, pp. 857-860, March 1989.
- [5] R. F. Lyon, "Two's complement pipeline multipliers," *IEEE Trans. Communications*, pp. 418-425, April 1976.
- [6] A. Moopenn, E. R. Arambula, M. J. Lewis, and H. W. Chan, "Bit-serial multiplier based on Josephson latching logic," *IEEE Trans. Appl. Superconductivity*, vol. 3, pp. 2698-2701, March 1993.
- [7] A. Kidiyarova-Shevchenko, A. F. Kirichenko, S. V. Polonsky, and V. K. Semenov, "New elements of the RSFQ logic/memory family (part 2)," in *Extended Abstracts of ISEC'91*, Glasgow, UK, pp. 200-203, August 1991.
- [8] A. F. Kirichenko and O. A. Mukhanov, "Implementation of novel 'push-forward' RSFQ carry-save serial adders," this conference, report EQC-11, Boston, USA, September 1994.
- [9] HYPRES design rules available from Hypres, Inc., 175 Clearbrook Rd., Elmsford, NY 10523, Attn. John Coughlin.
- [10] S. V. Polonsky, V. K. Semenov, and P. N. Shevchenko, "PSCAN: Personal Superconductor Circuit ANalyzer," *Supercond. Sci. Technol.*, vol. 4, pp. 667-670, November 1991.