

# User manual

# Introduction

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT). This Intellectual Property (IP) core is designed to offer very fast transform times while maintaining floating-point accuracy at every computational stage. Sundance's core is the fastest and the most efficient available in the FPGA world. Based on a radix-32 architecture, it also saves memory resources compared to other floating-point cores available on the market.

# Features

- This IP core targets the following devices:
  - Xilinx: Virtex-II<sup>TM</sup>, Virtex-II Pro<sup>TM</sup>, Spartan-3<sup>TM</sup> and Virtex-4<sup>TM</sup>
- Forward and inverse complex FFT
- Transform sizes: 2<sup>m</sup> with m = 8 to 20 (256, 512, 1024, ..., 1M points)
- Arithmetic type: floating point
- Data formats
  - IEEE-754
  - 24-bit mantissa, 8-bit exponent, 2's complement
  - 14-bit mantissa, 8-bit exponent, 2's complement
  - Any mantissa and exponent precision upon request
- Direction (forward or inverse) configurable on the fly
- Transform length configurable on the fly
- Fully functional VHDL testbench and related Matlab functions delivered with the FFT/IFFT core for simulation and performance characterization



# **Functional description**

The Discrete Fourier Transform (DFT) of length N ($N=2^{m}$) computes the sampled Fourier transform of a discrete-time sequence of N evenly spaced points.

The forward DFT with N points of a sequence x(n) can be written as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n) e^{\frac{-j2\pi nk}{N}}$$
 with k = 0, 1, ..., N-1

#### **Equation 1: DFT**

The inverse DFT is given by the following equation:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) e^{\frac{j2\pi nk}{N}}$$
 with  $n = 0, 1, ..., N-1$ 

#### **Equation 2: Inverse DFT**
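As an illustration only, Equations 1 and 2 can be evaluated directly in a few lines of Python. This is a reference sketch with O(N²) cost, not a description of how the core computes the transform; the function names `dft` and `idft` are hypothetical.

```python
import cmath

def dft(x):
    """Forward DFT: direct evaluation of Equation 1."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def idft(X):
    """Inverse DFT: direct evaluation of Equation 2, including the 1/N factor."""
    n_points = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / n_points)
            for k in range(n_points)) / n_points
        for n in range(n_points)
    ]
```

Applying `idft` to the output of `dft` returns the original sequence up to rounding error, which is a convenient sanity check when comparing against hardware results.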

# Algorithm

The FFT core computes the DFT with a radix-32 butterfly built from radix-2 butterflies arranged in 5 stages. Processing a transform therefore requires log32(N) passes through the core, rounded up to the next integer (see Table 3). To maintain an optimal signal-to-noise ratio throughout the transform calculation, the FFT core uses a floating-point architecture with an 8-bit exponent for the real and imaginary parts of each complex sample. This FFT core employs the decimation-in-frequency (DIF) method.
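The decimation-in-frequency method built from radix-2 butterflies can be sketched in software as follows. This is a generic textbook radix-2 DIF FFT in double precision, given only to illustrate the butterfly structure; it does not model the core's radix-32 grouping, its pipeline or its floating-point format, and the name `fft_dif` is hypothetical.

```python
import cmath

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    a = list(x)
    n = len(a)
    span = n // 2
    while span >= 1:
        # Each stage applies butterflies on blocks of 2*span points.
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                top = a[start + j]
                bottom = a[start + j + span]
                a[start + j] = top + bottom
                a[start + j + span] = (top - bottom) * w
        span //= 2
    # A DIF FFT leaves its results in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = a[i]
    return out
```

For a length of N = 2^m this sketch runs m radix-2 stages; grouping those stages five at a time is what yields the pass counts listed in Table 3.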

This FFT core is designed for FFT computations of 1k points or more, up to 1M points. Since FPGA memory resources are limited and relatively small, the memory banks used to process the transform are not integrated in the core. External memory, such as QDR SRAM, ZBT RAM, DDR SDRAM or SDRAM, is best suited to transforms larger than 16384 points. For shorter transforms, the memory banks can likely be implemented inside the FPGA, depending on the device used.



# Data format

This core, when used in combination with Sundance's float converter, is compliant with IEEE Standard 754 for binary floating-point arithmetic.

Other data formats available for this core are coded in 2's complement for both the mantissa and the exponent.

The 8-bit exponent ranges from -128 to 127.

The 24-bit mantissa ranges from -8388608 to 8388607.

For implementations that require a different bit width p, values will range from  $-2^{(p-1)}$  to  $2^{(p-1)} - 1$ .

The exponent bit width is denoted Ebw. The mantissa bit width is denoted Mbw.
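The range formula for a p-bit 2's complement field can be checked with a short sketch. The function name `twos_complement_range` is hypothetical and is used here for illustration only.

```python
def twos_complement_range(p):
    """Return (min, max) of a p-bit 2's complement field: -2^(p-1) to 2^(p-1)-1."""
    if p < 1:
        raise ValueError("bit width must be at least 1")
    return -(1 << (p - 1)), (1 << (p - 1)) - 1

# Field widths quoted in this manual.
for name, width in (("8-bit exponent", 8), ("24-bit mantissa", 24), ("14-bit mantissa", 14)):
    low, high = twos_complement_range(width)
    print(f"{name}: {low} to {high}")
```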

# **Parameters and Ports definitions**

| Parameter name | type    | Value                     | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|----------------|---------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| addr_width | integer | $\geq 8$ and<br>$\leq 20$ | Address width. This parameter (also denoted Abw) sets the width of the address buses for twiddle factors and data. If N is the maximum transform length used for computing the FFT, then Abw=log2(N). The transform length can be changed on the fly by assigning a new FFT length when restarting the core, but the new transform length cannot be larger than $2^{Abw}$. Assigning the smallest possible address width is recommended to achieve higher clock frequencies after synthesis. |

Table 1 : Parameters definition

| Port name | Port width | Direction | Description |
|-----------|------------|-----------|-------------|
| clk | 1 | Input | Clock |
| reset | 1 | Input | Asynchronous reset (active high) |
| cke | 1 | Input | Clock enable (active high). When low, the clock inside the core is disabled. If forced low, the cke signal must remain low for at least 4 clock cycles to ensure proper operation of the core. |
| start | 1 | Input | FFT start signal (active high). Start is asserted for one clock cycle to start the core and the address generators. It is only asserted once for continuous data processing (the core will restart automatically every time a transform is complete). A new start pulse will act as a synchronous reset, restart the core and discard the transform that was being computed. |
| stop | 1 | Input | FFT stop signal (active high). Stop is asserted for one clock cycle to indicate that the transform currently being computed is the last one. The core will not automatically restart a new transform after a stop pulse is received. |
| done | 1 | Output | FFT done signal (active high). A done pulse indicates that the results of the current transform are ready. The done pulse is active on the first active cycle of the result_valid signal. |
| FFTlength | 5 | Input | FFT transform length. Please refer to the Transform length section of this document for more details. |
| FFT_nIFFT | 1 | Input | FFT direction. High $\Leftrightarrow$ FFT, Low $\Leftrightarrow$ IFFT. This signal is registered inside the core on a start pulse. |
| empty_pipeline | 1 | Input | Empty the core pipeline before processing the next FFT/IFFT pass.<br>If High, this signal will force the core to wait for all the data of an FFT/IFFT pass to be output before the next pass can be started. This is useful in a configuration where the processing [...]<br>If Low, the FFT core will start reading the data from the [...] the previous pass.<br>This signal is registered inside the core on a start pulse. |
| tw_din_addr_valid | 1 | Output | Address valid strobe. This signal indicates that the current addresses on tw_addr and din_addr are valid. |
| tw_addr | Abw | Output | Twiddle factors address bus. This bus gives the address in the memory where the twiddle factors must be read from. |
| din_addr | Abw | Output | Data input address bus. This bus gives the address in the memory where the input data must be read from. |
| din_bank | 1 | Output | Data input memory bank. This signal indicates which data memory bank is used as the input bank. |
| tw | 2×Mbw+2×Ebw, or 32 for IEEE-754 | Input | Twiddle factors input. This bus should be connected to the memory containing the twiddle factors. The data decomposition is as follows.<br>Real mantissa: bits Mbw-1 down to 0<br>Imag mantissa: bits 2×Mbw-1 down to Mbw<br>Real exponent: bits 2×Mbw+Ebw-1 down to 2×Mbw<br>Imag exponent: bits 2×Mbw+2×Ebw-1 down to 2×Mbw+Ebw |
| din | 2×Mbw+2×Ebw, or 32 for IEEE-754 | Input | Data input. This bus should be connected to the input data bank currently used for processing. The data decomposition is as follows.<br>Real mantissa: bits Mbw-1 down to 0<br>Imag mantissa: bits 2×Mbw-1 down to Mbw<br>Real exponent: bits 2×Mbw+Ebw-1 down to 2×Mbw<br>Imag exponent: bits 2×Mbw+2×Ebw-1 down to 2×Mbw+Ebw |
| tw_din_valid | 1 | Input | Twiddle factors and data input valid. This signal should be asserted high when the data input (din) and twiddle factors (tw) are valid. |
| dout_addr_valid | 1 | Output | Data output address valid strobe. This signal indicates that the current address on the dout_addr bus is valid. |
| dout_addr | Abw | Output | Data output and results address. This bus gives the address in the memory where the output data (dout) must be written to. |
| dout_bank | 1 | Output | Data output memory bank. This signal indicates which data memory bank is used as the output bank. |
| dout | 2×Mbw+2×Ebw, or 32 for IEEE-754 | Output | Data output. This bus should be connected to the output data bank currently used for processing. The data decomposition is as follows.<br>Real mantissa: bits Mbw-1 down to 0<br>Imag mantissa: bits 2×Mbw-1 down to Mbw<br>Real exponent: bits 2×Mbw+Ebw-1 down to 2×Mbw<br>Imag exponent: bits 2×Mbw+2×Ebw-1 down to 2×Mbw+Ebw |
| dout_valid | 1 | Output | Data out valid strobe. This signal indicates that the data on the dout bus are valid and can be written to a memory bank for further processing. |
| result_valid | 1 | Output | Result valid strobe. This signal indicates that the data on the dout bus are the final results of the transform and must be written to the results memory bank. |

 Table 2 : Ports definition
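The bit layout of the tw, din and dout buses for the 2's complement formats can be illustrated with a small packing sketch. It assumes the layout given in the ports table (real mantissa in the lowest bits, then imaginary mantissa, real exponent and imaginary exponent) and defaults to Mbw = 24 and Ebw = 8. The helper names are hypothetical, and the IEEE-754 variant of the bus is not covered.

```python
def to_field(value, width):
    """Encode a signed integer as a width-bit 2's complement field."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {width} bits")
    return value & ((1 << width) - 1)

def from_field(raw, width):
    """Decode a width-bit 2's complement field to a signed integer."""
    return raw - (1 << width) if raw & (1 << (width - 1)) else raw

def pack_sample(re_m, im_m, re_e, im_e, mbw=24, ebw=8):
    """Build one bus word from the four signed fields of a complex sample."""
    return (to_field(re_m, mbw)
            | (to_field(im_m, mbw) << mbw)
            | (to_field(re_e, ebw) << (2 * mbw))
            | (to_field(im_e, ebw) << (2 * mbw + ebw)))

def unpack_sample(word, mbw=24, ebw=8):
    """Split a bus word into (re_mantissa, im_mantissa, re_exponent, im_exponent)."""
    m_mask, e_mask = (1 << mbw) - 1, (1 << ebw) - 1
    return (from_field(word & m_mask, mbw),
            from_field((word >> mbw) & m_mask, mbw),
            from_field((word >> (2 * mbw)) & e_mask, ebw),
            from_field((word >> (2 * mbw + ebw)) & e_mask, ebw))
```

With the default widths, `pack_sample(0x400000, 0, -22, 0)` gives the 64-bit word `0x00EA000000400000`, and `unpack_sample` recovers the four signed fields from it.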

### Transform length

The FFT transform length is a parameter fed to the core. This parameter can be either constant or can be changed on the fly in order to perform an FFT or Inverse FFT with a different transform length.

The FFT length parameter and the FFT direction (FFT\_nIFFT) are registered when a start pulse is sent to the core. If the FFT transform length is a constant parameter passed to the core, it is recommended to match the address bit width (addr\_width) to the length N of the transform: addr\_width=log2(N). This yields the best synthesis results and the highest clock frequency for the implementation. In any other case, 2<sup>addr\_width</sup> must be greater than or equal to the longest transform length N.

The following table shows the FFTlength code for a given transform length:

| Transform length | FFTlength<br>code | Number of passes<br>through the core |
|------------------|-------------------|--------------------------------------|
| 256              | 00010             | 2                                    |
| 512              | 00011             | 2                                    |
| 1024             | 00100             | 2                                    |
| 2048             | 00101             | 3                                    |
| 4096             | 00110             | 3                                    |
| 8192             | 00111             | 3                                    |
| 16384            | 01000             | 3                                    |
| 32768            | 01001             | 3                                    |
| 65536            | 01010             | 4                                    |
| 131072           | 01011             | 4                                    |
| 262144           | 01100             | 4                                    |
| 524288           | 01101             | 4                                    |
| 1048576          | 01110             | 4                                    |

Table 3 : FFTlength codes
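The entries of Table 3 follow a regular pattern that can be reproduced in a short sketch: every listed code equals log2(N) - 6 written on 5 bits, and every pass count equals log2(N)/5 rounded up. These formulas are inferred from the table rather than stated in the manual, and the function names are hypothetical.

```python
def _log2_exact(n):
    """Return m such that n == 2**m, for the lengths supported by the core."""
    m = n.bit_length() - 1
    if n < 1 or n != 1 << m or not 8 <= m <= 20:
        raise ValueError("transform length must be 2**m with m = 8 to 20")
    return m

def fftlength_code(n):
    """5-bit FFTlength code as a binary string (pattern observed in Table 3)."""
    return format(_log2_exact(n) - 6, "05b")

def number_of_passes(n):
    """Passes through the core: ceil(log2(N) / 5), as observed in Table 3."""
    return -(-_log2_exact(n) // 5)

for length in (256, 1024, 2048, 65536, 1048576):
    print(length, fftlength_code(length), number_of_passes(length))
```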



# **Twiddle factors**

The twiddle factors used during the transform computation must be stored in a memory accessible by the FFT core. The twiddle factors for a forward FFT of length N are given by the following equation:

$$Tw(k) = e^{\frac{-j2\pi k}{N}}$$
 with k = 0, 1, ..., N-1

#### **Equation 3: Twiddle factors DFT**

The inverse FFT twiddle factors can be calculated as follows.

$$Tw(k) = e^{\frac{j2\pi k}{N}}$$
 with k = 0, 1, ..., N-1

#### **Equation 4: Twiddle factors IDFT**
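Equations 3 and 4 can be evaluated numerically as sketched below. The values are plain double-precision complex numbers; converting them to the mantissa/exponent format expected by the core is not shown, and the function name `twiddle_factors` is hypothetical.

```python
import cmath

def twiddle_factors(n, inverse=False):
    """Twiddle factors Tw(k), k = 0..N-1, per Equation 3 (forward) or 4 (inverse)."""
    sign = 1 if inverse else -1
    return [cmath.exp(sign * 2j * cmath.pi * k / n) for k in range(n)]

forward = twiddle_factors(8)
inverse = twiddle_factors(8, inverse=True)
# The inverse factors are the complex conjugates of the forward ones.
print(forward[1], inverse[1])
```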

The FFT core package includes a Matlab program (FFT\_test.m) and subroutines that generate the twiddle factors and write them to a file (fftcore\_twiddle) in the required floating-point format.

### Memory

The memory banks are external to the FFT core. Two banks are dedicated to data processing. The signals din\_bank and dout\_bank indicate which bank is used for input and which bank is used for output. The banks are swapped at every new pass because the FFT core needs to access the data calculated during the previous pass.

### Minimal memory usage architecture

The block diagram below shows a configuration that uses as few memory banks as possible. Please note that a system using dual port memory or QDR SRAM will only require one data bank.



Figure 1 : Minimum memory usage architecture





The output data bank is either A or B, depending on the number of passes through the core. Table 3 gives the number of passes as a function of the transform length. If the number of passes is odd for a given transform length, the FFT results will be in data bank B. If it is even, the results will be in data bank A.
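The odd/even rule can be written as a small helper. The sketch assumes the pass count equals log2(N)/5 rounded up, which matches every row of Table 3 but is not stated as a formula in the manual; the function name `result_bank` is hypothetical.

```python
def result_bank(n):
    """Bank holding the final results in the minimal memory architecture."""
    m = n.bit_length() - 1
    if n < 1 or n != 1 << m or not 8 <= m <= 20:
        raise ValueError("transform length must be 2**m with m = 8 to 20")
    passes = -(-m // 5)  # ceil(m / 5), assumed from Table 3
    # Odd number of passes: results in bank B. Even: results in bank A.
    return "B" if passes % 2 else "A"

for length in (1024, 4096, 65536):
    print(length, result_bank(length))
```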

### Streaming IO architecture

A streaming IO architecture is presented below for continuous data processing. Please note that a system using Dual Port Memory or QDR SRAM will only require two data banks.



Figure 2 : Streaming IO memory architecture

Streaming IO processing with concurrent data input and data output requires 5 memory banks to be connected to the FFT core. In this type of architecture, the maximum continuous throughput depends on the number of passes through the FFT engine and on the clock rate it is running at. The table below shows how the memory banks are used when performing several transforms in a row.

| Bank | FFT 1, pass 1 | FFT 1, pass 2 | FFT 1, pass 3 | FFT 2, pass 1 | FFT 2, pass 2 | FFT 2, pass 3 | FFT 3, pass 1 | FFT 3, pass 2 | FFT 3, pass 3 |
|------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| Data A | Write input data for FFT 2 | Write input data for FFT 2 | Write input data for FFT 2 | FFT read | FFT write | FFT read | FFT write | FFT read | FFT write |
| Data B | FFT read | FFT write | FFT read | FFT write | FFT read | FFT write | Read output results of FFT 2 | Read output results of FFT 2 | Read output results of FFT 2 |
| Data C | FFT write | FFT read | FFT write | Read output results of FFT 1 | Read output results of FFT 1 | Read output results of FFT 1 | Write input data for FFT 4 | Write input data for FFT 4 | Write input data for FFT 4 |
| Data D | Read output results of FFT 0 | Read output results of FFT 0 | Read output results of FFT 0 | Write input data for FFT 3 | Write input data for FFT 3 | Write input data for FFT 3 | FFT read | FFT write | FFT read |
| Twiddles | read | read | read | read | read | read | read | read | read |

 Table 4 : Memory banks for a streaming IO architecture



### Memory latency

The FFT core generates the addresses for the twiddle factors, the data input and the data output. The memory latency is the number of clock cycles between the moment an address is valid on the core address bus and the moment the corresponding twiddle factors or data are available at the input of the FFT core. This latency can be up to 15 clock cycles. The FFT core expects the latency to be the same for the twiddle factors and the data input, and to remain constant during the transform computation. The latency is calculated automatically inside the FFT core by monitoring the tw\_din\_valid signal (driven high by the user a few clock cycles after tw\_din\_addr\_valid goes high).
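The latency measurement described above can be modelled on sampled signal traces, for example when checking a memory controller in simulation. This is a behavioural sketch only: it assumes one list entry per clock cycle and counts the cycles between the first high cycle of tw\_din\_addr\_valid and the first high cycle of tw\_din\_valid. The core's exact counting convention is not given in the manual, and the function name is hypothetical.

```python
MAX_LATENCY = 15  # upper bound quoted in this manual

def measure_latency(addr_valid, data_valid):
    """Cycles between the first valid address and the first valid data word."""
    first_addr = addr_valid.index(1)
    first_data = data_valid.index(1, first_addr)
    latency = first_data - first_addr
    if latency > MAX_LATENCY:
        raise ValueError(f"latency of {latency} cycles exceeds {MAX_LATENCY}")
    return latency

# One entry per clock cycle, 1 = signal high.
tw_din_addr_valid = [0, 0, 1, 1, 1, 1, 1, 1]
tw_din_valid      = [0, 0, 0, 0, 0, 1, 1, 1]
print(measure_latency(tw_din_addr_valid, tw_din_valid))
```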

### Radix-32 vs Radix-2

Sundance's radix-32 butterfly architecture allows the core to be connected to much less memory, for the same processing performance, than designs with radix-2 butterflies implemented in parallel. The following table shows how much memory is required to perform an FFT in both configurations.

| FFT length | radix-32 memory required<br>(in Mbytes) | radix-2 memory required<br>(in Mbytes) |
|------------|-----------------------------------------|----------------------------------------|
| 256        | 0.02                                    | 0.08                                   |
| 512        | 0.04                                    | 0.18                                   |
| 1024       | 0.08                                    | 0.39                                   |
| 2048       | 0.23                                    | 0.86                                   |
| 4096       | 0.47                                    | 1.88                                   |
| 8192       | 0.94                                    | 4.06                                   |
| 16384      | 1.88                                    | 8.75                                   |
| 32768      | 3.75                                    | 18.75                                  |
| 65536      | 10.00                                   | 40.00                                  |
| 131072     | 20.00                                   | 85.00                                  |
| 262144     | 40.00                                   | 180.00                                 |
| 524288     | 80.00                                   | 380.00                                 |
| 1048576    | 160.00                                  | 800.00                                 |

#### Table 5: Radix-32 vs Radix-2 memory usage

Data throughput=maximum data throughput as shown in Table 7

Using a radix-32 architecture substantially reduces the memory resources required. The main benefit is seen at the system level. A single-width PMC module performing long transforms with Sundance's FFT core achieves the same level of processing performance as a radix-2 implementation spread over two 6U CompactPCI boards bundled with multiple FPGAs and memory devices.



# **Resource usage and performance**

The following table summarizes the resource usage and performance of a 24-bit mantissa, 8-bit exponent floating-point FFT/IFFT core.

| Device                                | Slices | Multipliers 18x18 | Block RAMs<br>18Kb | Fmax      |
|---------------------------------------|--------|-------------------|--------------------|-----------|
| Virtex-4<br>XC4VLX40 -12              | 12394  | 40                | 36                 | 200.2 MHz |
| Virtex-II Pro<br>XC2VP40 -7           | 12293  | 40                | 36                 | 175 MHz   |
| Spartan-3<br>XC3S4000 -5              | 12835  | 40                | 36                 | 105.3 MHz |

Table 6 : Core resources usage

The FFT/IFFT processing time with an FPGA internal clock running at 200MHz is shown in the table below.

| FFT/IFFT transform size | Processing time | Sustained throughput<br>in MSPS |
|-------------------------|-----------------|---------------------------------|
| 256                     | 3.68µs          | 69.6                            |
| 512                     | 6.24µs          | 82.1                            |
| 1024                    | 11.4µs          | 90.1                            |
| 2048                    | 31.8µs          | 64.3                            |
| 4096                    | 61.4µs          | 66.7                            |
| 8192                    | 123µs           | 66.7                            |
| 16384                   | 246µs           | 66.7                            |
| 32768                   | 492µs           | 66.7                            |
| 65536                   | 1.31ms          | 50.0                            |
| 131072                  | 2.62ms          | 50.0                            |
| 262144                  | 5.24ms          | 50.0                            |
| 524288                  | 10.5ms          | 50.0                            |
| 1048576                 | 21ms            | 50.0                            |

 Table 7: Core performances
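The sustained throughput column of Table 7 is consistent with the transform size divided by the processing time. The sketch below reproduces that relation; it is inferred from the table values rather than stated in the manual, and because the listed times are rounded the results only match the table approximately. The function name is hypothetical.

```python
def sustained_throughput_msps(n_points, processing_time_s):
    """Throughput in mega-samples per second: N / processing time."""
    return n_points / processing_time_s / 1e6

# Two rows of Table 7 (processing times at a 200 MHz clock).
print(round(sustained_throughput_msps(256, 3.68e-6), 1))
print(round(sustained_throughput_msps(65536, 1.31e-3), 1))
```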



The following graph displays the signal-to-noise ratio (SNR) of an FFT performed on a 1024-point random vector with a 24-bit mantissa and an 8-bit exponent. The reference software DFT was calculated using the FFTw functions with float accuracy (http://www.fftw.org/).



Figure 3: FFT SNR



# **Testbench and Matlab programs**

The FFT core package comprises a VHDL testbench, three Matlab programs and a C program implementing the FFTw functions.

**fftcore\_TB.vhd**: This testbench is designed to work with the FFT core. It reads the twiddle factors from a file ('fftcore\_twiddle.txt') and stores them in the twiddle factors memory bank connected to the core. The input data are also read from a file ('fftcore\_data\_in.txt') and stored in a memory bank that will be accessed by the core once started. Upon the transform completion, the results, available in one of the processing memory banks, are written to a file ('fftcore\_results.txt').

**FFT\_test.m** : This Matlab program generates data and twiddle factors in the floating point format expected by the core (see Data format). The data to be input to the FFT core and the twiddle factors are saved in a text format respectively in the 'fftcore\_data\_in.txt' and 'fftcore\_twiddle.txt' files.

**Analyse\_FFT\_results\_Matlab.m** : This Matlab program reads the output result file ('fftcore\_results.txt') from the FFT core, calculates the expected results with the fft Matlab function and returns the Signal-to-Noise Ratio. The data used for the transform calculation by the Matlab fft function come from the FFT\_test.m program.

**Analyse\_FFT\_results\_FFTw.m** : This Matlab program reads the output result file from the FFT core, reads the FFT results from the FFTw results file and returns the Signal-to-Noise Ratio.
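Both analysis programs return a signal-to-noise ratio, but this manual does not give the formula they use. A common definition, shown here purely as an assumption, is the ratio of the reference signal energy to the error energy, expressed in dB. The function name `snr_db` is hypothetical and the sketch is written in Python rather than Matlab.

```python
import math

def snr_db(reference, measured):
    """SNR in dB: 10*log10(sum |ref|^2 / sum |ref - measured|^2) (assumed definition)."""
    if len(reference) != len(measured):
        raise ValueError("both sequences must have the same length")
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    if noise == 0:
        return math.inf  # identical sequences
    return 10 * math.log10(signal / noise)

reference = [1 + 0j, 0 + 1j]
core_output = [1.1 + 0j, 0 + 1j]
print(snr_db(reference, core_output))
```

Here `reference` would hold the results of the software FFT and `core_output` the values read from the core's results file, both converted to complex numbers.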

**UseFFTw** : This directory contains the source files and executables of the UseFFTw program, which reads the data input for the FFT core ('fftcore\_data\_in.txt') and calculates the FFT results using the FFTw functions (http://www.fftw.org).

Three parameters are expected when executing the program:

- FFT length: 256, 512, ..., or 1048576
- Data input file name: *fftcore\_data\_in*
- Data output file name: *fftw\_results*



The data input file is coded in integer format for the mantissa/exponent and is organized as follows:

- Line 1: Mantissa Real0
- Line 2: Mantissa Imag0
- Line 3: Exponent Real0
- Line 4: Exponent Imag0
- Line 5: Mantissa Real1
- ...

The data output file is coded in float and is organized as follows:

- Line 1: Real0
- Line 2: Imag0
- Line 3: Real1
- Line 4: Imag1
- Line 5: Real2
- ...
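A reader for the data input file layout could look like the sketch below. It assumes one signed decimal integer per line and interprets each component as mantissa × 2^exponent; neither detail is spelled out in this manual, so both are assumptions, and the function names are hypothetical.

```python
import math

def read_core_samples(lines):
    """Group lines into (re_mantissa, im_mantissa, re_exponent, im_exponent) tuples."""
    values = [int(line) for line in lines if line.strip()]
    if len(values) % 4:
        raise ValueError("expected four lines per complex sample")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]

def sample_to_complex(sample):
    """Convert one sample to a complex number, assuming value = mantissa * 2**exponent."""
    re_m, im_m, re_e, im_e = sample
    return complex(math.ldexp(re_m, re_e), math.ldexp(im_m, im_e))

# In practice the lines would come from open("fftcore_data_in.txt").
example_lines = ["4194304", "0", "-22", "0"]
samples = read_core_samples(example_lines)
print([sample_to_complex(s) for s in samples])
```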

The UseFFTw program can be modified and recompiled by users with Microsoft Visual C++.



# Waveforms

### Start

Timing diagram of the start sequence. Signals shown: clk, reset, start, stop, done, FFTlength, FFT\_nIFFT, tw\_din\_addr\_valid, tw\_addr, din\_addr, tw, din, tw\_din\_valid, dout\_addr, dout\_bank, dout, dout\_valid and result\_valid.

#### Figure 4: Start

Figure 4 shows how the FFT core must be started. The start signal is driven high for one clock cycle. The first address for the data and twiddles is generated 7 clock cycles later. The user then fetches the twiddles and data from memory and drives the tw\_din\_valid signal high. A new data sample and a new twiddle are then expected on every clock cycle.
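The user-side fetch described above can be sketched as a small cycle-based Python model. This is only an illustration: the function name `memory_responder`, the one-cycle memory read latency and the address values are assumptions that do not come from this manual. Only the behaviour "fetch on a valid address, then drive tw\_din\_valid high" is taken from the text.

```python
from collections import deque


def memory_responder(requests, tw_mem, din_mem, read_latency=1):
    """Yield (tw_din_valid, tw, din) once per clock cycle.

    requests: iterable of (tw_din_addr_valid, tw_addr, din_addr), one per cycle.
    read_latency: ASSUMED memory read latency in clock cycles.
    """
    idle = (0, None, None)
    # Each pipeline slot models one cycle of memory read latency.
    pipeline = deque([idle] * read_latency)
    for addr_valid, tw_addr, din_addr in requests:
        if addr_valid:
            pipeline.append((1, tw_mem[tw_addr], din_mem[din_addr]))
        else:
            pipeline.append(idle)
        yield pipeline.popleft()


# Hypothetical stimulus: 7 idle cycles after start, then three valid addresses.
tw_mem = {0: "tw0", 1: "tw1", 2: "tw2"}
din_mem = {0: "x0", 1: "x1", 2: "x2"}
requests = [(0, 0, 0)] * 7 + [(1, a, a) for a in range(3)] + [(0, 0, 0)]
responses = list(memory_responder(requests, tw_mem, din_mem))
```

In this model tw\_din\_valid rises one cycle after the first valid address and then stays high while a new address is presented on every cycle.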



### Data input memory bank swap

[Waveform: clk, reset, start, stop, done, FFTlength, FFT\_nIFFT, tw\_din\_addr\_valid, tw\_addr, din\_addr, din\_bank, tw, din, tw\_din\_valid, dout\_addr\_valid, dout\_addr, dout\_bank, dout, dout\_valid, result\_valid]

#### Figure 5 : Memory bank swap

When the core starts a new pass, it takes the results of the previous pass as its input data. A pass transition is indicated by an inversion of the din\_bank signal. This signal can be used to multiplex the memory banks connected to the core during processing.
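The bank multiplexing can be illustrated with a minimal Python sketch of two ping-pong banks. The helper `run_passes`, the polarity of din\_bank (0 selects the first bank for reading) and the use of identical read and write addresses are simplifying assumptions. In the real core the write address comes from dout\_addr, and each pass is a butterfly stage rather than the toy `process` function used here.

```python
def run_passes(data, num_passes, process):
    """Toy model of two memory banks swapped on every pass transition."""
    banks = [list(data), [None] * len(data)]
    din_bank = 0  # ASSUMED polarity: din_bank indexes the bank being read
    for _ in range(num_passes):
        src = banks[din_bank]      # input data of the current pass
        dst = banks[1 - din_bank]  # results of the current pass
        for addr, value in enumerate(src):
            dst[addr] = process(value)
        din_bank ^= 1  # pass transition: din_bank is inverted
    # After the last inversion, din_bank points at the bank holding the results.
    return banks[din_bank]


# Three passes, each adding 1, stand in for the real butterfly passes.
result = run_passes([0, 1, 2], 3, lambda v: v + 1)
```

Each pass reads from the bank written by the previous one, so no data has to be copied between passes.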





### Continuous processing between two consecutive passes

#### Figure 6 : empty\_pipeline low

When a pass transition occurs, the din\_bank and dout\_bank signals are inverted. However, due to the core latency, the dout\_bank signal is inverted after the din\_bank signal, once all the data for the previous pass have been processed through the core. Forcing the empty\_pipeline signal low when starting the core allows data to be processed continuously, without pausing between two consecutive passes. As a result, the core needs to access the same memory bank for read and write operations simultaneously. Therefore, if this mode is used, the processing memory banks connected to the core must be dual-port.
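The overlap that makes dual-port memory necessary can be shown with a short Python sketch. The pass length, the latency value and the mapping of din\_bank and dout\_bank onto physical banks are all assumptions chosen for illustration. Only the fact that dout\_bank is inverted later than din\_bank is taken from the text.

```python
def same_bank_cycles(pass_len, latency, num_cycles):
    """Return the cycles in which the read and write sides use the same bank."""
    conflicts = []
    for cycle in range(num_cycles):
        # din_bank is inverted at every pass boundary.
        din_bank = (cycle // pass_len) % 2
        # dout_bank follows din_bank, delayed by the (hypothetical) core latency.
        dout_bank = (max(cycle - latency, 0) // pass_len) % 2
        read_bank = din_bank        # ASSUMED mapping of signals to banks
        write_bank = 1 - dout_bank  # ASSUMED: equal signal values mean opposite banks
        if read_bank == write_bank:
            conflicts.append(cycle)
    return conflicts


# Hypothetical numbers: 8 samples per pass, 3 cycles of core latency.
overlap = same_bank_cycles(pass_len=8, latency=3, num_cycles=16)
```

With these numbers the read and write sides share a bank during cycles 8 to 10, the window in which the last results of the first pass are still being written while the second pass is already being read.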



### Halted processing between two consecutive passes

[Waveform: clk, reset, start, stop, done, FFTlength, FFT\_nIFFT, tw\_din\_addr\_valid, empty\_pipeline, tw\_addr, din\_addr, din\_bank, tw, din, tw\_din\_valid, dout\_addr\_valid, dout\_addr, dout\_bank, dout, dout\_valid, result\_valid]

#### Figure 7 : empty\_pipeline high

When the empty\_pipeline signal is driven high, the core pauses the processing between two consecutive passes in order to empty the data pipeline. As shown in the waveform above, a new pass is started only when all the data from the previous pass have been processed through the core and written to memory. This mode should be used when the data processing memory banks are single-port.



### Results

[Waveform: clk, reset, start, stop, done, FFTlength, FFT\_nIFFT, tw\_din\_addr\_valid, tw\_addr, din\_addr, din\_bank, tw, din, tw\_din\_valid, dout\_addr\_valid, dout\_addr, dout\_bank, dout, dout\_valid, result\_valid]

#### Figure 8 : Results

When the last pass of the algorithm is processed, the data coming out of the core are the results of the transform. These results are output in a non-sequential order and must be written to memory at the addresses given on the dout\_addr bus. The transform results are then stored in memory in bit-reversed order.
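Reading the results back in natural order can be sketched in Python as follows. The sketch assumes a plain bit reversal over m = log2(N) address bits, so that X(k) is found at the address obtained by reversing the bits of k. The function names are illustrative, and the exact address mapping should be checked against the values seen on dout\_addr.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    reversed_index = 0
    for _ in range(num_bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def to_natural_order(memory, num_bits):
    """Return the results X(0), X(1), ... from a bit-reversed memory image."""
    return [memory[bit_reverse(k, num_bits)] for k in range(1 << num_bits)]


# 8-point example: address a holds X(bit_reverse(a)), shown here as the index k.
memory = [0, 4, 2, 6, 1, 5, 3, 7]
natural = to_natural_order(memory, 3)
```

Since bit reversal is its own inverse, the same function maps a natural index to a memory address and a memory address back to its natural index.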



### Done

[Waveform: clk, reset, start, stop, done, FFTlength, FFT\_nIFFT, tw\_din\_addr\_valid, tw\_addr, din\_addr, din\_bank, tw, din, tw\_din\_valid, dout\_addr\_valid, dout\_addr, dout\_bank, dout, dout\_valid, result\_valid]

#### Figure 9 : Done

After the last result data has been output from the core, the done signal goes high for one clock cycle to indicate the completion of the transform. A new transform is then processed through the core.