Estimating FPGA Requirements for DSP Applications

Introduction

The HERON-FPGA family offers a wide range of FPGAs with I/O interfaces. They can be used for many purposes, ranging from building complex I/O subsystems to performing signal processing and implementing data buffers. This white paper gives rough guidance on which FPGA your application requires.

There are two types of application: I/O and processing. These are dealt with separately below.

I/O Applications

All applications will use some amount of I/O, even if only to transfer data to the HERON-FPGA system. Other applications will be purely I/O, with the FPGA used to implement functions such as:

RS232 interfaces, via a UART
Synchronous serial interfaces (USART)
PWM outputs
Digital control interfaces

For these functions the number of gates used is usually trivial: typically less than 5K gates per function, or in Virtex terminology about 20-30 CLBs. The smallest FPGA in the HERON-FPGA range is a 200K-gate device, which is easily large enough for these functions. However, while gate count is not a significant concern, pin count is!

You also need to choose an FPGA family that supports the type of I/O signals you need to connect. For example, if you need Low Voltage Differential Signalling (LVDS), you must choose the Virtex II, because the original Virtex does not support it. Conversely, if you need to connect 5V TTL, the Virtex II is no use because it is not 5V tolerant: connecting TTL directly to it can destroy the FPGA unless protection resistors are fitted at build time to make it safe.

Specify these systems by considering the number of I/O pins you need, and look at any special buffering requirements. It may well be that you are forced to use a larger FPGA than the gate count requires, simply to get enough I/O pins or the right type of buffering.

One exception exists – where the FPGA is used to build FIFOs or buffer memories. These are common to both I/O and processing nodes so will be discussed separately later.

Processing Data

In data processing applications such as filters, transform/convolution engines and so forth, the FPGA is often used to implement large blocks linked together in a pipeline. This raises two main considerations:

How big is each block?
Where is the data stored?

Neither question can be answered definitively without placing the design in the Xilinx tools. However, the following approximate rules will help. It is possible to design Virtex systems which greatly exceed these figures, but they give a good first estimate of what can be done without hand-crafting the design.

First, we need to make some assumptions about the maximum rate for any processing element. As a rule of thumb:

Precision   Virtex    Virtex II
8-bit       200MHz    270MHz
16-bit      160MHz    240MHz
32-bit      120MHz    210MHz

Now, calculate the number of multiplies in each block of your system. For example, if there is a FIR filter, you may need a multiply for each of the taps. Multiply the number of taps by the sampling rate to give the number of multiplies per second required, then divide this by the multiplier speed from the table above (e.g. 160MHz for 16-bit) to see how many multipliers are required. Always round this number up! A point worth noting here is that the Virtex II architecture has dedicated 18x18 hardware multipliers that run at about 100MHz, in addition to the programmable gates. This means that a multiplier-hungry application will benefit from the Virtex II architecture. The 1M-gate Virtex II part has 40 18x18 multipliers, so you should never need to use the CLBs for multipliers in those parts.
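As a quick illustration of this arithmetic, the short Python sketch below estimates the multiplier count for a FIR filter. The function name and layout are purely illustrative (they are not part of any HERON tool); the 160MHz rate is the 16-bit Virtex figure from the table above.

import math

def multipliers_required(taps, sample_rate_hz, multiplier_rate_hz):
    # One multiply per tap per sample, divided by the rate one
    # multiplier can sustain; always round up.
    multiplies_per_second = taps * sample_rate_hz
    return math.ceil(multiplies_per_second / multiplier_rate_hz)

# Example: 16-tap filter sampled at 80MHz, 16-bit Virtex multipliers at 160MHz
print(multipliers_required(16, 80e6, 160e6))   # -> 8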

For the other blocks, the best estimate is always to use the Core Generator to build a library element and check how big it is. This only takes a few moments, and can give a very accurate view of the system. Note that the Core Generator may be able to exploit efficiencies that are not obvious by hand, which can result in the core being much smaller than expected.

Add up all the multipliers, and add to this the sizes of any pre-configured cores. The result starts to give you a good idea of the array size required; but, as a final step, consider the speed.

If your design runs the multipliers at less than 50MHz, assume you can use 80% of the FPGA. On the other hand, if the multipliers run at more than 50MHz, assume a utilisation of only 50% of the device. The reason for this is simple: when you start to optimise the design for speed, you will use Relationally Placed Macros (RPMs). These have a fixed layout, and it may not be possible to fit the blocks together as efficiently as in the slower case.
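The derating rule above can be captured in the same style; this small helper is our own sketch, using the 50MHz threshold and the 80%/50% figures from the text.

def usable_clbs(total_clbs, multiplier_clock_hz):
    # Rule of thumb: assume 80% of the device is usable below 50MHz,
    # only 50% above, to allow for the fixed layout of RPMs.
    utilisation = 0.8 if multiplier_clock_hz < 50e6 else 0.5
    return int(total_clbs * utilisation)

# Example: an XC2S200 (1176 CLBs) with multipliers clocked at 80MHz
print(usable_clbs(1176, 80e6))   # -> 588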

Finally, remember that while we have been counting CLBs, it is likely that some of your cores will use the Block RAMs. Make sure you don’t use more than the device has!

Device      Gates   CLB Array       Block RAM                Distributed RAM   18x18 Multipliers
XC2S200     200K    28x42 = 1176    14 blocks, 7Kbytes       9.2Kbytes         none
XC2V1000    1M      40x32 = 1280    40 blocks, 90Kbytes      20Kbytes          40
XC2V3000    3M      64x56 = 3584    96 blocks, 216Kbytes     56Kbytes          96
XC2V6000    6M      96x88 = 8448    144 blocks, 324Kbytes    132Kbytes         144
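If you make this comparison often, the table can be captured as data. The sketch below is a hypothetical helper (the names are ours) that picks the smallest device meeting a CLB, block RAM and multiplier estimate, using the figures listed above.

# (device, CLBs, block RAMs, block RAM bytes, 18x18 multipliers), from the table above
DEVICES = [
    ("XC2S200",  1176,  14,   7 * 1024,   0),   # no dedicated multipliers
    ("XC2V1000", 1280,  40,  90 * 1024,  40),
    ("XC2V3000", 3584,  96, 216 * 1024,  96),
    ("XC2V6000", 8448, 144, 324 * 1024, 144),
]

def smallest_device(clbs_needed, brams_needed, multipliers_needed=0):
    # DEVICES is ordered smallest to largest, so return the first match.
    for name, clbs, brams, _bram_bytes, multipliers in DEVICES:
        if clbs >= clbs_needed and brams >= brams_needed and multipliers >= multipliers_needed:
            return name
    return None

print(smallest_device(700, 8))   # -> XC2S200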

Storing the Data

Often an FPGA will be used to process arrays that are too large to be stored on-chip, images being a common example. In these cases, some access to off-chip bulk storage will be required. This was not implemented on the first FPGA modules, but is planned for future products.

Here the danger is assuming that the processing is the bottleneck. Sometimes it is; but often the bottleneck will be getting data onto the chip. As an example, consider applying a 9x9 filter to a 2K*2K image, using 32-bit greyscale pixels. This type of operation may be performed in medical image processing.

The 9x9 filter can be implemented using 81 multipliers, which would force us to use a very large FPGA! However, it would be possible to clock this array at around 100MHz. Assuming that we can store already-read data within the FPGA, the memory interface "only" needs to provide 9 32-bit pixels at 100MHz, an aggregate transfer rate of 3.6Gbytes/second.

If the data were stored in (say) 100MHz SDRAM, we would need 9 32-bit buses, or a single bus 288 bits wide, to read the data fast enough.
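The bandwidth arithmetic is easy to get wrong, so it is worth writing out. The lines below simply reproduce the 9x9 example; all of the numbers are the ones used in the text.

# Memory bandwidth for the 9x9 filter example
pixels_per_output = 9        # nine new 32-bit pixels fetched per output, the rest reused on-chip
bytes_per_pixel = 4          # 32-bit greyscale
clock_hz = 100e6             # filter array clocked at 100MHz

bandwidth_bytes_per_second = pixels_per_output * bytes_per_pixel * clock_hz
print(bandwidth_bytes_per_second / 1e9)   # -> 3.6 (Gbytes/second)

# Width of a single 100MHz SDRAM bus that could sustain this
print(pixels_per_output * 32)             # -> 288 (bits)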

While this is an extreme example, it is worth checking that the data rate you need can be supported by the module selected. Also, check that the data format is compatible with the memory – SDRAM will not achieve its full performance where accesses are random. Ideally, keep all the data on-chip for best performance!

Buffering the Data

For many applications, it may be desirable to use the FPGA to implement buffers or additional FIFOs. This could be used to allow a processor to load a pattern into an I/O module, and have that pattern repeatedly played out via a DAC; or to allow the processor to send an entire message to RS232, without waiting for the data to be sent. Buffering can also be used to help processors sustain high data rates, allowing an ADC to continue sampling even when the processor is servicing interrupts.

In this case, we must add the overhead of building a FIFO or buffer to our requirements. Virtex FPGAs are good at this: they have dedicated block RAMs, and the logic cells can also be configured as distributed RAM to create buffering.

FIFOs can be created in either memory type. As a rough guide to the maximum available, consult the table above. Note that each Virtex block RAM represents 512 bytes (Virtex II block RAMs are larger, as the table shows); this is the granularity with which you must use them. Also, using the whole device as distributed RAM leaves no space for any logic.
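Estimating block RAM usage for a FIFO is then just a matter of rounding up to the 512-byte granularity. This helper is our own sketch and assumes the Virtex-style 512-byte blocks described above.

import math

def block_rams_for_fifo(fifo_bytes, bytes_per_block=512):
    # Block RAMs can only be allocated whole, so round up.
    return math.ceil(fifo_bytes / bytes_per_block)

print(block_rams_for_fifo(4 * 1024))   # -> 8, as used in the example below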

Example

For a filtering application, we need:

  • Interface to an ADC
  • 16-tap filter, 80MHz, 16-bit precision
  • 4Kbyte FIFO / buffer
  • Interface to FIFO

To implement this, a first approximation would be:

Function          Description                                         Virtex (CLBs / block RAM)    Virtex II (CLBs / multipliers / block RAM)
ADC Interface     General I/O                                         <30 CLBs                     <30 CLBs
Filter            16 taps @ 80MHz = 1280M multiplies/second;          640 CLBs                     16 multipliers
                  160MHz multipliers, so 8 multipliers required;
                  16-bit multiplier = 80 CLBs each
FIFO              4Kbyte FIFO; 512 bytes/block RAM = 8 block RAMs     <5 CLBs, 8 block RAMs        <5 CLBs, 8 block RAMs
HEART Interface   General I/O                                         <30 CLBs                     <30 CLBs
Total                                                                 c. 700 CLBs, 8 block RAMs    c. 65 CLBs, 16 multipliers, 8 block RAMs

From this approximation, we can see that we could fit the design into the XC2S200: about 700 CLBs versus 1176 available, and 8 block RAMs versus 14 available. This is a utilisation of around 59% of the FPGA, which leaves a comfortable margin.
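For completeness, here is the same first-approximation arithmetic written out end to end for the Virtex column of the table; the figures are those used above, and the layout is purely illustrative.

import math

# Filter: 16 taps at 80MHz, 16-bit precision, 160MHz Virtex multipliers
multipliers = math.ceil(16 * 80e6 / 160e6)    # 8 multipliers
filter_clbs = multipliers * 80                # approx. 80 CLBs per 16-bit multiplier = 640 CLBs

# FIFO: 4Kbytes at 512 bytes per Virtex block RAM, plus a little control logic
fifo_brams = math.ceil(4 * 1024 / 512)        # 8 block RAMs
fifo_clbs = 5                                 # <5 CLBs

# Two general-purpose interfaces (ADC and HEART), <30 CLBs each
interface_clbs = 2 * 30

total_clbs = filter_clbs + fifo_clbs + interface_clbs
print(total_clbs, fifo_brams)                 # -> 705 8  (c. 700 CLBs, 8 block RAMs)
print(round(total_clbs / 1176, 2))            # -> 0.6, roughly 59-60% of an XC2S200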

However, when this design was built using the Core Generator for the filter, it used fewer than 600 CLBs and was placed easily in the XC2S200. The design was significantly smaller than our estimate because the Core Generator used a more efficient implementation of the filter (a bit-serial filter) which mapped better onto the FPGA architecture. This also reflects the conservative nature of the figures we used.

This example shows exactly how you should use this document – it will give you a first approximation, but it is entirely likely that the final design will require an FPGA array either one size larger or one size smaller than that calculated. Take the figures calculated here as a first estimate only!