Floating Point Controller as a PicoBlaze Network on a Single Spartan 3 FPGA

Jiří Kadlec1, Roger Gook2

1Institute of Information Theory and Automation,

Academy of Sciences of the Czech Republic, Prague, CZ

Tel: +420 2 6605 2216 Email: kadlec@utia.cas.cz

2Celoxica Ltd. Abingdon Oxford, UK

Tel: +44 1235 863656 Email: roger.gook@

Introduction

A single FPGA implementation of a large class of DSP floating point algorithms can be simplified by breaking it down into smaller manageable HW accelerated processes interfaced by dedicated dual-ported block RAMs, and controlled with multiple dedicated processor controllers organized as one master and several workers in a star topology.

This paper documents our current experience with such a topology and suggests a design strategy based on Simulink bit-exact modeling of DSP primitives. Finally, we report the measured power consumption of a 400 MFLOP design implemented on a single Virtex 2 XC2V1000-4 and compare it with the estimated power consumption of the same design on a 90 nm Spartan 3 XC3S1000-4 part and on the new low-power Spartan 3 XC3S1000L-4 part.

We have used the PicoBlaze KCPSM3 processor designed by Ken Chapman of Xilinx. This compact VHDL core can be downloaded for free from the Xilinx web site. The KCPSM3 distribution includes an assembler, RS232 macros with FIFOs and a uart_clock demo.

The KCPSM3 processor is a simple 8bit CPU optimized for the Virtex 2, Virtex 4 and Spartan 3 FPGA families, which are equipped with 18 kbit block RAMs (BRAMs). The processor uses one 1024 x 18 BRAM for program storage and only 96 FPGA slices. The KCPSM3 version of the PicoBlaze core has 16 registers, 64 bytes of scratch-pad memory, interrupts, a fixed-size stack and a very simple 8bit I/O bus. All instructions execute in exactly 2 clock cycles, and there is HW support for interrupt handling.


Fig. 1: PicoBlaze-based Architecture for Floating-point DSP

Processor Network on a Single FPGA

Our architecture uses 1 master PicoBlaze processor and 4 simplified PicoBlaze worker processors (see Fig. 1).

Master

The master PicoBlaze software uses the time base with 1 microsecond resolution and the RS232 UARTs from Ken Chapman's uart_clock demo.

The master is connected with each of the workers by I/O mapped BRAMs organized in 2048 8bit words. In general, the master and each worker can use the BRAMs in parallel and RD as well as WR operations can be performed at any time by both processors from and to any address.

Master-Worker Interface

The constant execution time (2 clock cycles per instruction), and therefore the predictable synchronous operation, makes it possible to implement HW protection against simultaneous WR operations in which both processors write to the same address. Within two consecutive clock cycles both processors can write to the same address. The dual-ported memory that connects each worker with the master is organized in 8 banks of 256 bytes. If the master and a worker write at the same time to the same address in different banks, there is no conflict.

However, our experience indicates that the programs of the two processors have to cooperate anyway. Flags (reserved memory locations) synchronize both programs: a flag is set by one processor and monitored by the other. Therefore, an additional HW protection mechanism is not implemented in the reported designs.
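The flag handshake itself lives in the PicoBlaze assembler programs of the master and the worker. Purely as an illustration of the idea, the following C sketch models one such handshake over the shared dual-ported BRAM; the locations MSG_FLAG and MSG_DATA and the polling loop are hypothetical and are not taken from the actual designs.

    #include <stdint.h>

    /* Shared dual-ported BRAM seen by both master and worker (2048 x 8 bit). */
    volatile uint8_t shared[2048];

    /* Hypothetical locations: one flag byte plus a small message area. */
    #define MSG_FLAG 0x000   /* 0 = empty, 1 = message valid       */
    #define MSG_DATA 0x001   /* first byte of the message payload  */

    /* Master side: publish a message, then raise the flag last. */
    void master_send(const uint8_t *msg, uint8_t len)
    {
        for (uint8_t i = 0; i < len; i++)
            shared[MSG_DATA + i] = msg[i];
        shared[MSG_FLAG] = 1;            /* flag is written only after the data */
    }

    /* Worker side: poll the flag, consume the message, clear the flag. */
    void worker_receive(uint8_t *msg, uint8_t len)
    {
        while (shared[MSG_FLAG] == 0)
            ;                            /* the real worker polls with INPUT instructions */
        for (uint8_t i = 0; i < len; i++)
            msg[i] = shared[MSG_DATA + i];
        shared[MSG_FLAG] = 0;            /* hand the buffer back to the master */
    }

The important detail is the ordering: the sender writes the payload first and raises the flag last, so the receiver never sees a half-written message.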

Assembler and Program Download

The PicoBlaze assembler compiles the program into VHDL code that defines the content of the program BRAM. The PicoBlaze architecture uses only one port of this BRAM; therefore the second port can be used to download program(s) into one or all of the embedded processors. This removes the time-consuming VHDL synthesis and place & route step after each modification of the program. However, the logic that services such downloads is board-specific, and this would partially spoil our comparison of power consumption; therefore we focus on the simplest solution: the assembler creates a VHDL description of the BRAM contents for each PicoBlaze, and the complete VHDL design is recompiled into one single bitstream.

In the designs presented in this paper the main task of the master is to provide the real-time base for the complete system, to interpret commands from the user and to send the results to the PC via the serial line. We have implemented and used mainly hexadecimal memory dump functions, providing visibility of selected memory locations of the BRAMs that connect the master with all four workers.

The second key task of the master is to coordinate and reconcile the partial results generated by the workers, including the use of the real-time base with 1 microsecond resolution.

PicoBlaze Workers and Floating Point Accelerators

Each worker has a simple interface to a dedicated DSP HW accelerator. The interface supports data words wider than 8 bits, and provides sufficient bandwidth.

In our design this interface is implemented by three dual-ported BRAMs X, Y, Z organized in 1024 18bit words. Each worker can read and write to each of these 3 BRAMs via the 8bit I/O mapped interface. Naturally, this needs some temporary address and data registers, and a single read or write takes several instructions of the worker.
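The paper does not give the exact port map, so the following C sketch only illustrates why a single 18bit access costs several worker instructions; the port numbers, the temporary registers and the commit-on-last-byte behaviour are all assumptions standing in for the real 8bit I/O-mapped interface.

    #include <stdint.h>

    /* Behavioral model of one of the X, Y, Z dual-ported BRAMs (1024 x 18 bit)
     * and of the worker's 8bit I/O-mapped access to it.  All port numbers and
     * the commit-on-last-byte behaviour are assumptions. */
    #define P_ADDR_LO 0x10   /* BRAM word address, bits 7..0                    */
    #define P_ADDR_HI 0x11   /* BRAM word address, bits 9..8                    */
    #define P_DATA_LO 0x12   /* data bits  7..0                                 */
    #define P_DATA_MD 0x13   /* data bits 15..8                                 */
    #define P_DATA_HI 0x14   /* data bits 17..16; writing this commits the word */

    static uint32_t bram[1024];   /* the second port belongs to the DSP HW */
    static uint16_t addr_reg;     /* temporary address register            */
    static uint32_t data_reg;     /* temporary data register               */

    /* Stand-in for one PicoBlaze OUTPUT instruction (2 clock cycles each). */
    static void outp(uint8_t port, uint8_t v)
    {
        switch (port) {
        case P_ADDR_LO: addr_reg = (uint16_t)((addr_reg & 0x300u) | v);               break;
        case P_ADDR_HI: addr_reg = (uint16_t)((addr_reg & 0x0FFu) | ((v & 3u) << 8)); break;
        case P_DATA_LO: data_reg = (data_reg & ~0xFFu)   | v;                         break;
        case P_DATA_MD: data_reg = (data_reg & ~0xFF00u) | ((uint32_t)v << 8);        break;
        case P_DATA_HI: data_reg = (data_reg & 0xFFFFu)  | ((uint32_t)(v & 3u) << 16);
                        bram[addr_reg] = data_reg;                                    break;
        }
    }

    /* Writing one 18bit word costs five OUTPUT instructions, plus the register
     * loads that prepare the individual bytes. */
    void bram_write18(uint16_t addr, uint32_t word18)
    {
        outp(P_ADDR_LO, (uint8_t)addr);
        outp(P_ADDR_HI, (uint8_t)(addr >> 8));
        outp(P_DATA_LO, (uint8_t)word18);
        outp(P_DATA_MD, (uint8_t)(word18 >> 8));
        outp(P_DATA_HI, (uint8_t)(word18 >> 16));
    }

Reading is symmetric: the worker first sets the address and then fetches the three data bytes with INPUT instructions.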

On the other hand, the DSP HW can access the BRAMs (both in RD and WR operations) in a single clock cycle from the second port of each dual-ported BRAM.

HW Acceleration and Problems with Reuse

The workers serve as controllers for dedicated floating-point HW accelerators. The DSP performance of the FPGA comes from multiple parallel accesses to the BRAMs and from optimized word lengths. Another necessary condition for reasonable DSP performance is the batching of pipelined operations.

Parts of the FPGA HW logic can be reused in several contexts with the use of multiplexers. Yet this approach has certain limits, because the size of the multiplexers grows quickly, which effectively limits the number of HW paths that can be switched. This is where the concept of master-worker micro-controllers can help.

An Example: A Product of Two Floating-Point Matrices

Many common DSP and basic algebra algorithms can be split into a few primitives. An example is a product of two matrices, which can be decomposed into a sequence of vector-by-vector product primitives. These primitives keep the basic feature of limited-complexity batch operations: starting in a BRAM, performing a relatively simple sequence of pipelined operations at the maximal clock speed, and returning the result(s) back to another BRAM. These primitives can be mapped effectively to HW, including the autonomous data-flow control in HW.

To implement a vector-by-vector product primitive the worker sets the addresses of the inputs (BRAMs X and Y), defines the length of the vector, and sets the address for the scalar floating-point result output (BRAM Z). The worker sets these data by writing to dedicated BRAM locations accessible by the HW accelerator. Once this is done, the HW accelerator performs the vector-by-vector product computation independently. The worker can do other preparatory tasks for the next batch. Once the HW accelerator finishes the computation and the result is written to the BRAM Z, it issues an interrupt to the worker (acting as a HW controller) to inform it that the batch has finished.
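The worker side of this protocol is written in PicoBlaze assembler; the C sketch below only outlines the sequence of steps, and all control locations (CTRL_X_BASE, CTRL_START, the batch_done flag) are hypothetical names introduced for the illustration.

    #include <stdint.h>

    /* Hypothetical control locations inside the Z BRAM that are also visible
     * to the HW accelerator through its own port. */
    #define CTRL_X_BASE  0x3F0   /* start address of the X operand vector */
    #define CTRL_Y_BASE  0x3F1   /* start address of the Y operand vector */
    #define CTRL_LENGTH  0x3F2   /* number of elements N                  */
    #define CTRL_Z_DEST  0x3F3   /* where the scalar result is written    */
    #define CTRL_START   0x3F4   /* writing here starts the batch         */

    void bram_write18(uint16_t addr, uint32_t word18);   /* see the previous sketch */

    volatile uint8_t batch_done;   /* set by the interrupt service routine */

    /* Interrupt handler: the accelerator raises an interrupt once the result
     * has been written to the Z BRAM. */
    void isr_batch_finished(void) { batch_done = 1; }

    /* One vector-by-vector product batch of length n. */
    void run_dot_product(uint16_t x_base, uint16_t y_base, uint16_t n, uint16_t z_dest)
    {
        bram_write18(CTRL_X_BASE, x_base);
        bram_write18(CTRL_Y_BASE, y_base);
        bram_write18(CTRL_LENGTH, n);
        bram_write18(CTRL_Z_DEST, z_dest);

        batch_done = 0;
        bram_write18(CTRL_START, 1);   /* the accelerator now runs autonomously */

        while (!batch_done)
            ;                          /* wait for the completion interrupt */
    }

In the real design the worker would use the waiting time to prepare the pointers for the next batch instead of spinning on the flag.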

The job of the worker is to combine the small basic parts of computation into a complete algorithm (floating-point multiplication of two matrices in this example).

The PicoBlaze controller can handle this level of the algorithmic design better in SW. The HW needed for multiplexed reuse of shared HW (such as the floating-point adder or multiplier) does not grow with the growing complexity of the algorithm.

Let us make clear that a few such primitives can coexist and share the same HW resources. The worker can start a batch performing a vector-by-vector product, another batch performing just a single floating-point addition, an addition of two vectors, etc.

In addition, the floating-point units are pipelined. Therefore, at the end of the batch even a simple vector-by-vector product needs a wind-up stage in which the latency of the floating-point adder is taken into account and the HW computes the final sum of the partial sums. For a few final clock cycles this operation reconnects the adder to a slightly different context; this is done by HW multiplexers.
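As a purely behavioral illustration of the wind-up stage, the following C sketch models the 2-cycle adder latency by accumulating into two interleaved partial sums and folding them together at the end; the values are ordinary doubles rather than 18m11 words.

    #include <stddef.h>

    /* Behavioral model of a dot product computed with a 2-cycle pipelined adder:
     * products are fed round-robin into ADDER_LATENCY partial sums, and the
     * wind-up stage folds the partial sums into the final scalar result. */
    #define ADDER_LATENCY 2

    double dot_product_pipelined(const double *x, const double *y, size_t n)
    {
        double partial[ADDER_LATENCY] = { 0.0, 0.0 };

        /* Main batch: one multiply-accumulate per clock cycle, interleaved
         * over the partial sums so the adder pipeline never stalls. */
        for (size_t i = 0; i < n; i++)
            partial[i % ADDER_LATENCY] += x[i] * y[i];

        /* Wind-up: the adder is briefly reconnected (by HW multiplexers)
         * to sum the partial sums. */
        double sum = 0.0;
        for (size_t i = 0; i < ADDER_LATENCY; i++)
            sum += partial[i];
        return sum;
    }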

Scalable, Short-Latency Floating-Point Modules

For DSP research our laboratory uses a set of floating-point modules with these precisions: 18m11, 24m17, 32m23 and 36m27. The first number indicates the word length and the second number indicates the number of bits of the mantissa. The 18m11 precision has been used in the designs described in this paper to exploit the 18bit wide BRAMs of Spartan 3, Virtex 2 and Virtex 4. This floating-point format works with 11 bits for the mantissa, 6 bits for the exponent and 1 bit for the sign. All modules handle special cases like zero, underflow, overflow, NaN and their combinations.
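The paper fixes only the field widths (1 sign bit, 6 exponent bits, 11 mantissa bits). The C sketch below shows how such an 18bit word can be packed and unpacked; the field ordering, the exponent bias of 31 and the implicit leading one are assumptions made for the illustration and may differ from the actual Celoxica format.

    #include <math.h>
    #include <stdint.h>

    /* Assumed 18m11 layout: bit 17 = sign, bits 16..11 = exponent,
     * bits 10..0 = mantissa. */
    typedef struct {
        unsigned sign;       /*  1 bit  */
        unsigned exponent;   /*  6 bits */
        unsigned mantissa;   /* 11 bits */
    } fp18m11_t;

    static uint32_t fp18m11_pack(fp18m11_t f)
    {
        return ((f.sign     & 0x01u) << 17)
             | ((f.exponent & 0x3Fu) << 11)
             |  (f.mantissa & 0x7FFu);
    }

    static fp18m11_t fp18m11_unpack(uint32_t word18)
    {
        fp18m11_t f;
        f.sign     = (word18 >> 17) & 0x01u;
        f.exponent = (word18 >> 11) & 0x3Fu;
        f.mantissa =  word18        & 0x7FFu;
        return f;
    }

    /* Approximate numeric value under the assumed conventions
     * (zero and the other special cases would be handled separately). */
    static double fp18m11_value(fp18m11_t f)
    {
        double m = 1.0 + (double)f.mantissa / 2048.0;   /* hidden one + 11 fraction bits */
        return (f.sign ? -1.0 : 1.0) * ldexp(m, (int)f.exponent - 31);
    }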

All the floating-point ADD, MULT, FIXPT2FLOAT and FLOAT2FIXPT modules are derived from the original single-cycle Handel-C floating-point library macros included in the Celoxica DK 1.1 tool.

These macros have been re-compiled in Celoxica DK4 into VHDL modules, wrapped with registers at the inputs and outputs, and then retimed by Synplicity Synplify Pro 8.1 into pipelined EDIF modules with increased throughput. This approach resulted in short-latency modules optimized for cooperation with the PicoBlaze (the target was 50 MHz for the slowest (-4) speed grade of Spartan 3). The modules can invert each of the input operands (mainly to implement the SUB operation on the same HW).

The ADD and MULT modules have a latency of 2 clock cycles. The 18m11 and 24m17 MULT module uses only one embedded 18bit multiplier, while the MULT module in the 32m23 or 36m27 format needs four 18bit multipliers.

FIXPT2FLOAT and FLOAT2FIXPT have a latency of 4 clock cycles. These modules include a conversion from/to a 32bit wide signed fixed-point number with a selectable position of the binary point. Both modules also include exception handling.

To stay compact the DIV and SQRT modules are implemented as sequential state automata. These two modules cannot be pipelined.

All the operations are bit-exact compatible with the Celoxica floating-point arithmetic format. The format applies no rounding, and there is no de-normalization of mantissa in the case of underflow.

Simulink and Handel-C based Design Methodology

In our approach, the complete floating-point algorithm is decomposed into a sequence of simple batches. The HW accelerators implement these DSP primitives autonomously. Combinations of Simulink with Handel-C enable us to quickly develop and debug different HW batch accelerators.

The methodology is based on bit-exact Simulink models, using C++ S-function models of floating-point units generated from the Handel-C code by the Celoxica DK4 tool (see Fig. 2).

The Matlab/Simulink bit-exact models are created by the DK4 Handel-C simulator in the debug mode. The simulator internally generates a C++ model of the HW implementation and compiles it into an object file by invoking a standard C++ compiler; the result can be linked to a Matlab/Simulink S-function as a single DLL library.

This library can be used from Simulink on any PC with Windows 2K/XP, and does not require the DK4 tool.


Fig. 2: Simulink bit-exact model of a vector-by-vector product HW accelerator

The Handel-C source code of the design simply calls one of the original Celoxica scalable floating-point macros compiled for a given precision (18m11). Simulink interprets the data-flow diagram sequentially. The different Simulink floating-point blocks that model in a bit-exact fashion the function of the EDIF modules share one single DLL library. The top level of these blocks simply passes data from Simulink to the DLL library together with a selection parameter that selects the path to the correct executable binary model of the original Handel-C floating-point macro.

At the Simulink level each floating-point number is represented as a vector of three double-precision numbers: sign, exponent and mantissa (in fact these are unsigned integers with widths of 1, 6 and 11 bits in the case of the 18m11 floating-point format). Using the double-precision format requires minimal specialized knowledge of Simulink and helps to keep things simple.
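In the paper the bit-exact model behind each block is generated automatically by DK4 and shipped as a DLL. Purely for illustration, a hand-written level-2 C MEX S-function using the same three-element signal convention might look like the sketch below; the entry point celoxica_fp18m11_mult, which stands for the generated bit-exact model, is hypothetical.

    /* Illustrative level-2 C S-function: one 18m11 multiplier block whose ports
     * carry [sign, exponent, mantissa] as a 3-element double vector. */
    #define S_FUNCTION_NAME  fp18m11_mult_sfcn
    #define S_FUNCTION_LEVEL 2
    #include "simstruc.h"

    /* Hypothetical entry point into the DK4-generated bit-exact model. */
    extern void celoxica_fp18m11_mult(const double x[3], const double y[3], double z[3]);

    static void mdlInitializeSizes(SimStruct *S)
    {
        ssSetNumSFcnParams(S, 0);
        if (!ssSetNumInputPorts(S, 2)) return;
        ssSetInputPortWidth(S, 0, 3);                 /* [sign exp mantissa] */
        ssSetInputPortWidth(S, 1, 3);
        ssSetInputPortDirectFeedThrough(S, 0, 1);
        ssSetInputPortDirectFeedThrough(S, 1, 1);
        if (!ssSetNumOutputPorts(S, 1)) return;
        ssSetOutputPortWidth(S, 0, 3);
        ssSetNumSampleTimes(S, 1);
    }

    static void mdlInitializeSampleTimes(SimStruct *S)
    {
        ssSetSampleTime(S, 0, INHERITED_SAMPLE_TIME);
        ssSetOffsetTime(S, 0, 0.0);
    }

    static void mdlOutputs(SimStruct *S, int_T tid)
    {
        InputRealPtrsType xp = ssGetInputPortRealSignalPtrs(S, 0);
        InputRealPtrsType yp = ssGetInputPortRealSignalPtrs(S, 1);
        real_T *z = ssGetOutputPortRealSignal(S, 0);
        double x[3] = { *xp[0], *xp[1], *xp[2] };
        double y[3] = { *yp[0], *yp[1], *yp[2] };
        celoxica_fp18m11_mult(x, y, z);               /* bit-exact HW model */
    }

    static void mdlTerminate(SimStruct *S) {}

    #include "simulink.c"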

The precision of all modules used in the Simulink model can be changed easily: the identical Simulink model is simply opened in a new directory with another DLL library that represents the collection of bit-exact models for a different precision.

Design Steps

The design and verification strategy for the floating-point DSP batch modules for the PicoBlaze network can be described as follows.

Step 1: Design in Simulink

A model of the batch operation is designed in Simulink. The general source blocks represent the input BRAMs X and Y. The output represents the results of the batch operation in the BRAM Z. It is often helpful to add a double-precision equivalent of the DSP batch built from the standard Simulink blocks and to compare both results in the same diagram, to see the errors coming from the limited precision of the bit-exact model. The Simulink simulation results can be stored in Matlab variables and converted by very simple M-function Matlab scripts into a format suitable for the Celoxica DK4 HW simulator or for a direct download to the PicoBlaze network.

Step 2: Convert the Model by Hand into a Handel-C Testbench Template

In this step the floating-point HW batch is simulated with the test vectors from step 1.

The data-flow graph from step 1 has to be converted by hand into Handel-C code that connects the EDIF floating-point blocks and the BRAMs.

Such a hand-coded Handel-C testbench can be debugged and traced at the register level in the Celoxica DK4 software simulator. A simple set of I/O functions helps to automate the connection to the data files generated from the Simulink model in step 1. The simulator supports the BRAMs. In the simulation the FP modules are replaced by calls to the Handel-C floating-point library of bit-exact functions, wrapped by two registers to get the same latency as the optimized and retimed EDIF FP modules.
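Conceptually, the wrapper only delays the combinational library result by two register stages, so that the simulated latency matches the retimed EDIF module. A minimal C model of such a wrapper (names assumed) is:

    #include <stdint.h>

    /* Two-stage register wrapper: the combinational result computed in cycle t
     * becomes visible at the output in cycle t + 2, matching the 2-cycle latency
     * of the retimed EDIF floating-point modules. */
    typedef struct {
        uint32_t stage[2];   /* pipeline registers */
    } reg2_t;

    /* Clock the wrapper once: push the new combinational result, return the
     * value that entered two cycles earlier. */
    static uint32_t reg2_clock(reg2_t *p, uint32_t combinational_result)
    {
        uint32_t out = p->stage[1];
        p->stage[1]  = p->stage[0];
        p->stage[0]  = combinational_result;
        return out;
    }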

Step 3: Test on Real Hardware

The Handel-C testbench is compiled in DK4 for a HW kit to verify the function on real HW. We have designed specific versions of the I/O functions that automate the connection of the HW kit to the Matlab environment without the need to modify the Handel-C code of the DSP testbench. The Celoxica RC200E board with the Virtex 2 XC2V1000-4 has been used. It is connected to a PC via a parallel port, and a simple set of C functions is available for booting the FPGA and for the subsequent use of the same parallel port for data exchange with the PC.

Step 4: Create a Reusable Module

Cut and paste the debugged HW accelerator (BRAM -> DSP batch -> BRAM) from the Handel-C testbench. DK4 can compile it into a reusable EDIF module.

Step 5: Connect the Reusable Module to Worker BRAMs

Integrate the verified DSP modules (HW accelerators from step 4) to the DSP ports of the X, Y and Z BRAMs of the PicoBlaze worker. The top level for the integration of the PicoBlaze network can be Handel-C, VHDL or Verilog. Handel-C and the DK4 compiler have been used in the designs presented in this paper.

Step 6: Test the Function of the Module under the Worker Program Control

Test the function of a single DSP module and the corresponding batch operation first on one PicoBlaze worker with the memory dump support from the master. Use the test vectors from the high-level Simulink testbench (step 1) in the format suitable for the PicoBlaze network. Here we verify the correct function of the PicoBlaze worker assembler code with the HW accelerator.

Step 7: Develop the DSP Design in Assembler and Debug it by Memory Dumps

Integrate the DSP design to combine multiple batch runs on the same processor (like the matrix product), and expand it to include multiple workers synchronized by the master, large data sets and real-time constraints. The debugging tool is the memory dump support of the master PicoBlaze. Concentrate on SW to manage and optimize the combination of the DSP batch processes on a single worker as well as on the PicoBlaze network.

Benefits

The PicoBlaze workers help to get more generic and more flexible floating-point algorithms without an additional increase of the HW complexity due to the irregularities and complex multiplexing structures of the reused HW units.

Last but not least, the definition of the algorithm in the PicoBlaze worker SW gives the developer the flexibility of functions and function calls (even if only in assembly language). Therefore, with the same set of DSP primitives one worker can provide not only the matrix-by-matrix product, but also the HW-supported sum or difference of two floating-point matrices. Another HW-accelerated batch can perform a matrix transposition like X -> X', or move a matrix from the BRAM Z to the BRAM X or Y.

Floating Point DSP Performance Examples

The HW batch that supports the vector-by-vector product consists of one FP multiplier and one FP adder operating at 50 MHz and delivering up to 100 MFLOPs.

N + 15 clock cycles are needed to compute the vector product Z[i] = X'[1:N] * Y[1:N].

The 15 clock cycle overhead comes from:

• Reading information about the X, Y, Z addresses and reading N (2 clock cycles – ccs)

• Respecting the pipeline of the FP multiplier (2 ccs), FP adder (2 ccs) and registers that connect them (2 ccs)

• Final winding up of the product by sum of partial products (4 ccs)

• Writing the result to the Z BRAM (1 cc); in the final 2 clock cycles the PicoBlaze worker is interrupted (2 ccs)

This analysis indicates the importance of the short latency of the FP modules. Even with this short latency, and with no additional delay coming from the PicoBlaze program, the HW accelerator delivers the following throughput (the figures can be reproduced with the short sketch after the list):

• 50 MFLOPs for the length of vectors N=15

• 90 MFLOPs for the length of vectors N=150

• 98 MFLOPs for the maximal length of the vector N=1000
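These throughput figures follow directly from the N + 15 cycle count and the 50 MHz clock (two floating-point operations, one multiplication and one addition, per vector element); the short C sketch below reproduces them up to rounding.

    #include <stdio.h>

    /* Sustained throughput of the vector-product accelerator:
     * 2*N floating-point operations in N + 15 clock cycles at 50 MHz. */
    static double mflops(int n)
    {
        const double clock_mhz = 50.0;
        return 2.0 * n / (n + 15.0) * clock_mhz;
    }

    int main(void)
    {
        const int lengths[] = { 15, 150, 1000 };
        for (int i = 0; i < 3; i++)
            printf("N = %4d  ->  %5.1f MFLOPs\n", lengths[i], mflops(lengths[i]));
        return 0;
    }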

In the case of a matrix-by-matrix product of two square N x N matrices with N=15, the accelerator delivers 50 MFLOPs. In parallel with the HW, the PicoBlaze can execute up to 15 instructions (30 clock cycles) to set the next memory pointers for each of the 225 HW-accelerated batches that form the matrix-by-matrix product of two 15 x 15 floating-point matrices.

This is 6750 clock cycles in total. One such matrix product is done in 135 microseconds; seven such matrix products can be done in one millisecond.

The floating-point matrix addition, subtraction, transposition of a matrix in the same BRAM, or moving a FP matrix from BRAM Z to BRAM X or Y take M x N + 10 clock cycles for an M x N matrix. This is 235 clock cycles or 4.7 microseconds when M=N=15. Each BRAM can store up to 1024 18bit floating-point words.

Power Evaluation and Comparison of Virtex 2 with Spartan 3(L)

To evaluate the power consumption of the PicoBlaze network architecture with five PicoBlaze CPUs and FP accelerators on the Virtex 2, Spartan 3 and Spartan 3L (low-power) architectures, we concentrate on this specific test case:

A product of two 1024-element vectors implemented on 4 PicoBlaze workers, each performing a 256-element part of the total product in parallel.

In this case each PicoBlaze worker delivers 94 MFLOPs, and the total sustained performance of the network on a single FPGA is 376 MFLOPs. One partial product is done in 271 clock cycles; this corresponds to a maximal sampling frequency of 184 kHz. The combined 1024-element vector product is used as a 1024-tap finite-impulse-response (FIR) filter.
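Assuming only the 50 MHz clock and the N + 15 cycle count from the previous section, the short C check below gives 94.5 MFLOPs per worker and a 184.5 kHz batch rate; the paper quotes the rounded values 94 MFLOPs per worker, 4 x 94 = 376 MFLOPs in total and 184 kHz.

    #include <stdio.h>

    int main(void)
    {
        const double clock_hz = 50e6;
        const int    n_per_worker = 256;         /* 1024 taps split over 4 workers */
        const int    cycles = n_per_worker + 15; /* = 271 clock cycles per batch   */

        /* Four workers run in parallel; the paper rounds this figure down to
         * 94 MFLOPs per worker, i.e. 4 x 94 = 376 MFLOPs for the whole network. */
        printf("per worker: %.1f MFLOPs\n", 2.0 * n_per_worker / cycles * 50.0);
        printf("maximal sampling frequency: %.1f kHz\n", clock_hz / cycles / 1e3);
        return 0;
    }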

The left column of the table in Fig. 3 presents the design size and the measured power consumption for a running design that continuously performs the combined 1024-element vector product used as a FIR filter (376 MFLOPs in the 18m11 arithmetic). In total there are 4 FP adders and 4 FP multipliers. The input/output data are stored in the 18m11 floating-point format in the DSP BRAMs of the workers. The designs currently do not perform the conversion from/to the fixed-point data format.

The master maintains the real-time base with 1 microsecond resolution and provides the memory dump facility.

The complete designs use only the 50 MHz clock input and two pins for the RS232 interface with the PC. This arrangement has been selected to isolate the Virtex 2 design from the rest of the board and to remove the I/O connections to the external memory. The Virtex 2 implementation has been tested on the Celoxica RC200E board. The power consumption of the chip has been measured indirectly through the package temperature. This measurement helped us to set up the Xilinx XFLOW power-modeling tool properly.

The same setting of the XFLOW tool has been used to estimate the power consumption of the identical design, this time compiled for the 90 nm Spartan 3 and Spartan 3L FPGAs in the same package as the Virtex 2; for the results see the right column of the table in Fig. 3.

                          Virtex 2 xc2v1000-4 fg456        Spartan 3 xc3s1000(L)-4 fg456

    Slice flip-flops      2905  (28%)                      2637  (17%)
    4-input LUTs          4241  (42%)                      4424  (28%)
    Occupied slices       3292  (64%)                      3097  (40%)
    BRAMs                   21  (52%)                        21  (87%)
    MULT18x18s               4  (10%)                         4  (16%)
    Clock                 50 MHz (ISE: 53.3 MHz)           50 MHz (ISE: 50.6 MHz)

    Power                 measured (XPower setting          estimated
                          verified by measurement           Spartan 3    Spartan 3L
                          of case temperature)
    Vccint  dynamic        666 mW                             93 mW        91 mW
            quiescent       18 mW                             78 mW        36 mW
    Vccaux  dynamic          0 mW                              0 mW         0 mW
            quiescent      330 mW                             62 mW        62 mW
    Vcco    dynamic          3 mW                              1 mW         1 mW
            quiescent        3 mW                              0 mW         0 mW

    Total                 1020 mW                            234 mW       190 mW

Fig. 3: Area/Power 376 MFLOP (18m11) FIR filter, 1024 taps, 5 PicoBlaze CPUs

[Fig. 4 floor-plan images]
Virtex 2 xc2v1000-4, package fg456 | Spartan 3L xc3s1000(L)-4, package fg456 | Spartan 3E xc3s1200E-4, package fg400

Fig. 4: Placement for Virtex 2, Spartan 3, 3(L) and the low cost Spartan 3E.

Lessons Learned

The design strategy path “Simulink model -> DK4 debug -> HW debug -> Reuse with PicoBlaze” for implementation of floating-point DSP modules proved to work very well.

The PicoBlaze core is really small and simple, hence manageable without the need to combine too many software packages from different vendors.

Spartan 3 reduces the power consumption approximately four times compared to the Virtex 2. The power reduction for the low-power Spartan 3L version is even better. The table in Fig. 3 indicates the sources of the power savings.

The results might change if both implementations had a heavy load on many I/O pins, but in this paper we have concentrated on comparing the power savings in the FPGA core. We can see that the small increase of the leakage current in the 90 nm Spartan 3 is well compensated by the dynamic current savings. The technology used in the Spartan 3L does reduce the quiescent current further, but this reduction is less dramatic.

From the cost/performance point of view, the coming economical Spartan 3E family (see the floor-plan in Fig. 4) might be a very interesting option for our future floating-point and control designs based on the same PicoBlaze network.

The freely downloadable WebPack set of Xilinx design tools covers large Spartan 3 parts; this is also likely to contribute to a wide use of the Spartan 3 family.

Thanks

This work has been partially supported by the Ministry of Education of the Czech Republic projects 1M6840770004 and 1ET400750406.

(Block labels from Fig. 1: 1x PicoBlaze master with a 1 us time base and RS232 at 38200 bps; 4x PicoBlaze workers; 4x 3 dual-ported BRAMs X, Y, Z; 4x dedicated HW, 18bit FP MACs at 50 MHz implementing the vector-product FP HW in the 18m11 format, generated from Simulink and the DK4 testbench.)
