DSPBA: Flow Control, Design Style and Floating Point




September 21, 2011

© Altera Corporation

Table of Contents

1 Overview
2 DSPBA Basics
3 Essential Checklists
3.1 Top Level Design
3.2 Primitive Subsystems
3.3 Verification
4 Recommendation Checklists
4.1 Simulink Design settings
4.2 Top Level Design
4.3 Primitive Subsystems
4.4 Design Style
4.5 Verification
5 FAQ
Design Style
5.1 Getting Started
5.1.1 createPrimitiveFIR
5.1.2 createDSPBADesign
5.2 Using Vectors
5.2.1 Use of vectors: with ModelPrim blocks
5.2.2 Use of vectors: Additional Libraries > Vector Util Library
5.2.3 Use of vectors: Other supported Vector blocks
5.3 Implementing Flow Control
5.3.1 Flow control using latches
5.3.2 Flow control using simple LOOP
5.3.3 Flow control using ForLoop blocks
5.3.4 LOOP vs ForLoop
5.4 Building System Components
5.4.1 Avalon-ST Output
5.4.2 Avalon-ST Input
5.4.3 Avalon-ST Input FIFO
5.4.4 Extending the interface definition
5.4.5 Restrictions on use
5.5 Using ModelPrim subsystems
5.5.1 Interfaces as subsystem boundaries
5.5.2 Interfaces as scheduling boundaries
5.5.3 ModelPrim subsystem design styles to avoid
5.5.4 Common Problems
5.5.5 ModelPrim Blocks outside primitive subsystems
5.5.6 Convert blocks vs. specifying output types via dialog
6 Debugging Designs
7 Floating Point
7.1 Support Outline
7.1.1 Blocks
7.1.2 Interaction with other features
7.2 Floating Point Format
7.2.1 Single Precision Word Formats
7.2.2 Double Precision Word Formats
7.2.3 Floating Point Type propagation
7.3 Special considerations when using floating point
7.3.1 Flow Control, latency hiding and avoiding data dependencies
Appendix: Generated Test-benches
Appendix: Overriding Test-benches in Matlab

Figures

Figure 1: Model File Top Level
Figure 2: Synthesizable System Top Level
Figure 3: Primitive Subsystem Top Level
Figure 4: Using a matrix allows simple initialization of a vector of LUTs each with different contents, without having to specify each separately
Figure 5: Vector initialization of SampleDelay when used with vectors (and equivalent version with individual sample delays)
Figure 6: Expand Scalar block for parameterizable signal replication to vector
Figure 7: Vector Mux Masked Subsystem for dynamic vector signal selection, reparameterizable in vector width
Figure 8: Zero Latency Latch
Figure 9: Single-cycle latency latch
Figure 10: Set/Reset Boolean latch with reset priority
Figure 11: Set/Reset Boolean latch with set priority
Figure 12: Simple filter with enabled delay chain - full layout
Figure 13: Output of demo_forward_pressure, stalled when valid is low – but without the need for a high fanout enable net
Figure 14: Vectorized FIR stage
Figure 15: Vectorized FIR stage using TappedDelayLine from Vector Utils Library
Figure 16: demo_back_pressure example design
Figure 17: Simple FIFO model, used to illustrate considerations in safe usage
Figure 18: Hardware simulation output for example
Figure 19: Loop block and equivalent C code for two-dimensional count limit vector 'c'
Figure 20: demo_kronecker; using nested loops to generate datapaths that operate on regular data
Figure 21: Rectangular nested loop
Figure 22: Triangular nested loop
Figure 23: Avalon Stream Library appears under Additional Libraries as the blocks in the Library are in fact just Masked Primitive subsystems
Figure 24: The Avalon Stream Interface Library
Figure 25: DSP design with Avalon-ST interfaces
Figure 26: Tag the Simulink Port internal to the Avalon-ST Masked Subsystem to define the port role
Figure 27: Example Block Properties GUI for Avalon-ST Masked Subsystem showing Description field
Figure 28: Packing and unpacking a vector into a single data connection
Figure 29: Simple primitive subsystem with line of identical registers across all input to output paths
Figure 30: Another simple primitive subsystem with line of identical registers across all input to output paths
Figure 31: Simple primitive subsystem with independently synchronized outputs
Figure 32: Simple primitive design with multiple input groups and multiple output groups
Figure 33: Schedule, after pipelining, for above primitive subsystem
Figure 34: Example of a multiple I/O primitive subsystem where an input is scheduled after an output
Figure 35: Blocks not driven from clocked blocks
Figure 36: Synchronizing logic dependent on reset (bad)
Figure 37: Synchronizing logic dependent on valid (good)
Figure 38: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate. It can grow the number of bits - sign extending or zero-padding where appropriate
Figure 39: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate
Figure 40: Setting an output type explicitly via a Primitive dialog for any other blocks changes type while preserving the bit pattern. The real world value will generally be scaled in such cases
Figure 41: Setting an output type via dialog and reducing the bit-width will discard the top, 'Most Significant' bits
Figure 42: Output data type selection UI - showing single and double as options
Figure 43: New floating point primitive blocks in 10.1 Advanced Blockset
Figure 44: Use of FIFOs (and loops) to control running of floating point calculations without explicitly waiting for the start-to-finish calculation latency. Result can feed into similar downstream processes
Figure 45: Flow control for Mandelbrot calculation
Figure 46: Insertion of sufficient lumped 'SampleDelay' to allow for pipelining
Figure 47: Generated Automatic TestBench files

Overview

This document describes recommended design styles for DSP Builder, Advanced Blockset, as an incremental addition to the current documentation.

It mostly focuses on recent enhancements to the tool (10.0 and later).

In particular it covers use of vectors with ModelPrim blocks, how to implement efficient flow control, floating point and special design considerations when using it, and some design patterns to avoid. Examples are used to illustrate the general principles. The document is mostly restricted to designs using primitive subsystems.

DSPBA Basics

Top level: (See Figure 1)

The top level of the Model consists of

• Simulink test-bench - that is, blocks to provide inputs and to analyze inputs and outputs

• Required DSPBA top level parameterization blocks: Signals (bus clock specification, system clock specification) & Control (RTL output directory, top level threshold parameters)

• Links to open post-generation tools: Run Modelsim (open ModelSim to run generated RTL testbench and compare against Simulink at the synthesizable system level), Run Quartus (open the generated RTL project in Quartus to do a full Quartus compile)

• Other blocks, such as: Run All Testbenches (a UI for the scripts that run the system-level, ModelIP and primitive subsystem level Automatic Test Benches (ATBs)), and optional short-cuts to edit parameterization files that run on model start-up and/or pre- or post-simulation

Synthesizable System Top Level: (See Figure 2)

• The part of the design to be synthesized is separated hierarchically. What will form the top level of the synthesizable part is indicated by a Device block, which sets which family, part, speed grade etc to target.

• This level can consist of further levels of hierarchy that include Primitive Subsystems - scheduled domains for ModelPrim blocks (the low-level blocks such as delays, mults, adds) - and ModelIP blocks - the standalone macro functions (NCO, FIR, CIC)

• Optionally further LocalThreshold blocks can be included to override threshold settings defined higher up the hierarchy.

Primitive Subsystem Top Level: (See Figure 3)

• Primitive Subsystems are scheduled domains for the ModelPrim blocks. A SynthesisInfo block is required. Blocks to delimit the Primitive subsystem are also required: ChannelIn (Channelized Input), ChannelOut (Channelized Output), GPIn (General Purpose Input) and GPOut (General Purpose Output). Within these boundary blocks the tool will optimize the implementation specified by the schematic - including the insertion of pipelining registers required to achieve the specified system clock rate. When inserting pipelining registers, equivalent latency has to be added to parallel signals that are required to be kept synchronous so that they are scheduled together. Signals that go through the same input boundary block (ChannelIn or GPIn) are scheduled to start at the same point in time; Signals that go through the same output boundary block (ChannelOut or GPOut) are scheduled to finish at the same point in time. Any pipelining latency added to achieve Fmax is then added in balanced 'cuts' through the signals across the design. The correction to the simulation to account for this latency added in HDL generation is applied at the boundary blocks, such that the Primitive Subsystem as a whole will remain cycle accurate.
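The balancing step described above can be pictured with a small sketch. This is plain Python, not tool code, and `balancing_delays` is a hypothetical helper name: it shows only the idea that every synchronized path is padded up to the latency of the longest path, so that signals grouped by the same boundary blocks still finish together.

```python
def balancing_delays(path_latencies):
    """Given the pipelining latency (in cycles) assigned to each
    parallel path between one input boundary block and one output
    boundary block, return the balancing registers to add per path
    so that all paths finish at the same point in time."""
    longest = max(path_latencies)
    return [longest - latency for latency in path_latencies]

# Three parallel paths pipelined with 3, 1 and 0 registers: the two
# shorter paths are padded so the outputs stay synchronous.
print(balancing_delays([3, 1, 0]))  # [0, 2, 3]
```

An equivalent correction is applied to the simulation at the boundary blocks, which is why the Primitive Subsystem as a whole remains cycle accurate.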

• Note that further levels of hierarchy can be defined within primitive subsystems containing primitive blocks - (but no primitive boundary blocks or ModelIP blocks)

[pic]

[pic]

[pic]

Essential Checklists

1 Top Level Design

1. There must be a Control block and a Signals block at the top level

2. The synthesizable part of your design must be a subsystem or contained within a subsystem of the top level.

3. There must be a Device block at the hierarchical level of the synthesizable part of the design.

4. Test-bench stimulus data-types feeding into the synthesizable design will be propagated – so ensure they are correct. Switch on all of ‘Port Data Types’, ‘Signal Dimensions’ and ‘Wide Non-Scalar Lines’ from the Simulink model Format > Port/Signal Displays menu to have these annotated to your model so they are visible.

2 Primitive Subsystems

1. Don’t try to pipeline the system yourself – this is what the tool does for you using its internal timing models and integer linear programming. Only add Sample Delays where they are part of the algorithm; that is, where your algorithm explicitly requires you to combine data samples from different clock cycles. This can include feedback loops. If your design is not meeting timing, you may want to consider using the Clock Margin parameter on the top level Signals block, or on a LocalThreshold block.

2. The subsystem must contain a SynthesisInfo block with style set to Scheduled.

3. Primitive subsystems cannot contain ModelIP blocks.

4. All subsystem inputs with associated ‘Valid’ and ‘Channel’ signals that are to be scheduled together should be routed through the same ChannelIn blocks immediately following the subsystem inputs. Any other subsystem inputs should be routed through GPIn blocks.

5. All subsystem outputs with associated ‘Valid’ and ‘Channel’ signals that are to be scheduled together should be routed through the same ChannelOut blocks immediately before the subsystem outputs. Any other subsystem outputs should be routed through GPOut blocks.

6. Use Convert blocks to change data type preserving real-world value.

7. Use Set Via Dialog options to change data type preserving bit-pattern (with no bits added or removed), or to fix a data type.

8. Use Reinterpret Cast to change data type preserving bit-pattern (with no bits added or removed); for example, when converting a ‘uint32’ to ‘single’.

9. The valid signal must be a scalar boolean or ufix(1).

10. The channel signal must be a scalar uint8.

3 Verification

1. Turn on ‘Create Automatic Testbenches’ and ‘Coverage in Testbenches’ on the Control block. Note that stimulus capture for test-benches is done on the inputs and outputs of ModelIP blocks and by the ChannelIn, ChannelOut, GPIn and GPOut blocks.

2. Run all testbenches with the run_all_atbs script.

Recommendation Checklists

1 Simulink Design settings

• Set Simulink Configuration Parameters > Solver Options to “Fixed-step” / “discrete (no continuous states)”, unless folding, or you have multiple clocks in your test-bench, in which case set to “Variable-step” / “discrete (no continuous states)”. This gives faster simulation than continuous solvers and also correct results round loops.

• Tick all options on Format > Port/Signal Displays, except “Storage Class”. It is then clear which signals are complex, which are vectors, and what the data types are.

• Hide the names of unimportant blocks to de-clutter your design using Format > Hide Name

• If using From and Goto blocks, color linked blocks the same by selecting all linked Froms and Gotos (hold down shift while clicking on each block), then right click > Background Color. This makes the connectivity easier to trace.

• Annotate your designs. Just double click anywhere on the background and start typing. Use Simulink Documentation blocks to link to external documentation.

• Matlab Window > File & Folder comparisons is a great way to see what has changed between versions of your design.

2 Top Level Design

1. Use workspace variables to set parameters you may want to vary, including clock rates, sample rates, bit-widths, channel counts, etc.

2. Set workspace variables in initialization scripts. It is suggested that these are executed on the model’s PreLoadFcn and InitFcn callbacks, so that the design opens with parameters set, and any changes are reflected in the next simulation without having to explicitly run the script or close and re-open the model.

3. Call your main initialization script for the model ‘setup_<model name>.m’ and, as a shortcut to editing it, include the Edit Params block in the top level of your design. (This can be found by right-clicking on the Base Blocks library in the Simulink Library Browser and selecting ‘Open Library’.)

4. Build a test-bench that is parameterizable – i.e. will vary correctly with system parameters such as Sample Rate, Clock Rate, and number of channels. The Channelizer block in Beta Utilities Library may be useful for this.

5. Use the model’s StopFcn callback to run any analysis scripts automatically.

6. Build systems that make use of the valid and channel signals for control and synchronization, not latency matching; for example, by capturing valid output in FIFOs to manage data-flow.

7. Build up and use your own libraries of reusable components. You can even use the “Configurable Subsystem block” in libraries to provide a single link from which you can select library implementations in place. (See “Configurable Subsystem block” in the Simulink help).

8. Keep block and subsystem names short, but descriptive. Avoid names with special characters, slashes or beginning with numbers.

9. Use LocalThreshold blocks, in conjunction with the top level thresholds, for localized trade-offs or pipelining effort tweaks if necessary.

3 Primitive Subsystems

1. Make use of vectors to build parameterizable designs that don’t need redrawing when parameters such as the number of channels change.

2. Ensure there are sufficient Sample Delays around loops to allow for pipelining.

3. Data-type, complexity and vector width propagation is done by Simulink. Sometimes this is not successfully resolved round loops, particularly multiple nested loops. If propagation fails, look for where data-types are not annotated; you may have to set data types explicitly. Simulink also provides a library of blocks to help in such situations, which duplicate data types and so on: fixpt_dtprop (type ‘open fixpt_dtprop’ at the Matlab command prompt to open it). The ‘Data Type Prop Duplicate’ block is used in the Control library latches, for example. These blocks are guides to Simulink on data-type propagation and do not produce hardware.

4. If routing within a Primitive Subsystem is getting complex, consider using Simulink From / Goto blocks to replace connections. Make sure that the Tag Visibility on the Goto blocks is Global if crossing subsystems within a primitive subsystem. You can color code blocks too (right click > background color) to make connections more obvious.

4 Design Style

• Don’t try to pipeline the design yourself – this is what the tool does for you using mathematical linear programming techniques. If you need more pipelining, use a positive Clock Margin (see Signals block).

• Don’t try to synchronize the output of different parallel subsystems using explicit delays. Use FIFOs, as this will give a more device-portable, fmax target independent design.  

• Break designs up hierarchically to make your design understandable. However, keep consecutive primitive subsystems together within single ChannelIn/Out blocks, as this gives greater scope for scheduling and pipelining optimizations.

• If you think you need complex control with complex feedback or cycle-counting from the data path, think again. Look at the Mandelbrot design and understand what it is doing. It creates command instructions which are placed in a FIFO. The instructions are consumed by the data-path as fast as it can run. The result is a design that runs as fast as it can and is portable between device families, and which reduces the complexity of the control logic.

5 Verification

1. Remember that output is only guaranteed to match hardware when valid is high.

2. run_modelsim_atb displays the command it executes. This command can be cut and pasted into a ModelSim UI opened at the same directory and run manually. Using this you can analyze the behavior of particular subsystems in detail, and can force simulation to continue past errors if necessary.

3. If using FIFOs within multiple feedback loops, it is possible that while the data-throughput and frequency of invalid cycles are the same, their distribution over a frame of data may vary (due to the final distribution of delays around the loop). If a mismatch is found, it is therefore worth stepping past errors using the above process to check whether this is the case.
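As a toy illustration of this point (plain Python, not tool output, with made-up sample values): the two runs below have the same throughput and the same valid data sequence, even though their invalid cycles land on different cycles.

```python
def valid_data(samples, valid):
    """Keep only the samples on cycles where valid is high."""
    return [s for s, v in zip(samples, valid) if v]

# Simulink vs. hardware runs of the same frame: the invalid cycle
# falls in a different place, but the valid data sequences match.
simulink = valid_data([7, 8, 0, 9, 10], [1, 1, 0, 1, 1])
hardware = valid_data([7, 0, 8, 9, 10], [1, 0, 1, 1, 1])
print(simulink == hardware)  # True
```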

4. Floating point simulation is compared to within a tolerance. Differences are likely to be in the few least significant bits only, but could potentially be higher if the function you are implementing is ill-conditioned. Larger relative differences can also arise in complex multiplication with large complex numbers that lie close to the real or imaginary axis.

5. Use the Run All Testbenches block to control test-benches – and to access the override feature, where ModelSim results can be automatically imported back in the Matlab and a custom Matlab function used to verify and provide the pass/fail criteria. (See appendix).

FAQ

1) What do the ChannelIn and ChannelOut blocks do?

The optimizations performed by the tool operate within subdomains of the whole design: individually within each ModelIP block (FIR, NCO etc) and within each Primitive subsystem.

A Primitive subsystem is a Simulink subsystem with a ModelPrim SynthesisInfo block, inputs and outputs (at the SynthesisInfo level) passing through the boundary blocks ChannelIn/Out or GPIn/Out at the subsystem I/Os, and containing that part of the design built from ModelPrim blocks. It can contain further Simulink subsystems, but no nested ModelIP blocks or further SynthesisInfo blocks.

The ChannelIn and ChannelOut blocks (and GPIn and GPOut) delimit the boundaries of a primitive subsystem. They group signals (either with (ChannelIn/Out) or without (GPIn/Out) related channel and valid signals) at the boundary to be scheduled together. When determining the pipelining to be added in order to achieve the desired Fmax, the tool needs to know which signals should be kept synchronized, such that adding latency to one will require balancing delays to be added to the synchronous signals. Pipelining is then added in balanced ‘cuts’ through the synchronized signals, such that the added delay can be corrected for (in most cases) at the subsystem level just by adding simulation delays in the appropriate boundary blocks.

See 5.5 below for further details.

2) If I have a block with data flow and a parameter (e.g. gain), where a single parameter value is given at any time without any timing relation to the data flow, how should these be connected?

If the signals are independent and do not have to remain synchronous, then you can put them through separate boundary blocks.

[pic]

3) What does the error message “Warning: Negative IO correction on block //ChannelOut. Simulink will not match hardware” mean?

See 5.5.2.1 below.

4) What does the error message “Unable to determine data types for some ports, cannot continue” mean?

It means that the data type of a signal cannot be uniquely determined from the word growth and inheritance rules set on the blocks. This can arise in feedback loops with inherit or growth rules or in blocks with unconnected inputs. Consider the following example where the multiplier and the SampleDelay both have ‘Output data type mode’ as “Inherit via internal rule”.

[pic]

The word growth rule for multipliers is to add the integer and fractional bit widths of the inputs to get the output type. The SampleDelay preserves the input type without change by default. It can be seen here then that the output type of the Mult cannot be determined:

If input A is sfix17_En16, and input B (the Mult output, fed back through the SampleDelay) is sfix(P)_En(Q), then the output is sfix(17+P)_En(16+Q) under the default inheritance rules, i.e. P = P+17 and Q = Q+16, which has no solution for P and Q.
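To see why no type satisfies the loop, the default word-growth rule can be iterated in a small sketch (plain Python, not tool code, representing a type as a pair of total and fractional bit widths): the widths grow without bound instead of settling on a fixed type.

```python
def mult_growth(a, b):
    """Default multiplier word-growth rule: the total and fractional
    bit widths of the two inputs are added to form the output type."""
    return (a[0] + b[0], a[1] + b[1])

a = (17, 16)    # input A: sfix17_En16 (17 total bits, 16 fractional)
b = (1, 0)      # initial guess for the fed-back type of input B
for _ in range(5):
    # the SampleDelay feeds the multiplier output straight back, so
    # the new guess for B is the previous output type
    b = mult_growth(a, b)
    print(b)    # widths keep growing: P = P + 17 never converges
```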

We must fix the type in this loop explicitly. This can be done by setting a type on one of the blocks, by adding a Convert block to set the type, or by using the Simulink “Data Type Prop Duplicate” block from the Simulink fixpt_dtprop library, which copies the type from the signal attached to ‘Ref’ to the signal attached to ‘Prop’.

[pic]

This method may be favorable as it adapts to the input type, though an alternative approach is to write a flexible Matlab expression that is evaluated to set the type. Note that other simple propagation type blocks from this library can also be used, for example:

[pic]

5) What does the error message “Failed to distribute memory in your design” mean?

[pic]

The tool automatically inserts the pipelining required to meet the chosen clock speed, based on internal timing models. For example, to run an 18-bit multiplier at 400MHz, the timing models may suggest that 3 register stages are required across the multiplier block.

Suppose you have a feedback loop where a latency of 5 clock cycles is specified (for example if your data is for 5 channels in sequence). We can satisfy both criteria: pipelining for fmax and re-circulating the data in 5 clock cycles. Rather than adding the 5 cycles of delay specified by the SampleDelay as 5 new registers, 3 of those delays will be formed from the delay required across the multiplier and only 2 will be implemented as external registers. The delay has been ‘distributed’ around the loop. More complex, multiply nested loops require more complex delay redistributions – but all of this is solved using standard mathematical linear programming techniques.
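A minimal sketch of the feasibility check for a single loop (plain Python; `distribute_loop_delay` is a hypothetical name, not the tool's internal API, and real designs with nested loops need the linear-programming formulation described above):

```python
def distribute_loop_delay(specified_delay, pipeline_needed):
    """Can the delay specified round a feedback loop absorb the
    pipelining the timing model requires?  Returns a tuple of
    (feasible, external_registers, cycles_deficient)."""
    spare = specified_delay - pipeline_needed
    if spare >= 0:
        return (True, spare, 0)
    return (False, 0, -spare)

# 5-cycle loop, 3 stages needed across the multiplier: feasible,
# with only 2 cycles implemented as external registers.
print(distribute_loop_delay(5, 3))  # (True, 2, 0)
# 1-cycle loop, 3 stages needed: infeasible, '2 cycles deficient'
# as in the error message quoted below.
print(distribute_loop_delay(1, 3))  # (False, 0, 2)
```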

Now suppose you have a feedback loop where a latency of 1 clock cycle is specified. The two criteria, pipelining to achieve fmax (implying a latency of at least 3 clock cycles) and re-circulating the data in 1 clock cycle, cannot be satisfied simultaneously, no matter where we distribute the 1 cycle of delay specified. In this case an error is given:

Failed to distribute memory in your design. Found insufficient delay attempting to satisfy fMax requirement for [subsystem]. Failed to satisfy the following latency constraints: (ParallelPathPair 0): Mult SampleDelay (2 cycles deficient)

What this says is that the design as specified is imposing a restriction on the pipelining that is required to meet the clock speed requirement, such that both cannot be satisfied simultaneously.

It may be possible in some cases to re-implement the algorithm to avoid loops, or to run the design faster but push the data through at the same rate (for example, if running at 100MHz with a new data sample every clock cycle, instead run at 300MHz with a new data sample every 3 clock cycles; DSPBA optimizations and timing characterization are currently targeted at high clock rates). Folding (manual or automatic) can then be used to reduce hardware resources elsewhere in the design if the clock rate exceeds the sample rate.
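The arithmetic behind that trade-off can be sketched as follows (plain Python, assuming an integer clock-to-sample ratio; `loop_budget_cycles` is a hypothetical helper, not a tool function):

```python
def loop_budget_cycles(loop_sample_delay, clock_mhz, sample_mhz):
    """Clock cycles available round a feedback loop when the clock
    runs at an integer multiple of the sample rate: each sample of
    delay is worth clock/sample cycles."""
    cycles_per_sample = clock_mhz // sample_mhz
    return loop_sample_delay * cycles_per_sample

# 1-sample loop, 100MHz clock with 100MHz data: only 1 cycle available.
print(loop_budget_cycles(1, 100, 100))  # 1
# Same loop at 300MHz clock, data every 3 cycles: 3 cycles available,
# enough to absorb the 3 pipelining stages in the earlier example.
print(loop_budget_cycles(1, 300, 100))  # 3
```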

6) My design worked, I turned on folding for a Primitive Subsystem and got a Simulink error: “S-function '//ChannelOut' method mdlSetInputPortSampleTime cannot change the sample time of ports once they have been set.”

Simulink has propagation and setting rules for data types, sample rates, etc. that attempt to resolve and fix these fields for each port. Folding changes the Simulink sample rate at which the primitive subsystem runs. If you get this message it’s because there is a conflict in sample time settings: the ChannelOut has been set to run at the folded sample rate, but a block within the primitive subsystem itself has an explicit sample time set on it that conflicts when propagated forwards to the ChannelOut. Check that the sample times of the blocks in the primitive subsystem are set to ‘-1’ (inherit) where appropriate.

7) How should we use the Avalon Blocks?

The Avalon-MM (ModelBus) and Avalon-ST blocks are used in different ways. Refer to the DSP Builder documentation on how to use these. For flow control, the Avalon-ST output “ready” signal should be looped back to the Avalon-ST input “ready” signal. This is shown in the diagram below.

[pic]

8) Is it possible to have a portion of the design depend on some variables? e.g. having clockrate/N adders, where clockrate is defined in the parameter file.

There are several ways to do this. The first is through the use of vectors, where the vector size determines the number of blocks produced. Vectors are very useful in building parameterizable components. The other is to create a self-initializing subsystem component – see 5.2.2.1 for an example of a block which is really a self-initializing subsystem.

9) How do we initialize the value in a register (SampleDelay)?

This is not currently supported directly. Delays specified by SampleDelays can get redistributed around the system – and hence implemented as registers in memory blocks or multipliers where initialization is impossible.

10) What is the list of supported Simulink blocks that can be used for HW generation?

Mux, Demux, From, Goto, Subsystem ports (Out1, In1), Terminator, Constant, Selector (static Vector selection only), Complex to Real-Imag, Real-Imag to Complex, Configurable Subsystem (with some restrictions), Data-type propagation (with some restrictions).

11) Can I mix VHDL with Simulink (or other design languages?)

The HDL Import block can be used in a Standard Blockset level hierarchically above the DSPBA design. See documentation on HDL Import and mixed Blockset designs.

12) Can I create my own equivalent of the ‘Edit Params’ block

Yes. The Simulink documentation covers such matters.

It is deliberate that we do not show the Edit Params block in the DSPBA library browser: the block itself does nothing other than open a file for editing. The user has to create the script and set up the pre-load functions in the model properties to use it correctly. It is not something you can just drag and drop onto your model.

You can achieve the same by creating any m-script which is run in the model’s set-up stage (PreLoadFcn), or indeed any other such stage if necessary.

The Edit Params block assumes that the name of this script is “setup_<model name>.m” – but this is just the convention for this use case – you could call it whatever you like (e.g. my_script.m).

Use File > Model Properties > Callbacks to get your script to run before simulation. For a design demo_duc using Edit Params, you have to add setup_demo_duc to the PreLoadFcn (so that the parameters exist on loading) and to the InitFcn (so that changes you make with the model open before running simulation will be included in the simulation run). If you called the script my_script.m, then just put my_script; at the appropriate stages.

[pic]

The Edit Params block is a masked subsystem that has been given an OpenFcn to open the m-script “setup_<model name>.m” so you can edit it:

s = sprintf('edit setup_%s', eval('gcs'));    % build the 'edit' command using the current model name

eval(s);                                      % execute the command

Drop in a Simulink subsystem. Go in and remove the default ports. Back out, right click > Block Properties and set the OpenFcn:

[pic]

Alternatively, if you called your set-up script ‘my_script.m’, the OpenFcn would be:

[pic]

The block will now open the script for editing when clicked, and will look like this:

[pic]

You can call this subsystem block whatever you like, or hide the name; it doesn’t matter. All it is is a way of opening the set-up script for editing. The important thing is the OpenFcn.

You can even add a picture or graphic for it. For example, if you have a picture “ant.jpg” in a directory included in your Matlab path (File > Set Path), you can right click on the subsystem block > Edit Mask and add something like the following (which sets the image, sets the text color to white, and writes “Antonnios Set Up Script” across it):

[pic]

To give

 

[pic]

You can debate whether this is an improvement.

13) Is it possible to restrict the scope of the variables defined in the setup script to the model they apply to only?

The recommendation is to create a structure of variables for the model to avoid ambiguity if running multiple models. The Simulink help also has some information on the scope of workspace and model variables.

Design Style

1 Getting Started

We recommend you follow the checklists outlined above in setting up your system.

The demonstration designs often make good starting points, and can be copied, renamed and saved into a new working directory along with any setup scripts.

There are also a couple of scripts to get you started on building the basics of a DSPBA design.

1 createPrimitiveFIR

createPrimitiveFIR creates a complete FIR filter design using DSPBA primitive blocks. There are several ways to pass parameters to it: an ordered short-list of parameters, a MATLAB struct, or name-value pairs.

The command line call

createPrimitiveFIR(NAME,COEFFS,NUMCHANS,COEFTYPE,COEFSIGNALTYPE,DATASIGNALTYPE)

creates a design called NAME whose filter taps are given by COEFFS.

createPrimitiveFIR(NAME,PARAMSTRUCTURE), where PARAMSTRUCTURE is a MATLAB struct, allows you to pass the parameters in as a struct, with any unset parameters reverting to the defaults listed below. The struct should have fields with the names of the parameters as below (case-insensitive).

createPrimitiveFIR(NAME, PARAMNAME1, PARAMVALUE1, PARAMNAME2, PARAMVALUE2, ...) is as above but with the parameters passed in as name-value pairs.

|Parameter |Description / values |

|NAME |Name of model to create |

|COEFTYPE |Affects how the coefficients are stored as follows: |

| |Constant (default) - stored in constant blocks |

| |Read - stored in RegField blocks and can be read via the bus interface. |

| |Write - stored in RegField blocks and can be written via the bus interface. |

| |Readwrite - stored in RegField blocks and can be read and written via the bus interface. |

|COEFSIGNALTYPE |The simulink type for coefficient values. It defaults to 'sfix16_En15'. |

|DATATYPE |The simulink type for the input sources. It defaults to 'sfix16_En15' |

|COEFFS |If this is an array these are the coefficients of the filter. The filter will have as many taps as |

| |there are coefficients. |

|TYPE |'single' creates a single rate FIR (default) |

| |'decim' creates a decimating FIR |

| |'interp' creates an interpolating FIR |

|SYMMETRY |'off' creates a non-symmetric FIR (default) |

| |'on' creates a symmetric FIR |

| |'anti' creates an anti-symmetric FIR |

|BAND |'full' creates a full-band FIR (default) |

| |'half' creates a half-band FIR |

| |num2str(X) creates 1/X band FIR. e.g. '4' creates a quarter band FIR |

|DECHAN |false wires the resulting FIR directly to a scope |

| |true wires the resulting FIR to a ChanView block which is then wired to a scope |

|CLOCKRATE |This is the clock rate in MHz. It defaults to 200. |

|SAMPLERATE |This is the rate of the data in MHz. It defaults to the same as CLOCKRATE except if interpolating, in |

| |which case it defaults to CLOCKRATE/2 |

|COMPAREMODELIP |If set to true, this generates an equivalent ModelIP filter alongside and creates assertion blocks to|

| |verify that both systems are equivalent. |

|MODELIPONLY |If set to true, a ModelIP FIR is created with no primitive FIR |

|RUNCHECK |If set to true then the design is simulated immediately after creation |

|QUARTUSCOMPILE |If set to true then the design is run through Quartus (requires RUNCHECK=true). |

|RAMTHRESHOLDBITS |This is the threshold set in the "CDelay RAM Block Threshold" parameter on the control block |

|FAMILY |Device family. Accepted values are: Stratix, Stratix GX, Stratix II, Stratix II GX, Stratix III, |

| |Stratix IV, Cyclone II, Cyclone III, Arria II GX, Cyclone III LS |

|SPEEDGRADE |Device speed grade. Accepted values are: fast, medium and slow |

|REPLACEMODEL |If set to true, existing models with the same name will be closed and replaced |
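As a sketch of the three calling styles described above (the coefficient values and parameter settings here are purely illustrative, and the positional argument order follows the signature given earlier):

```matlab
% Illustrative coefficients for a short symmetric filter (values are arbitrary).
coeffs = [1 2 3 4 3 2 1] / 16;

% 1) Ordered short-list of parameters:
%    createPrimitiveFIR(NAME,COEFFS,NUMCHANS,COEFTYPE,COEFSIGNALTYPE,DATASIGNALTYPE)
createPrimitiveFIR('myFir', coeffs, 4, 'Constant', 'sfix16_En15', 'sfix16_En15');

% 2) Parameter struct: unset fields revert to the defaults in the table above.
p.coeffs    = coeffs;
p.type      = 'decim';        % decimating FIR
p.symmetry  = 'on';           % symmetric FIR
p.clockRate = 200;            % MHz
createPrimitiveFIR('myFirStruct', p);

% 3) Name-value pairs (names are case-insensitive).
createPrimitiveFIR('myFirNV', 'coeffs', coeffs, 'band', 'half', 'runcheck', true);
```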

2 createDSPBADesign

createDSPBADesign creates an empty primitive subsystem design with required blocks in place. The default parameters are:

defaults.dataInputs = 1;

defaults.dataOutputs = 1;

defaults.chanCount = 1;

defaults.sourceType = 'constant';

defaults.sourceValue = 'fixed';

defaults.sourceScale = 1;

defaults.sourceImpulseGap = 10;

defaults.sourceSignalType = 'sfix(32)';

defaults.wireUpValidAndChan = true;

defaults.dechan = false;

defaults.clockRate = 200.00;

defaults.sampleRate = 200.00;

defaults.scopeInputs = false;

defaults.primitiveSampleRate = 200.00;

defaults.primitiveChannels = 1;

defaults.subsystemNames = {'subsystem'};

defaults.subsystemTypes = {'prim'};

defaults.matchDelays = false;

defaults.filterReference = 'DSPBAFilters/SingleRateFIR';

defaults.filterParams = {'nInputRate', 'SampleRate', 'nchan', 'ChanCount', 'symmetry', 'Non Symmetrical', 'addr', '0'};

defaults.ramThresholdBits = '-1';

defaults.family = 'Stratix II';

defaults.replaceModel = false;

defaults.speedGrade = 'fast';

These defaults are used if the function is called with just a design name; e.g. createDSPBADesign(‘foo’) creates a design with these settings.

The defaults can be overridden by creating a corresponding struct of parameters and passing it as a second argument; e.g. createDSPBADesign(‘foo’, myparams)
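A minimal sketch of overriding a few of the defaults listed above (the field values chosen here are illustrative):

```matlab
% Override only the fields of interest; all other fields keep the
% defaults listed above.
myparams.dataInputs = 2;
myparams.chanCount  = 4;
myparams.clockRate  = 250.00;   % MHz
myparams.sampleRate = 125.00;   % MHz
myparams.family     = 'Stratix IV';

createDSPBADesign('foo', myparams);
```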

2 Using Vectors

1 Use of vectors: with ModelPrim blocks

The use of vectors has many advantages: it makes designs more parameterizable, speeds up simulation and simplifies the schematic. Vectors avoid cut-and-paste duplication in many instances, and enable flexible designs which scale with input vector width.

This section illustrates the use of some vector features for building more parameterizable design components. Most also have an associated design example.

1 Matrix initialisation of vector memories

Demo: demo_dualmem_matrix_init

Both the dual memory and LUT primitive blocks can be initialized with matrix data.

This feature is useful in designs that handle vector data and require individual components of each vector in the dual memory to be initialized uniquely.

The addressable size of the dual memory is determined by the number of rows in the 2D matrix provided for initialisation. The number of columns must match the width of the vector data. So the nth column specifies the contents of the nth dual memory. Within each of these columns the ith row specifies the contents at the (i-1)th address (since the first row is address zero, the second row address 1, and so on).

The exception to this row / column interpretation of the initialization matrix is for 1D data, where the initialization matrix consists of either a single column or a single row. In this case the interpretation is flexible and maps the vector (row or column) into the contents of each dual memory (i.e. the previous behaviour, in which all dual memories have identical initial contents).

The demo_dualmem_matrix_init example shows use of this feature. It also uses complex values in both the initialisation and the data that is later written to the dual memory. The contents matrix is set up in the model’s set-up script, run on model initialization. Click on ‘Edit Params’ to see this.
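A sketch of constructing such an initialization matrix in a model set-up script, following the row/column convention above (the variable names and sizes here are illustrative, not taken from the demo):

```matlab
% Width-3 vector dual memory with 8 addressable locations:
% rows are addresses, columns are vector components.
depth = 8;
width = 3;

% Complex contents are allowed, as in demo_dualmem_matrix_init.
init = complex(randn(depth, width), randn(depth, width));

% init(i, n) initializes address i-1 of the nth dual memory.
% A 1xN or Nx1 vector instead gives every dual memory identical
% initial contents (the previous scalar behaviour).
```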

2 Matrix initialization of LUT demo

Demo: demo_lut_matrix_init

LUTs (Look Up Tables) can be initialized in exactly the same way. The demonstration example feeds a vector of addresses to the primitive block such that each vector component is given a different address. This also shows LUTs working with complex data types.

The figure below shows the equivalent system, with each LUT initialized individually. Using the Matrix avoids having to demux – connect – and mux, so that parameterizable systems can be built.

3 Vector initialization of sample delay demo

Demo: demo_sample_delay_vector

When the sample delay primitive block receives vector input, it is possible to independently specify a different delay for each of the components of the vector.

The demo_sample_delay_vector design example shows that one sample delay can replace what would have previously required a DeMUX-SampDelay-MUX combination.

Individual components may even be given zero delay resulting in a direct feed through of only that component. Care must still be taken to avoid algebraic loops if some components are chosen to be zero delays.

This of course only applies when vector data is being read and output. A scalar specification of delay length still has the prior behaviour of setting all the delays on each vector component to the same value. It is an error to specify a vector that is not the same length as the vector on the input port. A negative delay on any one component is also an error. However, as in the scalar case, it is allowable to specify a zero length delay for one or more of the components.
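A sketch of the delay-length specification for a vector input (values illustrative):

```matlab
% Per-component delays for a width-4 vector SampleDelay.
delays = [0 1 2 3];   % component 1 has zero delay: direct feed-through

% A scalar (e.g. 2) delays every component by the same amount, as before.
% A vector whose length differs from the input width, or any negative
% entry, is an error; zero entries are allowed but beware algebraic loops.
```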

2 Use of vectors: Additional Libraries > Vector Util Library

Often the ability to build a vector-parameterizable library component is blocked by the need for a parameterizable way to go between single connections and vectors – either by replication or selection – for example, replicating a single signal N times to form a vector. If you had to redraw and reconnect this whenever the desired vector width changed, the ability to parameterize would be lost. Fortunately it is fairly straightforward to use Simulink commands in the initialization of Masked Subsystems to do the parameterization and reconnection automatically.

There are some examples of this in the Vector Util Library. They all use standard Simulink commands for finding blocks, deleting blocks and lines, adding blocks and lines and positioning blocks. As such users could use this technique to build parameterizable utility functions themselves.

1 Expand Scalar

Expand scalar just takes a single connection and replicates it N times to form a width N vector. This is done by passing on the width parameter to a Simulink mux under the mask, and using some standard Simulink commands to add the connections lines.

2 Vector Mux

3 Tapped Delay Line

The Tapped Delay Line makes use of Latches from the Control library. See the section on Flow Control later. Again this is an auto-generating subsystem. The initialization script is shown above, alongside the subsystem it will generate for 4 taps. Note, gcb is Simulink shorthand for ‘get current block’ – i.e. get the current subsystem we’re parameterizing.
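A minimal sketch of the auto-wiring technique these library blocks use in their mask initialization scripts. It uses only standard Simulink commands (find_system, delete_line, add_line, gcb); the block names and the mask parameter N are hypothetical:

```matlab
% Runs as a Masked Subsystem initialization script; N is a mask parameter.
sys = gcb;   % path of the subsystem currently being parameterized

% Remove the existing wiring before rebuilding it for the new width.
oldLines = find_system(sys, 'FindAll', 'on', 'LookUnderMasks', 'all', ...
                       'Type', 'line');
for h = oldLines.'
    delete_line(h);
end

% Reconnect a single input to each port of an N-wide Simulink mux.
% (Assumes blocks named 'In1' and 'Mux' exist under the mask.)
for k = 1:N
    add_line(sys, 'In1/1', sprintf('Mux/%d', k), 'autorouting', 'on');
end
```

The same pattern (find, delete, add, position) extends to adding and placing blocks with add_block, so users can build similar parameterizable utilities themselves.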

Note that vector signal input is not supported for this block.

3 Use of vectors: Other supported Vector blocks

1 Simulink Selector (partial support) for static Vector selection

The Simulink Selector block enables selection of some signals out of a vector of signals, including operations such as reordering. Currently only dialog selection is supported – equivalent to a static selection. Port selection is not currently supported. The index and input port size can be set via workspace variables.

Supported Features

– Number of input dimensions – 1 only

– Index mode : Zero-based or One-based

– Index Options

– Select all

– Index Vector (dialog)

– Starting index (dialog)

Unsupported Features

– Multi-dimensional input

– The following Index Options are unsupported

– Index Vector (port)

– Starting index (port)

2 Examples

1 Reverse Order

[pic]

2 Select every third wire

[pic]

3 Select every third wire – reverse order

[pic]

4 Interleaved replication

[pic]

5 Just the first

[pic]

6 First half of vector signals

[pic]

This is a Simulink block, so you can refer to Simulink help for further information. In hardware, this just synthesizes to wiring.

1 Implementing Flow Control

The Advanced Blockset tool encourages the use of Valid and Channel signals alongside data to indicate when data is valid and for synchronization. The user is encouraged to build designs using these to process valid data and ignore invalid data cycles in a streaming style that makes best use of the FPGA. This way designs can be built that run as fast as the data allows, that are not sensitive to latency or device fmax, and which can be responsive to back-pressure.

The style involves use of FIFOs for capture and flow control of valid outputs, Loops and For Loops for simple and complex nested counter structures, and ‘latches’ to enable only components with state – thus minimizing enable line fan-out, which can otherwise be a bottleneck to performance.

1 Flow control using latches

A latch normally has bad connotations for hardware designers. Here, however, these subsystems just synthesize to enabled flip-flops, so ‘flip-flop’ or ‘sample-and-hold’ might be other ways to describe these.

1 Additional Libraries > Control

Often designs require that signals are stalled or enabled. The approach of having an enable signal routed to all the blocks in the design can lead to high-fan-out nets, which become the critical timing path in the design. A way to avoid this is to enable only blocks with state, while marking output data as invalid when necessary.

To do this a number of utility functions have been created in the Additional Blocks > Control library. These are all just Masked Subsystems. Looking underneath shows the blocks used.

Aside: Note that some of these blocks make use of the Simulink Data Type Prop Duplicate block. This takes the data type of a reference signal ‘Ref’ and back-propagates it to another signal ‘Prop’. This is a good way of matching data-types without forcing an explicit type that can be used in other areas of your design.

1 Zero Latency Latch


For the zero latency latch the enable signal has an immediate effect on the output. While the enable is high the data passes straight through. When it goes low, the data input from the previous cycle is output and held.

2 Single-Cycle Latency Latch

For the single-cycle latency latch the enable signal affects the output on the following clock cycle.

These latches work for any data type, and for vector and complex signals.

3 Reset Priority Latch

There are also two single-cycle latency latch subsystems for common operations on the valid signal: latching with set and reset. The SRLatch gives priority to the reset input signal, whereas the SRlatch Priority Set gives priority to the set input signal. In both cases, if the set and reset inputs are both zero the current output state is maintained.

Table 1: Truth table for SRLatch (reset priority)

|S |R |Q |

|0 |0 |Q |

|1 |0 |1 |

|0 |1 |0 |

|1 |1 |0 |

4 Set Priority Latch

Table 2: Truth table for SRlatch Priority Set

|S |R |Q |

|0 |0 |Q |

|1 |0 |1 |

|0 |1 |0 |

|1 |1 |1 |
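The two truth tables above correspond to the following next-state equations (a behavioural sketch, not the library implementation):

```matlab
% q is the current output; the result is the next output.
srlatch_reset_priority = @(s, r, q) ~r & (s | q);  % Table 1: R wins when S=R=1
srlatch_set_priority   = @(s, r, q)  s | (~r & q); % Table 2: S wins when S=R=1

% For example, with both set and reset asserted:
% srlatch_reset_priority(true, true, false) is false,
% srlatch_set_priority(true, true, false)   is true.
```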

1 Using latches to implement forward flow control

Example: demo_forward_pressure

Here we have a sequence of three FIR filters that stall when the valid signal is low, preventing invalid data polluting the data-path.

[pic]

If we look inside one of these subsystems we see a regular filter structure, but with a delay line implemented in single-cycle latches; effectively an enabled delay line.

Note: We don’t need to enable everything in the filter (multipliers, adders etc), just those blocks with state (the registers), then take account of the output valid signal – pipelined alongside the logic by the tool - and look at the valid output data only.

[pic]

Figure 12: Simple filter with enabled delay chain - full layout

Note here how the first latch is a zero latency latch, and all others are single cycle.

[pic]

Figure 13: Output of demo_forward_pressure, stalled when valid is low – but without the need for a high fanout enable net.

Of course we could also use vectors to simplify the constant multipliers and adder tree – which would also speed up simulation.

[pic]

Figure 14: Vectorised FIR stage

This can be improved further by making use of another Masked Subsystem utility block from the Vector Utils library – the TappedDelayLine. See 2.4.3 above.

[pic]

Figure 15: Vectorized FIR stage using TappedDelayLine from Vector Utils Library

2 Flow control using FIFOs

FIFOs can be used to build flexible, ‘self-timed’ designs insensitive to latency, and are an essential component in building parameterizable designs with feedback, such as those that implement back-pressure.

This section describes the basic operation of the FIFO block. A specific example is used to illustrate some of the behavior and the requirements for safe operation.

Note that the DSP Builder Advanced Blockset FIFO is a single clock FIFO in ‘show-ahead’ mode. That is the read input, r, is a read acknowledge which means ‘I have read the output data, q, from the FIFO, so you can get rid of it and show the next data output on q.’ The data presented on q is only valid if the output valid signal, v, is high.

3 FIFOs for flow control and back-pressure

[pic]

Figure 16: demo_back_pressure example design

This design shows how back pressure from a downstream block can halt upstream processing. There are 3 FIRs that are designed using conventional DSPBA techniques (see Figure 5 above). Each FIR is followed by a FIFO that can buffer any data that is flowing through it. If the FIFO becomes half-full then the ready signal back to the upstream block is deasserted. This prevents any new input (as flagged by valid) entering the FIR block. The FIFOs always show the next data if it is available and the valid signal is asserted high. This FIFO valid signal must be ANDed with the ready signal to actually consume the data at the head of the FIFO. If the AND result is high then we can consume data because (1) it is available and (2) we are ready for it.

Several blocks can be chained together in this way, and no ready signal has to feed back further than one block. This allows modular design techniques with local control to be used.

The delay in the feedback loop represents the lumped delay that will be spread throughout the FIR block. The delay must be at least as big as the delay through the FIR. This delay is not critical. Experiment with some values to find the right one. The FIFO must be able to hold at least this many data items after full has been asserted. This means that the full threshold must be at least this delay amount below the size of the FIFO (64-32 in this example).
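The sizing rule above can be expressed as a quick arithmetic check (variable names are illustrative; the 64 and 32 are the example figures quoted in the text):

```matlab
fifoDepth = 64;   % total FIFO size
loopDelay = 32;   % lumped delay around the feedback loop (>= FIR delay)

% After full is asserted, up to loopDelay items may still arrive, so the
% full threshold must sit at least loopDelay below the FIFO size.
maxThreshold = fifoDepth - loopDelay;   % 64 - 32 = 32 in this example
```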

The final block uses an external ready signal that will come from a downstream block in the system.

4 Some notes on safe operation of FIFOs

[pic]

Figure 17: Simple FIFO model, used to illustrate considerations in safe usage

Where the user has to be careful is in acknowledging reading of invalid output data. This will be illustrated with an example. In the design shown above, the FIFO parameters are depth = 8, fill_threshold = 2, fill_period = 7. The resulting ModelSim behavior of the FIFO hardware:

Note that there is a three-cycle latency between the first write and valid going high. The q output has a similar latency in response to writes. The latency in response to read acknowledgements is only 1 cycle for all output ports. The valid out goes low in response to the first read, even though two items have already been written to the FIFO. This is because the 2nd write is not older than 3 cycles when the read occurs.

Note also that with the fill_threshold set to a low value, the t output can go high even though the v output is still zero. Also, the q output stays at the last value read when valid goes low in response to a read.

Problems can occur when no feedback is used on the read line, or if the feedback is taken from the t output instead with fill_threshold set to a very low value (< 3). A situation may arise where a read acknowledge is received shortly following a write but before the valid output goes high:

In this situation, the internal state of the FIFO doesn't recover for many cycles. Instead of attempting to reproduce this aberrant behavior, the Simulink implementation issues a warning when a read acknowledge is received while the valid output is zero. This intermediate state between the first write to an empty FIFO and the valid going high highlights another aspect of FIFO behavior to be aware of: the input to output latency across the FIFO is different in this case. This is the only situation in which the FIFO behaves with a latency greater than 1 cycle. With other primitive blocks – which have consistently constant latency across each input to output path – the model designer never has to consider these intermediate states. This is not so for the FIFO.

This issue can be sufficiently mitigated by proper care when using the FIFO. The model needs to ensure that the read is never high when valid is low using the simple feedback as shown above. And if the read input is derived from the t output, ensure that a sufficiently high threshold is used. This is made explicit in the following points.

1. Due to differences in latency across different pairs of ports: from w to v is 3 cycles, from r to t is 1 cycle, from w to t is 1 cycle; it is possible to set fill_threshold to a low number (= Depth of FIFO – L.
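The safe-usage rule described above – never acknowledge a read while valid is low – can be sketched as a behavioural check (signal names here are illustrative, not tool port names):

```matlab
% FIFO valid output over 8 cycles, and a downstream ready indication.
v     = logical([0 0 0 1 1 1 0 1]);
ready = true(1, 8);

% Gate the read acknowledge with valid: r can only be high when v is high.
r = v & ready;

% The unsafe condition (read acknowledged while output is invalid)
% can then never occur.
assert(~any(r & ~v));
```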

2 Avalon-ST Input

The signals that interface to the external system are:

|sink_ready |output |indicates to upstream components that the DSPBA component can accept sink_data on this rising |

| | |clock edge |

|sink_valid |input |indicates that sink_data, sink_channel, sink_sop and sink_eop are valid |

|sink_channel |input |channel number |

|sink_sop |input |indicates start of packet |

|sink_eop |input |indicates end of packet |

|sink_data |input |the data (which may be, or include control data) |

The signals that interface internally with the DSPBA design component are:

|input_ready |input |indication from the output of the DSPBA component that it can accept sink_data on this rising |

| | |clock edge |

|input_valid |output |indicates that input_data, input_channel, input_sop and input_eop are valid |

|input_channel |output |channel number |

|input_sop |output |indicates start of packet |

|input_eop |output | indicates end of packet |

|input_data |output | the data (which may be, or include control data) |

3 Avalon-ST Input FIFO

Another version of the input interface includes FIFOs.

4 Extending the interface definition

These Avalon-ST interfaces are provided as Masked Subsystems. As such the user can look at the internals and make edits if required. To look under the mask, right click on the block and select ‘Look Under Mask’. Under the mask the user will see a primitive subsystem. The user is free to edit this, but will first have to break the library link to do so. (Right click on the block and select ‘Link Options’ > ‘Disable Link’, then right click again and select ‘Link Options’ > ‘Break Link’). If editing the interface blocks, the user should not edit the ‘Mask Type’ field. This is used internally to identify the subsystems defining the interfaces.

1 Adding further ports to the Avalon-ST blocks

Further ports can be added in the Avalon-ST Masked Subsystems. Internally these would have to be connected up by the user; most likely in the same fashion as the existing signals – for example through FIFOs. If adding further input or output ports which are to be connected to the device-level ports, these should be ‘tagged’ with the role the port will take in the Avalon-ST interface. Simulink ports are tagged via the Block Properties, General tab for the individual port. (This is used internally to get the port role – valid, ready, endofpacket, startofpacket etc. – that the particular port will take in the interface). Other ports you may want to add are ‘error’ and ‘empty’, for example.

2 Adding custom text

Any text written to the Description field of the Masked Subsystem (Block Properties, General tab on the subsystem block itself) will be written verbatim - with no evaluation - into the hw.tcl file immediately after the standard parameters for the interface and before the port declarations. It is the user’s responsibility to get the text of any addition correct.

[pic]

Figure 27: Example Block Properties GUI for Avalon-ST Masked Subsystem showing Description field

[pic]

5 Restrictions on use

1 Intervening blocks

Although the Avalon-ST interface blocks can be put in different levels of hierarchy, no blocks – Simulink, ModelIP or primitive – should be placed between the interface and the device-level ports.

2 Interfaces with multiple data ports

The Avalon-ST specification only allows a single data port per interface. This means that adding further data ports, or even using a vector through the interface and device-level port (which creates multiple data ports) is not allowed.

To handle multiple data ports through a single Avalon-ST interface they must be packed together into a single (not vector or bus) signal, then unpacked on the other side of the interface. The maximum width for a data signal is 256 bits.

The packing and unpacking can be done with BitCombine and BitExtract blocks, see below for example.
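The packing arithmetic can be sketched in MATLAB as follows (what BitCombine and BitExtract do in the schematic; the values are illustrative):

```matlab
% Pack two 16-bit data words into one 32-bit signal for the interface,
% then unpack them on the far side.
a = uint32(4660);    % 0x1234
b = uint32(43981);   % 0xABCD

packed = bitor(bitshift(a, 16), b);        % a in the top half, b in the bottom

a2 = bitshift(packed, -16);                % recover a
b2 = bitand(packed, uint32(65535));        % recover b (mask the low 16 bits)
```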

3 Using ModelPrim subsystems

1 Interfaces as subsystem boundaries

Primitive subsystems allow users flexibility to build their own custom designs, while taking advantage of the optimizations applied by the tool. Optimizations operate hierarchically – that is each primitive subsystem or ModelIP block is optimized individually.

The boundary of a Primitive Subsystem is delineated by the primitive I/O blocks – ChannelIn, ChannelOut, GPIn or GPOut. A primitive subsystem should always include I/O blocks from this set.

A SynthesisInfo block, with style set to Scheduled should also be used at the same hierarchical level as these I/O blocks.

Further subsystems can be used within the Primitive subsystem. These are flattened and treated holistically with the primitive subsystem in the optimizations.

ModelIP blocks cannot be used inside primitive subsystems.

2 Interfaces as scheduling boundaries

The type of I/O boundary block used determines how the set of signals through it are scheduled during register pipeline insertion. I/O signals through the same I/O block (ChannelIn, ChannelOut, GPIn or GPOut) are scheduled together – i.e. they will remain in sync. Using ChannelIn and ChannelOut allows specification of the Advanced Blockset protocol valid and channel signals alongside the data. GPIn and GPOut are for other – general purpose – data which doesn’t necessarily have to be scheduled to start or end on the same clock cycle as other I/O signals.

So if you want all your signals to be pipelined such that inputs are all on the same clock cycle and outputs are all synchronized together on the same output clock cycle, use a single ChannelIn and a single ChannelOut block. This is the usual mode of operation. If your subsystem requires inputs appearing on different clock cycles, or outputs grouped on to different clock cycles, you can use multiple ChannelIns and ChannelOuts, or GPIns and GPOuts.

Note that to maintain cycle accuracy at a level outside the primitive subsystem, the pipelining inserted by the tool must be accounted for in simulation. This added latency is calculated by the scheduler and depends on factors such as vector widths, data types, and fmax requirements. So Simulink can only model this after the scheduler has run. Since each pipelining stage is added in slices, or cuts, through all parallel signals, this can be modeled by just adding a latency on the inputs or outputs.

Note that in some cases this accounting for the latency may give different results to hardware – in particular a) in the case of optimizing away parallel registering that just delays all signals by the same amount and b) in the case of multiple inputs and outputs where an input is scheduled after an output.

In these cases a warning is given:

“Warning: Negative IO correction on block //ChannelOut. Simulink will not match hardware.”

1 Optimizing away parallel registering

Suppose a primitive subsystem design had a line of registers on each input to output path (see the example below). These registers do not alter the function of what the primitive subsystem does – only alter when the output comes out – i.e. the latency.

Advanced Blockset seeks to optimize the hardware it produces, while minimizing the latency required to achieve the same functionality and still achieving the desired fmax. SampleDelays that specify relative differences in latency on paths are therefore important, but cuts of SampleDelays that specify identical latency on all paths are effectively ignored.

[pic]

Figure 29: Simple primitive subsystem with line of identical registers across all input to output paths

As such, the registers in the above design which delay all paths by 1 cycle would be optimized away – as they can be removed without changing the functionality of the subsystem (only its latency).

In this case the generated hardware contains no registers – and has zero latency – but the Simulink model of the subsystem as a whole will have a latency of 1 from the SampleDelays. The ChannelOut block cannot apply a negative latency correction to simulate the actual latency of the generated hardware, and displays the message:

“Warning: Negative IO correction on block //ChannelOut. Simulink will not match hardware.”

If you want the latency to be higher than the optimal (lowest latency) solution the way to do this is to use the constraints on the SynthesisInfo block:

[pic]

Here the latency for this subsystem is constrained to be greater than or equal to 1. Now on generation the vertical line of pipelining is preserved in hardware, such that simulation and hardware both have latency 1 and there is no error message. (Note: the SampleDelays aren’t needed at all in this case – it is sufficient to use the constraint alone to add latency).

Here is another example.

[pic]

Figure 30: Another simple primitive subsystem with line of identical registers across all input to output paths

Remember that adding vertical cuts of delays across all paths is not the recommended way of adding latency – if that is really what is desired. The way to do this would be to constrain the latency to the higher value using the SynthesisInfo block.

DSP Builder Advanced solves for the minimal pipelining delay required to attain the target fmax while maintaining the relative delay differences implied by the user-inserted SampleDelay blocks.

Adding vertical cuts of extra SampleDelays across all signals does not change the relative latency of the signals so does not alter the optimization problem, or the hardware that would be produced.

In the design shown there are no relative differences in the latencies between the paths (each has 10), so the optimization is free to remove them, and the hardware produced for this design will be the same as if no SampleDelays were specified. In this example, to be sure of attaining the desired fmax the multiplier needs a latency of 4 clock cycles (that is 4 pipelining registers), and this will be balanced by delays inserted on the other paths (the valid and the channel signal) to maintain no relative latency difference.

The final solution will therefore be that the HDL will have a multiplier with 4 pipelining stages, to meet timing on the data path, and 4 registers on each of the valid and channel paths, such that the outputs maintain their relative synchronization.

So to simulate this cycle accurately in Simulink the tool would have to delay the signal by a total of 4 clock cycles across the subsystem. However, the Simulink schematic design already has SampleDelays that are delaying the signal by 10 clock cycles – and the tool cannot correct for negative delays. It can’t reset these Sample Delays in simulation, or jump the simulation forward 6 cycles (applying a negative correction) in the ChannelOut to compensate.

What you will get now is zero latency correction applied in the ChannelOut block, and the warning given:

“Warning: Negative IO correction on block //ChannelOut. Simulink will not match hardware.”

The above designs break two ‘good design rules’ that should be followed when designing with the Advanced Blockset:

a) Don’t try to pipeline the design yourself – this is what the tool does for you using mathematical linear programming techniques.

b) Don’t try to synchronize the output of different parallel subsystems using explicit delays. Use FIFOs, as this will give a more device-portable, fmax target independent design.  

2 Multiple Scheduled Outputs

[pic]

Figure 31: Simple primitive subsystem with independently synchronized outputs

Consider the above model. Here there is a single set of inputs, scheduled to be input on the same clock cycle. Any pipelining inserted ensures these signals stay in sync. For the particular fmax chosen in this example the multipliers require a latency of 3 (that is 3 pipelining stages). Parallel signals are pipelined by the same amount to keep the signals in sync. Thus on ‘ChannelOut’ there are three cycles of latency added to all signals through this I/O block and scheduled to be output together. For the ‘ChannelOut1’ signals however, the latency of the path is 6 clock cycles (3 for each multiplier), so the output here has a latency of 6 – appearing three cycles after the outputs from ‘ChannelOut’.

The latencies are modelled by delaying the signals through ‘ChannelOut’ by 3 cycles, and through ‘ChannelOut1’ by 6 clock cycles.

3 Multiple Scheduled Inputs & Outputs

Things are a little more complex in the case of multiple inputs and multiple outputs. Models such as demo_back_pressure have more than one input block and more than one output block in the single primitive subsystem.  

Models with multiple input blocks imply that in hardware the data entering the system will not be synchronised between the different points of entry.

[pic]

Figure 32: Simple primitive design with multiple input groups and multiple output groups

Simulink is unaware of the 1 cycle latency needed by the adders. This has to be worked out in the scheduler and corrected for afterwards during simulation using simulation delays in the input/output blocks A, B, C, X and Y.

This model implies the following constraints:

A + X = 1

A + Y = 0

B + X = 2

B + Y = 1

C + X = 1

C + Y = 1
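The inconsistency of these six constraints can be checked directly (variables ordered [A B C X Y]; the first two equations give X − Y = 1 while the last two give X − Y = 0):

```matlab
% Coefficient matrix and right-hand side for the constraints above.
M = [1 0 0 1 0;    % A + X = 1
     1 0 0 0 1;    % A + Y = 0
     0 1 0 1 0;    % B + X = 2
     0 1 0 0 1;    % B + Y = 1
     0 0 1 1 0;    % C + X = 1
     0 0 1 0 1];   % C + Y = 1
b = [1; 0; 2; 1; 1; 1];

% The augmented matrix has higher rank than M, so no exact solution exists.
assert(rank([M b]) > rank(M));
```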

This system of equations has no solution! However, this situation never actually occurs because of the scheduling constraints that have been enforced over the primitive subsystem. The scheduler actually inserts an extra sample delay between Add2 and ChannelOut X which makes the whole system solvable. 

Note that there are some configurations of input and output blocks that, when accounting for latency at the inputs and outputs, would require a latency correction of negative size, implying that the first few samples must be discarded. Consider:

[pic]

Figure 34: example of a multiple I/O primitive subsystem where an input is scheduled after an output

 

Assume that multipliers need a delay 3 times that of adders. Any solution for latency correction then requires a negative-sized buffer on at least one of the input/output blocks. A bigger example of this is demo_back_pressure.

Discarding the first few samples is not possible in the discrete simulation that Simulink uses; however, it can still be done for the stimulus files. When this occurs, a warning message is given to make the user aware that their Simulink model will not behave exactly as the hardware does.

3 ModelPrim subsystem design styles to avoid

1 Primitive Subsystems with logic not driven from clocked inputs

This section describes some design styles to avoid. Usually this is because either the hardware behavior will depend on reset behavior, or the hardware will be inefficient.

[pic]

Figure 35: Blocks not driven from clocked blocks

In this model, the counter will start straight out of reset. Perhaps you don’t care about the specific phase of the repeating count with respect to your data, but even so, because of the out-of-reset start, the design simulation in Simulink may not match that of the generated hardware. A better option here would be to start the counter from the valid signal, rather than from the constant. If the counter is to repeat without stopping after the first valid, this can be achieved by adding a zero-latency latch from the Additional Libraries > Control library into this connection.

Similarly, loops driven without clocked inputs should be avoided for the same reason.

4 Common Problems

1 Timed Feedback Loops

Care also has to be taken with feedback loops generally, in particular in providing sufficient delay around the loop.

[pic]

In this model, there is a cycle containing two adders with only a single sample delay, which is not sufficient. In automatically pipelining designs, a schedule of signals through the design is created. From internal timing models, the tool knows how fast certain components, such as wide adders, can be run – or rather, how many pipelining stages they require in order to run at a chosen clock frequency. The tool must account for the required pipelining while not changing the order of the schedule. The single sample delay is not enough to pipeline the path through the two adders at the chosen clock frequency. The tool is also not free to insert more pipelining here, as this would change the algorithm: accumulating every n cycles, rather than every cycle. The scheduler detects this and gives an appropriate error indicating how much more latency would be required in the loop for it to run at the chosen clock rate. In designs with multiple loops this error may be hit several times in a row as each loop is balanced and resolved.

2 Loops, clock cycles and data cycles

It is important not to confuse clock cycles and data cycles - particularly in relation to feedback loops where, for example, you may want to accumulate with previous data from the same channel. The Multi-Channel IIR Filter design (demo_iir) shows an example of feedback accumulators processing multiple channels. Note that in this example each consecutive data sample on any particular channel is 20 clock cycles apart. This number is derived from clock rate / sample rate.

Supposing we only had one channel, at a low data rate. This is explored in the Folded IIR Filter Demonstration (demo_iir_fold2)

This model implements a single-channel infinite impulse response (IIR) filter with a subsystem built from primitive blocks folded down to a serial implementation.

The design of the IIR is identical to that in the multi-channel example, demo_iir. Note that as the channel count is 1, the lumped delays in the feedback loops are all one. This would present a scheduling problem if running at full speed, i.e. with new data arriving every clock cycle, as the lumped delay of one cycle would not be enough to allow for pipelining round the loops. However, here the data is arriving at a much slower rate than the clock rate, in this example 32 times slower (the clock rate in the design is 320 MHz, and the sample rate is 10MHz) - giving us 32 clock cycles between each sample.

One way to design for this would be to set the lumped delays to 32 cycles long - the gap between successive data samples; but this would obviously be very inefficient both in terms of register use and in underutilized multipliers and adders. Instead, we use folding to schedule the data through a minimum set of fully utilized hardware.

Set the SampleRate on both the ChannelIn and ChannelOut blocks to 10MHz. This informs the synthesis for the Primitive Subsystem of the schedule of data through the design – that even though the clock rate is 320MHz, each data sample per channel is arriving only at 10MHz. The produced RTL is folded down – in terms of multiplier use – at the expense of extra logic for signal muxing and extra latency.
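The fold factor follows directly from the two rates. A trivial check (Python, purely illustrative; the rates are those quoted for the demo designs above):

```python
# Clock cycles available between successive samples of one channel.
def cycles_per_sample(clock_rate_hz, sample_rate_hz):
    return int(clock_rate_hz // sample_rate_hz)

# demo_iir_fold2: 320 MHz clock, 10 MHz sample rate.
print(cycles_per_sample(320e6, 10e6))  # 32 cycles: each multiplier can be
                                       # time-shared across up to 32 operations
```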

5 ModelPrim Blocks outside primitive subsystems

ModelPrim blocks can also be used outside of primitive subsystems (i.e. outside subsystems delineated with ModelPrim I/O blocks), but these will not be scheduled or pipelined for fmax. A common use is for constants: inside the synthesizable part of the design, a ModelPrim constant block should always be used in preference to a Simulink constant block.

As inside primitive subsystems, logic that depends on initial out-of-reset behavior for synchronization should be avoided. For example:

The logic here driving the sample delay was intended to produce a single pulse after reset. A better solution is to use a 1-cycle latch on the valid signal.

6 Convert blocks vs. specifying output types via dialog

There are two ways to change data type with Advanced Blockset primitives:

1. preserving real world value (Convert block)

2. preserving bit pattern (set ‘Output data type mode’ on any other primitive)

The Convert block converts a data type preserving the real world value – optionally rounding and saturating where this is not possible. Convert blocks can therefore sign extend or discard bits as necessary. For example, the following Convert block will discard 11 LSBs and sign extend the MSBs by 27 bits while preserving the real world value, as far as possible.
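The distinction between the two kinds of type change can be sketched in a few lines of Python. This is a simplified model of fixed-point storage – the function names are invented for illustration, and the rounding-mode and saturation options of the real blocks are not modelled:

```python
# An sfix value with f fractional bits is stored as round(value * 2**f).

def to_bits(value, frac_bits):
    """Real-world value -> raw stored integer (no saturation modelled)."""
    return int(round(value * 2**frac_bits))

def from_bits(bits, frac_bits):
    """Raw stored integer -> real-world value."""
    return bits / 2**frac_bits

raw = to_bits(3.1415926, 16)          # stored with 16 fractional bits

# Convert block: preserve the real-world value while dropping 11 LSBs
# (16 -> 5 fractional bits); some truncation occurs.
converted = from_bits(to_bits(from_bits(raw, 16), 5), 5)
print(converted)      # 3.15625

# Output-type-via-dialog: same bit pattern, binary point reinterpreted;
# the real-world value changes by a power of two.
reinterpreted = from_bits(raw, 5)
print(reinterpreted)  # raw / 32: a very different value
```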

Here for example you can see that some truncation has occurred.

Similarly you can convert the same number of bits while preserving the real world value (as far as possible, subject to rounding and saturation) (see below).

We can contrast that with setting the type using the “Specify via dialog” option on any other primitive. We can do this without any generated hardware by using a zero-length Sample Delay for example.

WARNING: If you want to reinterpret the bit pattern and also discard bits, note that if the type specified via dialog in the Output data type mode is smaller than the natural (inherited) output type, MSBs (most significant bits) will be discarded. In the example below the output type is set to ufix14_En1 and the top two MSBs are discarded, giving a very different result.

Users should NOT set the type via dialog to be bigger than the natural (inherited) bit pattern: no zero-padding or sign extension will be done, and the result may generate hardware errors due to signal width mismatches. Any sign extension or zero padding should always be done with a Convert primitive block.

Often you may want to do both – sign extend or zero pad, then reinterpret the bit pattern (or vice-versa), in which case you can combine these methods.

In some instances all that may be desired is to set a specific format so that types can be resolved; in feedback loops, for example. This is where setting a type via a dialog on an existing primitive, or inserting a zero-cycle Sample Delay with the type specified, is useful (we choose a zero-cycle delay as this generates no hardware and just casts the type interpretation).

Note that in some cases you may just want to ensure the data-type is equal to some other signal’s data-type. In such cases you can force the data-type propagation using a Simulink data-type propagation block. An example of this is in the Latch masked subsystems from the Control library covered above.

Debugging Designs

(to be added)

Floating Point

The Advanced Blockset ModelPrim blocks now support floating point ‘single’ and ‘double’ data types. The tool generates a parallel data-path optimized for Altera FPGAs from the Simulink model.

In many cases, a 50% reduction in logic resources and a 50% reduction in latency are possible compared with using discrete IEEE754 operators. The Advanced Blockset achieves these improvements by optimizing over the entire data-path, considering the sequence of operations. By using the hard logic resources (DSP Blocks) effectively, and by grouping functions in the data-path, many steps in IEEE754 implementations effectively become redundant.

The input and output values to and from the data-path are IEEE754 compliant floating point numbers, but a different format is used internally. There will likely be some small differences between the output generated by the data-path and the Simulink simulation of the input file. As the Advanced Blockset generated data-path generally uses greater mantissa and exponent precision than IEEE754, many of these differences arise because floating point operations are non-associative.
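Non-associativity is easy to demonstrate with ordinary single precision arithmetic (a Python/numpy sketch, purely illustrative):

```python
import numpy as np

# Floating point addition is not associative: the grouping the data-path
# chooses can change the result.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (a + b) + c    # = 0 + 1
right = a + (b + c)   # b + c rounds back to -1e8 at 24-bit precision
print(left, right)    # 1.0 0.0
```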

1 Support Outline

1 Blocks

The Advanced Blockset now supports the single and double precision floating point data types. This section details the initial limitations of the support. First note that only primitives support floating point; ModelIP blocks such as FIR, NCO and CIC do not yet.

1 Support in existing ModelPrim blocks

Most of the existing primitive blocks, as well as the I/O blocks, now support floating point; these are shown below.

[pic]

Single and double can be found as selections in the data type field.

[pic]

Figure 42: Output data type selection UI - showing single and double as options

Note that in most cases, the output data type is used to fix the type, rather than convert.

2 Conversion between floating and fixed

For conversion to and from floating point, ReinterpretCast, Convert or BitExtract can be used. For example, floating point format numbers can be converted to a flat 32-bit representation using BitExtract for transmission through to a higher level DSP Builder Standard design. ReinterpretCast generates no hardware – it just changes how the bit pattern is interpreted and propagated by the tool.

[pic] [pic]

3 New floating point ModelPrim blocks

In addition, new floating point blocks have been added to the primitive library to support common math functions. Most of these have multiplier-based implementations and are typically about 3 to 4 times the size of a corresponding floating point multiplier.

Further details of each block can be found in the help for the specific block.

[pic]

Figure 43: New floating point primitive blocks in 10.1 Advanced Blockset

Note that in the first release the trigonometric functions support single precision only, and that none of these blocks support fixed point. If desired, they can be used in otherwise fixed point designs by converting to and from floating point on either side of the block.

2 Interaction with other features

1 Folding

Currently the folding feature is not enabled for use with floating point blocks.

2 Pipelining flexibility within floating point operations

Currently some of the floating point functions, such as the trigonometric functions, are of fixed latency. As such the depth of pipelining within these does not vary with target fmax. These functions are targeted at high clock rates. Flexible pipelining control within floating point operations will be supported in a future release.

3 Accuracy, Testing & Automatic Test-benches

The Advanced Blockset uses IEEE floating point format at the inputs and outputs. Simulation is handled on the primitive blocks themselves using Matlab single and double precision arithmetic. Internally – within the hardware generated for the data-path – more bits of precision are used. It is therefore possible that the hardware result, as seen when running a hardware simulation, may occasionally differ in the least significant bit from the Simulink simulation.

In the automatic test-benches therefore, we compare the numeric results to within a tolerance.

1 Understanding arithmetic accuracy

Note that while only a difference in the least significant bit would normally be expected, because floating point arithmetic is non-associative it is possible to get larger differences.

With floating point arithmetic, algorithms that are iterative and have large dynamic range are now implementable. In such algorithms it is possible that the designs themselves may be ill-conditioned, that is, sensitive to very small errors or differences.

For example consider the problem of QR decomposition and back/forward substitution using an ill-conditioned matrix. Matlab functions are available for checking for such cases. For example, cond() gives the condition number of a matrix, which measures the sensitivity of the solution of a system of linear equations to errors in the data. This gives an indication of the accuracy of the results from matrix inversion and the linear equation solution, with condition values near 1 indicating a well-conditioned matrix.
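The same check can be made outside Matlab; for example with numpy, using small illustrative matrices:

```python
import numpy as np

# cond() measures the sensitivity of a linear solve to errors in the data.
well = np.array([[2.0, 0.0],
                 [0.0, 1.0]])         # singular values 2 and 1
ill = np.array([[1.0, 1.0],
                [1.0, 1.0 + 1e-10]])  # rows nearly parallel

print(np.linalg.cond(well))  # 2.0: well-conditioned
print(np.linalg.cond(ill))   # ~4e10: tiny input errors are hugely amplified
```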

For single precision, the HDL internal floating-point representation (which uses a 32-bit mantissa of which 26 bits are in use most of the time) is compared to Simulink single precision (24-bit mantissa, counting the sign bit).

At each individual step, it can be confirmed that the floating point additions and subtractions are being performed correctly, and that the differences are no larger than what one would expect.

Relatively large differences can still occur when subtracting numbers that are very close in value (i.e. such that after alignment of mantissas to equalize the exponent, the subtraction would zero out the first 6 to 16 most significant bits). Here we may introduce a deviation between the output of the Simulink model and the generated HDL, largely due to numerical round-off.
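This cancellation effect can be reproduced directly (Python/numpy sketch, purely illustrative):

```python
import numpy as np

# Subtracting nearly equal single precision values: the leading mantissa
# bits cancel, so mostly round-off remains in the result.
x = np.float32(1.0000001)   # rounds to the nearest 24-bit mantissa value
y = np.float32(1.0)
diff = float(x - y)         # ~1.19e-07
exact = 1.0000001 - 1.0     # ~1.00e-07 in double precision
rel_err = abs(diff - exact) / exact
print(rel_err)              # large relative error from a single subtract
```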

The measuring of results against those produced using IEEE single precision computation has to be understood in terms of this accuracy, not in terms of absolute error. Given the generally higher precision of the internal floating point format used by the generated HDL, it could be that the Simulink single precision answer is the more "wrong" in such cases – but the reason for potential differences should be understood.

Such numeric differences can be exacerbated by an ill-posed problem – for example by the ill-conditioning of the matrix used in forward/backward substitution. Here such differences can be compounded through the iterations. Typically the way to address this ill-conditioning is to improve it via pivoting at the QR decomposition stage, which involves reordering matrix columns.

Users designing floating point algorithms should therefore understand concepts such as ill-conditioning and use Matlab features such as cond() to check their design and in the analysis and understanding of results.

4 Device Support

While all devices are supported, the hardware generated is currently most optimized for Stratix III, IV and V. Future releases will also optimize for other device families.

2 Floating Point Format

The internal word formats are important to understanding the generated hardware, should you need to debug it. The word formats are different during addition and subtraction, multiplication, division, and functions. Cast blocks are automatically inserted by the tool to convert from one format to another.

In the case of single precision, the internal mantissa is 32 bits wide including 1 sign bit, and the exponent is 10 bits.

1 Single Precision Word Formats

Internally a number of extended floating point formats are used across different floating point operations.

1 IEEE 754

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−149 ≈ 1.4 × 10^−45 |2^−126 ≈ 1.18 × 10^−38 |(2 − 2^−23) × 2^127 ≈ 3.4 × 10^38 |

In IEEE754 format, the sign bit is in the most significant bit, followed by an 8 bit exponent, followed by the 23 bit fractional part of the mantissa.
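The field layout can be inspected by unpacking the raw bits of a single precision value (Python sketch using only the standard library, purely illustrative):

```python
import struct

def fields_single(x):
    """Unpack an IEEE754 single into (sign, exponent, fraction) bit-fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31                # 1 sign bit (MSB)
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent (bias 127)
    fraction = bits & 0x7FFFFF       # 23-bit fractional part of the mantissa
    return sign, exponent, fraction

print(fields_single(1.0))    # (0, 127, 0)
print(fields_single(-0.5))   # (1, 126, 0): -1.0 x 2^-1
```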

2 Internal Single Precision Floating Point Number

In addition to IEEE754 used at the subsystem boundaries and memories, there are two internal single precision formats; a signed one for addition and subtraction, and another unsigned for multiplication and division. Both formats have a 32 bit mantissa followed by the 10 bit exponent.

Signed Single Precision Format - Addition and subtraction

[pic]

Unsigned Single Precision Format - Multiplication and division

[pic]

Also there are 3 flag bits for Saturation (Inf), Zero, and ‘Not a Number’ (NaN).

1 Addition and Subtraction Format

For addition and subtraction operations, the format upon conversion from IEEE754 single precision is:

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−536 ≈ 4.4 × 10^−162 |2^−510 ≈ 2.98 × 10^−154 |(32 − 2^−26) × 2^511 ≈ 2.1 × 10^155 |

The format is just fixed point, plus an exponent. Conversion from IEEE is then easy – just pad with sign and zeros. Conversion from this format back to IEEE is harder, requiring detection of sign, use of absolute values, counting leading zeros, shifting etc. This is all done internally by the tool in the generated hardware.

Adding numbers together is also simple, with word growth into the overflow bits.

These four overflow bits (one is sign) allow for 16 un-normalized additions to feed into a single node without overflow. Underflow may happen more quickly, due to bit cancellation, but the effects of underflow are reduced by normalizing more often where necessary, again handled in the generated hardware.

2 Multiplication and division format

The multiplier has a slightly different input number format. A fully normalized multiplier input format for the 32 bit mantissa is a signed number:

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−540 ≈ 2.8 × 10^−163 |2^−510 ≈ 2.98 × 10^−154 |(2 − 2^−30) × 2^511 ≈ 1.34 × 10^154 |

The multiplier input is always normalized to prevent overflow. If there is significant underflow in the part of the data-path feeding the multiplier, the number could be very small. If the other number is very small as well, the multiplier could produce a zero output, as the new mantissa will be expected in the top half of the multiplier output.

In the internal format, the sign bit is part of the mantissa. The mantissa is a 32 or 36 bit signed number, with the entire mantissa (including the implied ‘1’) rather than just the fractional part. The exponent follows the mantissa.

In addition, two bits are always associated with every internal floating point number; a saturation signaling bit and a zero signaling bit. Rather than calculating an infinity or zero condition at every operation, the functions forward saturation and zero conditions detected at the input of the data-path. These are then combined with the conversion (cast) back to IEEE754 at the output of the data-path to determine special conditions.

2 Double Precision Word Formats

Generally, the double precision word formats are analogous to the single precision word formats.

1 IEEE

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−1074 ≈ 4.9 × 10^−324 |2^−1022 ≈ 2.2 × 10^−308 |(2 − 2^−52) × 2^1023 ≈ 1.8 × 10^308 |

In IEEE754 format, the sign bit is in the most significant bit, followed by an 11 bit exponent, followed by the 52 bit fractional part of the mantissa.

2 Internal Double Precision Floating Point Number

In addition to IEEE754 used at the subsystem boundaries and memories, there are two internal double precision formats: a signed one for addition and subtraction, and an unsigned one for multiplication and division. In the signed format, the 64-bit signed mantissa is followed by the 13-bit exponent, while in the unsigned format the 54-bit mantissa is followed by the 13-bit exponent.

Signed Double Precision Format - Addition and subtraction

[pic]

Unsigned Double Precision Format - Multiplication and division

[pic]

The saturation and zero signaling bits operate identically to the single precision case.

Also there are 3 flag bits for Saturation (Inf), Zero, and ‘Not a Number’ (NaN).

1 Addition and Subtraction Mantissa

A signed 64 bit mantissa is used internally. The mantissa from the IEEE format becomes part of the sfix64_En58 signed fractional number -

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−4152 ≈ 1.3 × 10^−1250 |2^−4094 ≈ 3.8 × 10^−1233 |(32 − 2^−58) × 2^4095 ≈ 1.7 × 10^1234 |

As with the single precision mantissa, there are four overflow bits (i.e. 4 additional integer bits compared to IEEE) so that 16 additions can feed into any node without overflow. There are six underflow (guard) bits.

2 Multiplication, Division and Function Mantissas

The multiplier and divider have the same format, which is different from the signed mantissa.

[pic]

|minimum positive (subnormal) value |minimum positive normal value |maximum representable value |

|2^−4146 ≈ 8.5 × 10^−1249 |2^−4094 ≈ 3.8 × 10^−1233 |(32 − 2^−52) × 2^4095 ≈ 1.67 × 10^1234 |

The sign bit is packed with the mantissa, but the multiplication or division operation is performed on an unsigned 54 bit mantissa. As with single precision, the function library mantissa is the same as the division mantissa, except that some functions only have a valid positive output.

The mantissa is 54 bits wide, consisting of a leading “01” and a 52-bit fractional part. The exponent is 13 bits wide and signed. As with the single precision internal format, the additional width is used for local overflow and underflow: the exponent can exceed 2046 locally and be less than 0 locally before normalization. As with the IEEE754 format, the exponent is offset: a value of 1023 denotes 1 (2^0), and 0 denotes 2^−1023. De-normalized numbers are not supported, but a node whose value is temporarily less than 2^−1023 can be accommodated if it increases to 2^−1022 before the next conversion to an IEEE754 number (i.e. an output).

3 Floating Point Type propagation

Cast blocks are automatically inserted to convert between formats, optimal for the type of operation. Below is an example:

• Input and output are always IEEE 754 format, single or double.

• The IEEE format is propagated through to memory. Note that memory always stores data in IEEE format, even in feedback loops.

• Multipliers can take IEEE format, and can generate multiplier format. In this example the 2nd multiplier produces add-format.

• The adder needs add-format, and generates add-format too.

• The output is IEEE, so a cast operation is inserted just before the output.

3 Special considerations when using floating point

Algorithms are often hand folded down to reduce the total resources used, while maintaining the required data throughput.

For example, most folded algorithm implementations assume single-cycle accumulators, which permit partial calculations to be performed in adjacent clock cycles and allow the control to be written in a natural way.

However, to meet a high fmax, floating point accumulators need a latency of at least 6 cycles for single precision and 10 cycles for double precision. This requires a rethink of how such algorithms should be implemented.

A delay-line adder-tree is a typical structure in DSP designs. But with the latency required for floating point, this could be quite sizable in resources, and would also add latency to the overall calculation. If calculations are performed ‘out-of-order’, however, we can often build a more hardware-efficient implementation, at the expense of thinking carefully about the control.

The goal when designing with floating point in the Advanced Blockset is to build simple designs that are still efficient. The following sections describe a set of structures that can be used for efficient floating point design, and algorithmic transformations to build them automatically. It covers:

■ Using FIFO-based flow control to eliminate the need for state-machines

■ Data-flow structures for processing iterative algorithms

■ Latency insensitive implementation

These techniques apply to simple designs, as well as to more complex linear algebra functions such as Cholesky and QR Decomposition. They may also be applied to fixed-point designs.

1 Flow Control, latency hiding and avoiding data dependencies

FIFOs are used to provide self-timed control. Rather than relying on cycle-counting or on state-machines, FIFOs offer simple controlled access to memories. The aim is to have the floating point arithmetic running as fast as it can, but rather than issuing a command to ‘start processing’ and then waiting out the latency of the calculation before being able to use the result, the arithmetic unit runs continually in advance. Results are continually pushed onto the back of the FIFO queue and pulled from the front by the downstream process. If this queue becomes too big (the FIFO is getting full), this condition is fed back to stall the processing for a while, so that we don’t lose any results, while still being able to store any results currently in mid-calculation.
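The scheme can be sketched in software. In this Python sketch all names, depths and thresholds are illustrative, not part of the tool; the point is that the almost-full flag is asserted early enough to absorb results already in flight:

```python
from collections import deque

PIPE_LATENCY = 4          # cycles a result spends in flight (illustrative)

class FlowFifo:
    def __init__(self, depth):
        self.q = deque()
        self.depth = depth
        # Assert 'stall' while there is still room for everything
        # currently mid-calculation in the upstream pipeline.
        self.almost_full_at = depth - PIPE_LATENCY

    @property
    def stall(self):
        return len(self.q) >= self.almost_full_at

    def push(self, item):
        assert len(self.q) < self.depth, "overflow: stall asserted too late"
        self.q.append(item)

    def pop(self):
        return self.q.popleft() if self.q else None

fifo = FlowFifo(depth=16)
in_flight = []            # results travelling through the pipeline
for cycle in range(100):
    if not fifo.stall:                   # producer issues only when not stalled
        in_flight.append(cycle)
    if len(in_flight) > PIPE_LATENCY:    # result lands after the latency
        fifo.push(in_flight.pop(0))
    if cycle % 2 == 0:                   # slow downstream consumer
        fifo.pop()

print(len(fifo.q))  # settles near the almost-full threshold; nothing is lost
```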

An example of this can be seen in the Mandelbrot demonstration design, where such units are used together.

Note that pipelining cannot add extra latency around loops – it can only balance and redistribute existing algorithmic latency. Therefore, although we do not particularly care about the latency round the loop, we have to specify sufficient delay round it in the design so that the pipelining solver can redistribute it to meet timing without needing to add further delay. In the Mandelbrot example, this is seen as the ‘loop slack’ sample delays in each loop.

1 Example: Floating Point Mandelbrot Set calculation

This example plots the Mandelbrot set for a defined region of the complex plane.

A complex number c is in the Mandelbrot set if

zₙ₊₁ = zₙ² + c

remains bounded. That is, if the value remains finite when repeatedly squared and added to the original number. Further, we can shade values of c depending on the speed of divergence.
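The per-point computation the hardware loop performs can be sketched as an escape-time iteration (Python, purely illustrative; the iteration limit and divergence bound are arbitrary choices):

```python
# Escape-time iteration for one point c of the complex plane - the same
# square-and-add recurrence described above.
def mandelbrot_iterations(c, max_iter=64, bound=2.0):
    z = 0j
    for n in range(max_iter):
        z = z * z + c              # the z(n+1) = z(n)^2 + c step
        if abs(z) > bound:         # diverged: shade by the escape speed n
            return n
    return max_iter                # still bounded: c is taken to be in the set

print(mandelbrot_iterations(0j))      # 64: the origin never diverges
print(mandelbrot_iterations(1 + 1j))  # 1: escapes on the second step
```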

[pic]

Single precision floating point complex numbers are used.

One thing to note is that the latency of the system is longer when performing floating point calculations than it would be for the corresponding fixed point calculations. You can’t afford, therefore, to wait around for partial results to be ready if you want to achieve maximum efficiency. Instead, you must design to keep the floating point math calculation engines of your algorithm busy and fully utilized. In the summary below you can see there are two floating point math subsystems: one for scaling and offsetting pixel indices to give a point in the complex plane, and the other to do the main square-and-add iteration.

For this simple design, the total latency is approximately 25 clock cycles - depending on target device and clock speed – not excessive; but long enough that it would be very inefficient to wait around for partial results.

Instead, the circulation of data through the iterative process is controlled by FIFOs. The FIFOs ensure that if a partial result is available for a further iteration of the zₙ₊₁ = zₙ² + c progression, then that point is worked on; otherwise a new point (a new value of c) is started. Thus a full flow of data is maintained through the floating point arithmetic. This main iteration loop can exert back-pressure on the new point calculation engine: if new points are not being read off the ‘CommandQueue’ FIFOs quickly enough, such that they fill up, the loop iterating over points is stalled. In this way we neither explicitly signal the calculation of each point when it is required (and then pay the penalty of waiting through the latency cycles before we can use it), nor attempt to calculate this latency exactly in clock cycles and issue ‘generate point’ commands the exact number of clock cycles before each point is needed – which would take two compiles to do, and would have to be changed each time we re-targeted the device or changed the target clock rate. Instead we calculate the points as fast as we can from the start and catch them in a FIFO; only if the FIFO starts to get full do we act on this – a sufficient number of cycles ahead of being full that we can stop the calculation upstream without loss of data. This is a self-regulating flow that mitigates latency while remaining flexible.

Not designing the algorithm implementation around the latency and availability of partial results would lead to significant inefficiencies: if you’re not careful, data dependencies can stall the processing.

There are several other things of note in this design.

1. The 'FinishedThisPoint' signal is used as the valid. Thus although the system constantly produces data on the output, only when we have finished a point do we mark the data as valid. Downstream components can then process just the valid data – just as the enabled subsystem in the design test-bench captures and plots the valid points.

2. In both feedback loops, we need to allow sufficient delay for the scheduler to redistribute as pipelining. In feed-forward paths, pipelining can be added without changing the algorithm itself – just its timing. But in feedback loops, insertion of delay can alter the meaning of an algorithm. (Think, for example, of adding N cycles of delay to an accumulator loop – it would then increment N different numbers, each incrementing every N clock cycles.) So in loops we have to give the scheduler in charge of pipelining for timing closure enough ‘slack’ in the loop to be able to redistribute this delay to meet timing, while not changing the total latency round the loop, thus ensuring the function of the algorithm is unaltered. Such ‘slack’ delays can be seen in the top level of the synthesizable design in the feedback loop controlling the generation of new points, and in the FeedBackFIFO subsystem controlling the main iteration calculation.

These slack delays are set to the minimum possible delay that satisfies the tool’s scheduling solver using the Minimum Delay feature on the SampleDelays.

[pic]

The Sample Delay is set to the minimum latency that satisfies the schedule, which is solved as part of the integer linear programming problem used to find an optimum pipelining and scheduling solution for the design.

Delays can be grouped into numbered ‘Equivalence Groups’ to match other delays. In the Mandelbrot_S example, the single delay around the coordinate generation loop is in one equivalence group, and all the slack delays round the main calculation loop are in another equivalence group. The equivalence group field allows any Matlab expression that evaluates to a string.

The actual delay used is displayed on the SampleDelay block.

3. The FIFOs operate in showahead mode – that is, they display the next value to be read. The 'read' signal is a read acknowledgement – i.e. a signal to say 'I've read the output value, you can now discard it and show me the next'. Also note that multiple FIFOs are used with the same control, so they will be FULL and present valid output at the same time. Thus we only need the output control signals from one of the FIFOs and can ignore the corresponding signals from the other FIFOs.

4. As floating point simulation is not bit-accurate to the hardware, it could be that some points in the complex plane take fewer or more iterations to complete in hardware compared to the Simulink simulation. This means that the results – when we have decided we are finished with a particular point – may come out in a different order. We therefore have to build a test-bench mechanism that is robust to this. To do this we use the test-bench override feature detailed in the appendix. We set the condition on mismatches to ‘Warning’ and use the Run All Testbenches block to set an import variable – to bring the ModelSim results back into Matlab – and a custom verification function which is responsible for setting the pass/fail criteria. The example script for Mandelbrot_S is also given in the appendix.

2 Floating Point Matrix Multiply Example

For a matrix multiplication we need to compute a row × column dot product for each output element: each element in the red row in A is multiplied by the corresponding element in the red column in B to produce the red result element in AB. Here, for 8x8 matrices A and B:

The naive approach would be to accumulate adjacent partial results, or build an adder tree, without consideration of latency. However, suppose we want to implement this using a smaller dot product, folding to use a smaller number of multipliers rather than doing everything in parallel. We would do this by splitting the loop over k into smaller chunks, as in the example below. We then need to accumulate the red and blue partial products; we can re-order the calculations to avoid adjacent accumulations.
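The loop splitting can be sketched in Matlab. This is an illustrative model only, not generated code; the names N and FOLD and the fold factor of 4 are assumptions:

```matlab
% Illustrative sketch: folding the dot product
%   C(i,j) = sum_k A(i,k)*B(k,j)
% into chunks of FOLD multipliers, accumulating one partial
% product per chunk instead of using a full parallel adder tree.
N = 8; FOLD = 4;                   % assumed matrix size and fold factor
A = rand(N); B = rand(N);
C = zeros(N);
for i = 1:N
    for j = 1:N
        acc = 0;
        for k0 = 1:FOLD:N          % loop over chunks of k
            partial = 0;
            for k = k0:k0+FOLD-1   % FOLD multipliers used per pass
                partial = partial + A(i,k)*B(k,j);
            end
            acc = acc + partial;   % accumulate the partial products
        end
        C(i,j) = acc;
    end
end
```

In hardware, the partial accumulations for different output elements can be interleaved so that no two adjacent additions target the same accumulator, hiding the floating point adder latency.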

A traditional implementation of a matrix multiply design would be structured around a delay line and an adder tree.

A11B11 + A12B21 + A13B31 + A14B41 + …..

• The delay-line length and adder-tree size grow with the folding size (typically 8-12)

• This implies an adder tree of 7-10 adders that are only used once every O(10) cycles

• Each matrix size needs a different delay-line length, so the design must provision for the worst case

[pic]

A better implementation is to use FIFOs to provide self-timed control. Here, new data is accumulated when both FIFOs have data. The advantages are that the design:

• Runs ‘as fast as it can’

• Is not sensitive to latency of dot-product on devices/fmaxes

• Is not sensitive to matrix-size (hardware just stalls for small N)

• Can respond to back-pressure, which stops the FIFOs from emptying, with FIFO-full feedback to the control block (not shown)

[pic]
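The self-timed behaviour can be sketched in Matlab. This is an illustrative model only; the variable names and arrival patterns are our assumptions:

```matlab
% Illustrative sketch: self-timed addition using two FIFOs.
% Partial products arrive in each FIFO at different times; the
% addition fires only on cycles where both FIFOs have data,
% otherwise the hardware simply stalls.
fifoA = {}; fifoB = {}; out = [];
arriveA = [1 0 1 1 0 1];           % assumed arrival patterns
arriveB = [0 1 1 0 1 1];
for cycle = 1:numel(arriveA)
    if arriveA(cycle), fifoA{end+1} = rand; end
    if arriveB(cycle), fifoB{end+1} = rand; end
    if ~isempty(fifoA) && ~isempty(fifoB)
        out(end+1) = fifoA{1} + fifoB{1};   % add when both ready
        fifoA(1) = []; fifoB(1) = [];       % read-acknowledge both
    end
end
```

Note that no fixed schedule is assumed: the addition rate adapts automatically to however fast data arrives in the slower FIFO.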

Appendix: Generated Test-benches

The Automatic TestBench (ATB) for an entity under test foo consists of:

• foo.vhd – this is the HDL that is generated as part of the design (regardless of ATBs)

• foo_stm.vhd – this is an HDL file that reads in data files of captured Simulink simulation inputs and outputs on foo

• foo_atb.vhd – this is a wrapper HDL file that declares foo_stm and foo as components, wires the input stimuli read by foo_stm to the inputs of foo, and feeds the output stimuli and the outputs of foo to a validation process. This process checks that the captured Simulink data and channel match the VHDL simulation of foo for all cycles where valid is high, and that the valid signals match.

• /.stm – this is the captured Simulink data, written by the ChannelIn, ChannelOut, GPIn, GPout and ModelIP blocks. Each block writes out a single stimulus file capturing all the signals through it, written in columns as doubles with one row for each timestep.

The device-level testbenches make use of these same stimulus files, following connections from device-level ports to where the signals are captured. Device-level testbenches are therefore restricted to cases where the device-level ports are simply connected to stimulus-capturing blocks. The picture below shows how these components are used to build a testbench around the generated HDL code.

Appendix: Overriding Test-benches in Matlab

Override Verification Feature Overview

This feature allows the ModelSim simulation output to be imported into Matlab for verification and subsequent processing as required by the application. This offers the user complete freedom over what verification and post-simulation processing is applied, setting new pass/fail criteria for designs. The Matlab verification functions that a user might create here are expected to be very specific to the application domain.

This feature is useful in verifying designs that are not expected to be bit accurate or cycle accurate. Example designs include those using DSPBA’s floating point system.

Default Verification

The current automated test bench (ATB) generated by DSPBA consists of a hybrid of Tcl and VHDL. It applies one of three checks to each output signal:

• For traditional fixed point data, the value must be an exact match with the stimulus files produced by the Simulink simulation

• For floating point data-types, a relative error threshold is applied (currently set to 0.1% for single precision, 0.0001% for double precision)

• For fixed point signals in a model that also uses floating point, a fuzzy comparison is made using a threshold equal to the sum of the two least significant bits (e.g. for integer data, 4 and 7 are considered equal but not 4 and 8; for sfix8_En3, 3.125 and 2.750 are considered equal but not 2.5 and 3.0)

These comparisons are made independently for each signal. The ATB checks the real and imaginary parts of a complex number separately, and vectors as individual components. This limits the utility of the ATB in applications where vector and complex outputs are used.
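The fuzzy fixed point comparison can be sketched in Matlab. The threshold arithmetic below follows the sfix8_En3 example above; the helper name isEqual is ours, not part of the ATB:

```matlab
% Illustrative sketch of the fuzzy fixed point comparison.
% For sfix8_En3 the LSB is 2^-3 = 0.125, so the threshold is the
% sum of the two least significant bits: 0.125 + 0.25 = 0.375.
lsb = 2^-3;
tol = lsb + 2*lsb;                    % 0.375
isEqual = @(a, b) abs(a - b) <= tol;
isEqual(3.125, 2.750)                 % true:  difference is 0.375
isEqual(2.5, 3.0)                     % false: difference is 0.5
```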

How To Use

To improve the flexibility of the ATBs, a new experimental feature allows users to verify their ModelSim simulation using a custom Matlab function. To do this, the ModelSim output has to be written to a file that can be imported into Matlab after the vsim process completes. The following steps enable this feature:

1. If the model does not already contain a [Run All testbenches] block, add one from Additional Libraries > Beta Utilities. Double click on it to reveal the following dialogue window:

[pic]

2. Check “Export device output from test-bench” and then enter a variable name in the "Import device output to variable" field. The other fields can be left empty. Click Close to apply and close the dialogue.

3. Double click the Control block and make sure that both "Generate Hardware" and "Create Automatic TestBenches" are enabled.

4. Click the Start Simulation button or menu item in Simulink.

When this completes successfully you are ready to run ModelSim. This can be done using the script command run_modelsim_atb(gcb), where gcb is the Simulink path for the top-level or any primitive subsystem. Alternatively, run_all_atbs(model_name) will invoke vsim on all subsystems. Both methods will automatically import the ModelSim results into the Matlab variable that was specified in step 2, provided ModelSim didn't halt prematurely on a mismatch.

Launching ModelSim interactively via the Run Modelsim block will also write the output to file; however, Matlab will not import it automatically, since Matlab cannot detect when the simulation has finished. Once ModelSim has reached the end of the stimulus file, the user will need to invoke the Matlab command import_atb_basevar(gcb) to complete the import process.

Due to the approximate nature of DSPBA's floating point implementation, the standard ModelSim-based ATB can sometimes misfire, incorrectly flagging a mismatch and then halting. Select “Warning” under the "Action on Channel Mismatch” option to prevent this. Responsibility for detecting mismatches must then be taken over by a user-defined Matlab function, the name of which can be specified in the “Verification function” entry. To write this function, one needs to understand the format of the data that is imported into the user-specified Matlab variable.

Imported Data

The imported simulation results are stored as a containers.Map in the base workspace. The map is indexed by strings that are Simulink paths.

>> ks = vsimOut.keys
ks =
    'DUT'    [1x24 char]
>> ks{2}
ans =
DUT/Subsystem/ChannelOut

The output signals are grouped according to the ATB component that produced them: device level subsystem, Channel-Out, or GP-Out blocks. Each signal can be accessed by field name. Their corresponding stimulus is also available for comparison.

>> vsimOut(ks{1})
ans =
        v0: [5000x1 embedded.fi]
    v0_stm: [5000x1 embedded.fi]
        q0: [5000x1 single]
    q0_stm: [5000x1 single]
        c0: [5000x1 embedded.fi]
    c0_stm: [5000x1 embedded.fi]

Verification

Traditional bit accurate matches are straightforward to perform:

all(vsimOut('DUT').c0 == vsimOut('DUT').c0_stm)

As are the fuzzier kind:

>> err = vsimOut('DUT').q0 - vsimOut('DUT').q0_stm;
>> max(abs(err(vsimOut('DUT').v0 == 1))) < 0.001

Notice how comparisons can be restricted to cycles when the valid signal is high. The specific comparisons that are actually carried out will depend on the application domain. More sophisticated designs (e.g. Mandelbrot) will also require re-ordering the output prior to comparison.

The verification function is passed the Map as a parameter. This m-function needs to be robust against empty structs in the Map, which can result when ModelSim is invoked on only a subset of ATBs.

See demo_fpfft_1024_mr42.mdl and verify_fpfft.m for an example which dynamically sets a threshold and compares complex numbers.

See Mandelbrot_S.mdl and vsim_mb.m for an example which verifies results that can be output in a different order.

Mapping

The "Import valid map to variable" field allows users to specify a variable in the base workspace that will store another containers.Map. For each data signal, this Map indicates by name which valid signal is associated with it.

>> validMap('DUT')
ans =
        q0: 'v0'
    q0_stm: 'v0_stm'
        c0: 'v0'
    c0_stm: 'v0_stm'
>> validMap(ks{2})
ans =
          cout: 'valid'
      cout_stm: 'valid_stm'
        result: 'valid'
    result_stm: 'valid_stm'

This should assist in writing reusable verification functions that are shared by different models despite having different port names.

Further development of this feature will depend on user requirements. In the future, there will be a library of commonly used verification functions that users can extend and contribute to.
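As an illustration of such a reusable function, the sketch below walks the imported Map, uses the valid map to pair each signal with its valid, and applies a single error threshold. The function name, threshold and structure are our assumptions, not a DSPBA API:

```matlab
function passed = verify_generic(vsimOut, validMap)
% Illustrative sketch: compare every captured signal against its *_stm
% counterpart, restricted to cycles where its valid signal is high.
passed = true;
ks = vsimOut.keys;
for k = 1:numel(ks)
    results = vsimOut(ks{k});
    if isempty(results) || isempty(fieldnames(results))
        continue;                      % only a subset of ATBs was run
    end
    vmap = validMap(ks{k});
    names = fieldnames(vmap);
    for n = 1:numel(names)
        name = names{n};
        if ~isempty(strfind(name, '_stm')), continue; end
        v   = results.(vmap.(name)) == 1;           % valid cycles only
        err = double(results.(name)(v)) - ...
              double(results.([name '_stm'])(v));
        passed = passed && all(abs(err) < 1e-3);    % assumed threshold
    end
end
```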

Example: verification function for Mandelbrot design

function passed = verify_mb(vsim_mb)
% Verify Mandelbrot results
% The order of results is dependent on the floating point comparison:
% pixel colors can appear in a different order in Simulink and HDL.
% This function captures the outputs, plots both the ModelSim HDL and
% Simulink simulation results, and shows where any pixels differ.
passed = 0;
% In this design there is just one ChannelOut block
% ... 'DUT/ChannelOut'
% and the variables are
%     qv:         [120000x1 embedded.fi]
%     qv_stm:     [120000x1 embedded.fi]
%     qc:         [120000x1 embedded.fi]
%     qc_stm:     [120000x1 embedded.fi]
%     qCoord:     [120000x2 embedded.fi]
%     qCoord_stm: [120000x2 embedded.fi]
%     qColor:     [120000x1 embedded.fi]
%     qColor_stm: [120000x1 embedded.fi]
results = vsim_mb('DUT/ChannelOut');
hdl_p = []; sim_p = [];   % initialise so the isempty checks below are safe
if ~isempty(results)
    % Loop through the results, capturing the pixel colors from the
    % valid data associated with each coordinate
    for i = 1:length(results.qCoord)
        if (results.qv(i) == 1)
            % hdl_p is the valid output from the ModelSim HDL simulation
            hdl_p(int(results.qCoord(i,2))+1, int(results.qCoord(i,1))+1) = ...
                int(results.qColor(i));
        end
        if (results.qv_stm(i) == 1)
            % sim_p is the valid output from the Simulink simulation
            sim_p(int(results.qCoord_stm(i,2))+1, int(results.qCoord_stm(i,1))+1) = ...
                int(results.qColor_stm(i));
        end
    end
    if isempty(hdl_p)
        error('No valid ModelSim data generated. Aborting plot.');
    else
        % Plot the ModelSim simulation results
        figure('Name','ModelSim Results'); imagesc(hdl_p);
    end
    if isempty(sim_p)
        error('No valid Simulink data generated. Aborting plot.');
    else
        % Plot the Simulink simulation results
        figure('Name','Simulink Results'); imagesc(sim_p);
    end
    if ~isempty(hdl_p) && ~isempty(sim_p)
        % Create an array of differences. This will be 0 at every
        % coordinate that matches, non-zero at every difference.
        diff_array = (hdl_p - sim_p);
        % Plot this to visualize the location of differences.
        figure('Name','Differences'); imagesc(diff_array);
        % Count the number of mismatched pixels
        num_mismatches = sum(sum(diff_array ~= 0));
        % The number of mismatches should ideally be zero.
        % However, the algorithm determines the pixel color of a
        % coordinate not in the Mandelbrot set according to how many
        % iterations pass before the sequence is known to be unbounded.
        % The simple test for this is comparing the magnitude squared
        % to 4. For some pixels this iterative value may be very close
        % to 4 on some iteration, such that in HDL the number of
        % iterations before exiting may differ from that in the
        % Simulink simulation. This is especially likely near the unit
        % circle, for points that take near the maximum number of
        % iterations to determine whether they remain bounded. Which is
        % correct? Perhaps whichever is closest to what a double
        % precision calculation would give.
        passed = (num_mismatches ................
