


FASTCUDA
Project No: 286770
D2.4 Final integrated version of CUDA to SystemC translator and user's manual
31st August 2013

Abstract:
This deliverable describes the functionality and operation of the tool that translates a CUDA kernel into a SystemC model, which can then be synthesized with any FPGA synthesis tool that accepts SystemC input (e.g. Xilinx Vivado, or Cadence CtoSilicon followed by Xilinx ISE). The tool also allows the designer to make implementation trade-offs, such as whether or not to unroll loops of the CUDA kernel. Loops that do not access global or shared memory can be unrolled to increase performance at the expense of area. The graphical user interface that guides the user through the translation was designed to be friendly and easy to use, even for users who are not intimately familiar with hardware design.

Document Manager: Luciano Lavagno, POLITO, Professor
Document Id: Final integrated version of CUDA to SystemC translator and user's manual
Version: V0.1
Date: 28/08/13
Filename: FASTCUDA-D2.4_POLITO_V0.1-28082013.docx

Disclaimer
This document contains material which is the copyright of certain FASTCUDA contractors, and may not be reproduced or copied without permission. All FASTCUDA consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

The FASTCUDA Consortium consists of the following companies:

  Participant no.    Participant organisation name                                               Short name  Country
  P1 (Coordinator)   Ingenieria de Sistemas Intensivos en Software Ltd                           ISIS        Spain
  P2                 Politecnico di Torino                                                       POLITO      Italy
  P3                 Universidad Politécnica de Madrid                                           UPM         Spain
  P4                 Telecommunication Systems Institute (TSI - Technical University of Crete)   TSI         Greece
  P5                 Ardoran                                                                     ARD         Estonia
  P6                 FSResult GmbH                                                               FSR         Germany

The information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.
The user thereof uses the information at its sole risk and liability. An up-to-date version of this document can be found on FASTCUDA's website.

Introduction

In this deliverable we describe in detail how a CUDA kernel can be translated into a SystemC implementation that follows the interfacing mechanism with the host processor and shared memory defined in D2.2 and D5.1. The graphical user interface that guides the user through this process requires minimal intervention. The only step in which a human decision may be needed is whether or not to unroll some loops. By default, no loop is unrolled, so in the final hardware implementation each iteration requires at least one clock cycle. In order to speed up the hardware implementation of the kernel, however, it may be advantageous to unroll loops that involve only internal computations, both to reduce control overhead and to offer more parallelism to the hardware scheduler. This operation cannot be automated, because it involves decisions that only a human can make; the GUI makes it easy by allowing the designer to refer directly to loops in the original code.

The reader is referred to D2.1 for a definition of the supported CUDA subset. The restriction to only one CUDA kernel per file, described in earlier internal versions of this document, has been lifted, as mentioned in D2.3.

Since D2.3, a major change has been introduced in the workflow, namely support for executing Vivado HLS (the Xilinx high-level synthesis tool) directly from the GUI. Moreover, several bugs have been fixed, and the flow has been extensively tested on a number of CUDA test cases.

Tool installation

The final version of the tools can be found in the following SVN repository:

    svn://wormtongue.polito.it/svn/fastcuda

Please contact fastcuda-synthesis@gforge.inetsis.es for support with installation or usage, including bug reports.
The FASTCUDA graphical user interface (GUI) is developed with Qt, a cross-platform application framework. In order to compile the application you should install the Qt widget toolkit library, e.g. by executing:

    $ sudo apt-get install libqt4-core libqt4-dev libqt4-gui qt4-dev-tools

or whatever other command is required on your Linux installation to install the package.

Moreover, the GUI relies on a modified version of the MCUDA translator from CUDA to C (please see the MCUDA web site for details on the MCUDA license). MCUDA generates:
- SystemC code for the body of each kernel to be implemented in hardware, and
- the information that the FASTCUDA GUI needs in order to synthesize the SC_MODULE interface in SystemC.

Note that MCUDA is already precompiled (from Java) when checking out the FASTCUDA software from the repository mentioned above.

The commands required to compile the GUI are:

    $ cd src/GUI
    $ qmake -project QT+=core QT+=gui QT+=xml
    $ qmake
    $ make clean
    $ make

Then you can run the application with:

    $ ./GUI

Please note that the name has changed since D2.3, to better qualify its functionality: the GUI used to be called "FastCuda".

Execution of Vivado HLS from the GUI requires that the tool is installed on the machine on which the GUI runs, and that it is executable from the command line as:

    $ vivado_hls

Tool usage

The GUI uses six main windows to control the FASTCUDA project, as shown in Figure 1.

Figure 1: Initial screenshot of the FASTCUDA GUI

In the left window there are two text editors, called "cuda" and "m-cuda", organized in a tab widget.

The "cuda" text editor is used for the CUDA input file, which is the starting point of the SystemC code generation. This text editor works like a normal Linux text editor (e.g. gedit).
The source code can also be loaded with the "File->Open" menu or the open icon, as shown in Figure 2.

Figure 2: CUDA file opening dialogue

The "m-cuda" text editor shows the generated C file, obtained from the MCUDA compiler, that is the body of the SystemC thread implementing the kernel in hardware. Its content generally does not need to be modified, but it is useful for selecting the loop(s) to be unrolled in the hardware implementation. In the example shown in Figure 3, the loop called "LOOP_11" is shown in the MCUDA editor and selected in the right-hand window to be unrolled.

Figure 3: MCUDA output and loop structure selection

In the middle window there are three tabs that show the results of executing MCUDA.

The "tree" tab represents the top level of the abstract syntax tree of the CUDA kernels. For each kernel it shows:
- cuda function: the name of the kernel.
- function declarator: the name and type of the input arguments (which will become signals in SystemC, as described in D2.2 and D5.1).
- compound statement: a list of the loops (labelled by MCUDA for convenience) in the CUDA source. Any of these loops can be selected to be unrolled by the SystemC synthesis tool in order to improve performance. Profiling information should be used as guidance when selecting which loops to unroll. Please note that unrolling a loop causes both an improvement in performance and an increase in area.

The "xml" tab is a text editor that shows the XML file produced by MCUDA (this is useful mostly for debugging purposes).

The "vivado" tab shows, after synthesis is performed in the next stage, a summary of the results of the FPGA implementation of the kernels.

In the bottom left window there is a log that shows information about the state of the FASTCUDA SystemC translation process, as shown in Figure 4.
The same figure also shows statistics on the cost of the FPGA implementation of the kernel and its local memories in terms of:
- Block RAM (BRAM) blocks, to implement CUDA shared memory and some local variables,
- DSP units, to implement multiplications and additions,
- flip-flops, to implement inter-thread and intra-thread control,
- look-up tables (LUTs), to implement random logic.

Figure 4: Final results after SystemC code generation and synthesis

This data, collected and shown for each kernel, together with the profiling and design space exploration data summarized in the two rightmost tabs, helps the designer make decisions on the best HW/SW partitioning. Changes to the earlier decisions about loop unrolling may of course also be necessary in order to meet the performance targets for HW kernels. More unrolling exposes more parallelism to the tool, and thus often improves performance. However, it increases the HW resources required, and may not be beneficial if the bottleneck of the loop being unrolled is due to memory accesses (which are limited by the number of BRAM ports). Figure 5 shows the Vivado HLS synthesis results for a CUDA test case containing two (very simple) kernels.

Figure 5: Final results for two HW kernels

The GUI offers the main commands through both a menu and an icon-based toolbar, in order to manage the CUDA source file and run the FASTCUDA translation steps:
- "File" and "Edit" control the CUDA text editor.
- "Project" controls the FASTCUDA translation steps:
  - "Compile" runs MCUDA on the CUDA file shown in the CUDA editor. The MCUDA output is displayed in the "m-cuda" text editor, as well as in the "tree" and "xml" tab windows.
  - "Synthesize" creates the SystemC and TCL files used by the synthesis tools (Vivado HLS from Xilinx and CtoSilicon from Cadence are currently supported).
    The contents of these files are displayed in the "Log" window. The "Synthesize" button also executes Vivado HLS, if it is installed on the host; the results are shown in the "vivado" tab in the central column.
  - "Estimation" performs software performance estimation, to evaluate which kernels represent the performance bottlenecks of the application. This is described in more detail in D4.1 (Implementation of estimation tools).
  - "Exploration" starts the design space exploration step which, given area and performance numbers, chooses the best HW/SW partitioning. This is described in more detail in D4.3 (Final implementation of exploration tool).

Output files

The FASTCUDA GUI creates four files for each processed CUDA kernel with name <kernel_name>, in a sub-directory called <kernel_name>__MCUDA_kernel:
- defines.h contains the macros that define:
  - the kernel name, derived from the CUDA function name by appending _MCUDA_kernel to it;
  - the module name, derived from the CUDA source file name without the .cu extension, by appending the kernel name to it.
- decl.h contains the kernel input argument names, declared as sc_in.
- unroll.tcl and directives.tcl contain the loop unrolling commands (if any) for the synthesis tool. The former uses the syntax of the CtoSilicon synthesis tool, the latter the syntax of the Vivado HLS tool. They can easily be converted to the format used by other tools; please contact the support team to use them with other tools.
Note: currently unroll.tcl and directives.tcl are created in the CUDA file directory, not in the kernel sub-directory.

These files are meant to be included into the top-level file found in the repository under the experiments/matmul/fpga directory, which contains:
- the top-level SystemC module interface,
- the start/ready protocol for synchronization with the host processor, and
- the interface with the FASTCUDA global memory controller (GMC), described in D5.1 (the interface is actually modelled in the gmem.h file).

Moreover, <CUDA_file_name>.c is created in the CUDA file directory (one level above the above-mentioned sub-directories). It contains the kernel bodies, translated into C++ and ready for inclusion as a SystemC thread.

The following files are automatically copied into each kernel sub-directory, in order to enable its synthesis:
- A file called gmem.h, which contains the interface between the synthesized kernel and global memory, using either the FASTCUDA global memory interface controller, or a TLM-2 AXI bus transactor interfacing directly to the Xilinx DDR3 controller.
- Some sample TCL scripts which can be used to synthesize the SystemC kernel using the CtoSilicon tool (the top level is called ctos.tcl, and it includes build.tcl and setup.tcl).
- A Vivado HLS project setup, in the sub-directory matmul, ready for synthesis using Vivado HLS.

Again, please contact the support team for help with using tools other than Vivado HLS and CtoSilicon.

Example design: matrix multiplication

The experiments/matmul directory in the repository contains an example design, namely the matrix multiplication kernel that was already described in the deliverables covering the FASTCUDA hardware synthesis strategy. The main source file is called matmul.cu, and it is taken directly from the CUDA programmer's guide.
The user can open it with the FASTCUDA GUI in order to walk through the SystemC synthesis steps.

The experiments/matmul/MatrixMulKernel__MCUDA_kernel sub-directory (created for the MatrixMulKernel kernel contained in matmul.cu) also contains:
- A simulation testbench that stimulates the matrix multiplication kernel to perform the multiplication of two 128x128 matrices.
- The constants.h file, which contains constants used to size the AXI burst cache, the AXI burst length, the memory latency when simulating the global memory interface, etc.
- Several files modelling:
  - the AXI master interface, and
  - the AXI slave and DDR3 controller provided by Xilinx.

These files can be used to perform a stand-alone simulation of the matrix multiplication kernel without the rest of the FASTCUDA infrastructure (multi-processor, memory controllers, etc.). The run_mon file contains the command line to simulate this setup using the Incisive simulator. Please note that valid licenses from Xilinx and Cadence are needed to use these tools and files.

The TCL scripts that drive CtoSilicon to generate a synthesizable RTL file in Verilog for the kernel are:
- ctos.tcl, the top-level synthesis script;
- build.tcl, the script (called by ctos.tcl) that reads in the SystemC code and builds the internal database (meant to be used for interactive synthesis with the CtoSilicon GUI);
- setup.tcl, the script (called by build.tcl) that defines the synthesis options, e.g.:
  - whether to use a Block RAM (BRAM) or registers to implement local and shared arrays;
  - whether to use the GMC or directly read/write the DRAM via the AXI bus.

The Vivado HLS project setup files are ready for a Virtex 7 synthesis run. They include the directives.tcl file that is generated by the FASTCUDA GUI for Vivado HLS, specifying the loops to unroll. Please edit them to change, for example, the target Xilinx FPGA device.
Figure 4 above shows the results of SystemC synthesis for the matrix multiplication test case.

Conclusion

This report showed how to use the FASTCUDA translation tool GUI to implement a CUDA kernel in SystemC. Integration of the tool with Vivado HLS was implemented in the last period of the project, and it has been used to perform a variety of design experiments.

Appendix

The appendix shows the synthesis report provided by Vivado HLS, which is used to generate the synthesis summary results in the "vivado" tab of the FASTCUDA GUI. The example comes from the matrix multiplication test case discussed above.

================================================================
== Report Version
================================================================
* Tool: Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC
* Version: 2012.3
* Build date: Fri Oct 12 10:57:10 AM 2012
* Copyright (C): 2012 Xilinx Inc. All rights reserved.

================================================================
== General Information
================================================================
* Project: project1
* Solution: solution1
* Date: Wed Feb 27 16:37:30 2013

================================================================
== User Assignments
================================================================
* Product Family: virtex7 virtex7_fpv6
* Part: xc7vx330tffg1157-2
* Top Model name: matmul_MatrixMulKernel_MCUDA_kernel
* Target clock period (ns): 10.00
* Clock uncertainty (ns): 1.25

================================================================
== Performance Estimates
================================================================
+ Summary of timing analysis:
  * Estimated clock period (ns): 8.53
+ Summary of overall latency (clock cycles):
  * Best-case latency: ?
  * Average-case latency: ?
  * Worst-case latency: ?

================================================================
== Area Estimates
================================================================
* Summary: (Target device: xc7vx330tffg1157-2)

  +---+-----------------+---------+-------+--------+--------+-------+
  | ID| Name            | BRAM_18K| DSP48E|      FF|     LUT|  SLICE|
  +---+-----------------+---------+-------+--------+--------+-------+
  |  0| Component       |       13|     24|    2215|    3136|      -|
  |  1| Expression      |        -|      -|       -|       -|      -|
  |  2| FIFO            |        -|      -|       -|       -|      -|
  |  3| Memory          |        -|      -|       -|       -|      -|
  |  4| Multiplexer     |        -|      -|       -|       -|      -|
  |  5| Register        |        -|      -|      99|       -|      -|
  |  6| ShiftMemory     |        -|      -|       -|       -|      -|
  +---+-----------------+---------+-------+--------+--------+-------+
  |  -| Total           |       13|     24|    2314|    3136|      0|
  |  -| Available       |     1500|   1120|  408000|  204000|  51000|
  |  -| Utilization (%) |       ~0|      2|      ~0|       1|      0|
  +---+-----------------+---------+-------+--------+--------+-------+

+ Details:
  * Component:
    | ID| Name                                                                                         | BRAM_18K| DSP48E|   FF|  LUT|
    |  0| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|       13|     24| 2215| 3136|
    |  -| Total                                                                                        |       13|     24| 2215| 3136|
  * Expression: N/A
  * FIFO: N/A
  * Memory: N/A
  * Multiplexer: N/A
  * Register:
    | ID| Name          | Bits| Consts| FF|
    |  0| DataOut_GM    |   32|      0| 32|
    |  1| RD_Address_GM |   32|      0| 32|
    |  2| RD_Req_GM     |    1|      0|  1|
    |  3| WR_Address_GM |   32|      0| 32|
    |  4| WR_Req_GM     |    1|      0|  1|
    |  5| ready         |    1|      0|  1|
    |  -| Total         |   99|      0| 99|
  * ShiftMemory: N/A

* Hierarchical Multiplexer Count:
    | ID| Name                                                                                         | Size| Bits| Count|
    |  0| (This level)                                                                                 |    0|    0|     0|
    |  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|  157| 1028|  2490|
    |  -| Total                                                                                        |  157| 1028|  2490|

================================================================
== Power Estimate
================================================================
* Summary:
    | ID| Name        | Power|
    |  0| Component   |   537|
    |  1| Expression  |     -|
    |  2| FIFO        |     -|
    |  3| Memory      |     -|
    |  4| Multiplexer |     -|
    |  5| Register    |     9|
    |  6| ShiftMemory |     -|
    |  -| Total       |   546|

* Hierarchical Register Count:
    | ID| Name                                                                                         | Count|
    |  0| (This level)                                                                                 |    99|
    |  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|  1945|
    |  -| Total                                                                                        |  2044|

================================================================
== Interface Summary
================================================================
* Interfaces (Scope, IO Protocol and IO Config columns, all empty, omitted):
    | ID| RTL Ports     | Object                                                                      | Type         | Dir| Bits|
    |  0| A             | A                                                                           | pointer      | in |   32|
    |  1| wB            | wB                                                                          | pointer      | in |   32|
    |  2| wA            | wA                                                                          | pointer      | in |   32|
    |  3| C             | C                                                                           | pointer      | in |   32|
    |  4| B             | B                                                                           | pointer      | in |   32|
    |  5| oblockIdx_x   | oblockIdx_x                                                                 | pointer      | in |   32|
    |  6| oblockIdx_y   | oblockIdx_y                                                                 | pointer      | in |   32|
    |  7| oblockIdx_z   | oblockIdx_z                                                                 | pointer      | in |   32|
    |  8| oblockDim_x   | oblockDim_x                                                                 | pointer      | in |   32|
    |  9| oblockDim_y   | oblockDim_y                                                                 | pointer      | in |   32|
    | 10| oblockDim_z   | oblockDim_z                                                                 | pointer      | in |   32|
    | 11| clk           | matmul_MatrixMulKernel__MCUDA_kernel::matmul_MatrixMulKernel__MCUDA_kernel | return value | in |    1|
    | 12| reset         | -                                                                           | -            | in |    1|
    | 13| start         | start                                                                       | pointer      | in |    1|
    | 14| ready         | ready                                                                       | pointer      | out|    1|
    | 15| RD_Req_GM     | RD_Req_GM                                                                   | pointer      | out|    1|
    | 16| RD_Address_GM | RD_Address_GM                                                               | pointer      | out|   32|
    | 17| ACK_GM        | ACK_GM                                                                      | pointer      | in |    1|
    | 18| DataIn_GM     | DataIn_GM                                                                   | pointer      | in |   32|
    | 19| WR_Req_GM     | WR_Req_GM                                                                   | pointer      | out|    1|
    | 20| WR_Address_GM | WR_Address_GM                                                               | pointer      | out|   32|
    | 21| DataOut_GM    | DataOut_GM                                                                  | pointer      | out|   32|