Wow! Even Microsoft uses AutoESL's C synthesis to speed up its SW

We purchased AutoESL's AutoPilot in 2008 to implement some of the time-

consuming cores in our software in FPGA hardware to speed up runtime.

We found this can often accelerate our SW runtimes by 2-3

orders of magnitude.  The AutoESL C-to-RTL synthesis tool claims to support

both Altera and Xilinx FPGAs, as well as ASICs, but we only tried it on

Altera Stratix II's.  Our software:



  1. RankBoost - a machine-learning algorithm used in the dynamic ranking

     of search engines.  RankBoost is several thousand lines of ANSI C

     with the synthesizable, time-consuming core being 149 lines.



     We used AutoPilot to generate RankBoost's core computation logic and

     integrate it with existing interface IP cores like the DDR2 controller

     IP core.  AutoESL utilized common megafunctions for the target devices

     and automatically generated the Avalon bus interface.  The final

     implementation had about 12000 ALUTs on the Altera Stratix II FPGA.



  2. Sorting Algorithm - also several thousand lines of OO C++ code with 138

     lines that needed speeding up.  AutoESL again utilized megafunctions

     for our Stratix II's.  Additionally, we used AutoPilot's "APInt", or

     arbitrary precision integers.  AutoPilot has a source-level simulation

     utility for APInt, and the resource usage depends on the size of the

     array in the processing engine used.  For example, when the sorting

     processing engine array size is 128, synthesis reports a total of

     11346 ALUTs.



In general, AutoPilot takes a high level description of a design in ANSI C,

OO C++ or SystemC as input, and synthesizes it into Verilog/VHDL RTL code.

AutoPilot also automatically generates a vector-based testbench from the

C/C++ level testbench for users to use to verify the design.
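
To make the C-level testbench idea concrete, here is a minimal sketch in
plain C.  It is not taken from our RankBoost code: the top function
running_sum_top() and the driver run_testbench() are hypothetical names
chosen for illustration.  The pattern is a driver that feeds fixed stimulus
to the function to be synthesized and compares the result against golden
values.

```c
#include <assert.h>
#include <stddef.h>

#define N_SAMPLES 8

/* Hypothetical top-level function to be synthesized: a running sum. */
void running_sum_top(const int *in, int *out, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}

/* C-level testbench: drives the top function with fixed stimulus and
 * compares the output with golden values.  Returns the mismatch count. */
int run_testbench(void)
{
    const int stimulus[N_SAMPLES] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int golden[N_SAMPLES]   = {1, 3, 6, 10, 15, 21, 28, 36};
    int result[N_SAMPLES] = {0};
    int mismatches = 0;

    assert(sizeof stimulus == sizeof golden);
    running_sum_top(stimulus, result, N_SAMPLES);
    for (size_t i = 0; i < N_SAMPLES; ++i) {
        if (result[i] != golden[i])
            ++mismatches;
    }
    return mismatches;
}
```

Calling run_testbench() from main() and returning its value as the exit
code gives a pass/fail check that can be rerun after every change to the C
source.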





ANSI C vs. OO C++ vs. SystemC



AutoPilot supported all the ANSI C and C++ language constructs we needed

to implement our algorithms in hardware.  Our standard C/C++

function parameters were synthesized into various handshaking, memory,

streaming, and bus interfaces.  We didn't test all the language features

that AutoESL claimed to support, but I believe our two cases covered the

most commonly used language features that we may use as input to AutoPilot.



For our RankBoost, we used ANSI C.  In the Sorting Algorithm, we used C++.



  1. We wrote our RankBoost design in ANSI-C. (C is simple and compact).



  2. We wanted to implement the Sorting Algorithm using an object oriented

     style code; since ANSI-C is not object oriented, we used C++ for it.

     We wrote the Sorting Algorithm to take advantage of a couple of C++

     features, including classes and templates, so that the code itself

     would be more generic and reusable.  For example, our data element

     types could be easily configured with template parameters.



  3. We used AutoESL's APInt data type (arbitrary precision integer data

     type) for the Sorting Algorithm.  APInt is supported in both C and C++,

     but using APInt in C++ was easier for us, since AutoESL's C++ APInt

     is a templatized class.
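
For readers unfamiliar with arbitrary precision integers, the sketch below
emulates in plain C what an N-bit unsigned value does: only the low N bits
are kept, so the hardware needs only N-bit registers and adders.  This is
not AutoPilot's APInt API; wrap_to_width() and add_n_bit() are hypothetical
helpers written only to show the wrap-around behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `bits` bits of v, emulating an unsigned integer of
 * that width (1 <= bits <= 32).  Hypothetical helper, not APInt itself. */
uint32_t wrap_to_width(uint32_t v, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    if (bits == 32)
        return v;
    return v & ((UINT32_C(1) << bits) - 1u);
}

/* Add two values as if both were stored in `bits`-wide registers. */
uint32_t add_n_bit(uint32_t a, uint32_t b, unsigned bits)
{
    return wrap_to_width(wrap_to_width(a, bits) + wrap_to_width(b, bits),
                         bits);
}
```

With a 12-bit width, add_n_bit(4095, 1, 12) wraps to 0, just as a 12-bit
hardware adder would.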



We never tested the object-oriented C++ code in AutoESL; we had committed to

one particular Sorting Algorithm (odd-even sort) with fixed data type for

the implementation, and OO was not a must-have for this purpose.
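
Our actual sorting source is not shown here.  As a point of reference, the
textbook odd-even transposition sort can be sketched in plain C as below;
the function names are hypothetical.  Each phase performs independent
compare-exchange steps on neighboring pairs, which is why the algorithm maps
naturally onto an array of identical processing elements.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange: after the call, *lo <= *hi. */
static void compare_exchange(int *lo, int *hi)
{
    if (*lo > *hi) {
        int tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
}

/* Odd-even transposition sort: n phases are enough to sort n elements. */
void odd_even_sort(int *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t phase = 0; phase < n; ++phase) {
        /* Even phases pair (0,1),(2,3),...; odd phases pair (1,2),(3,4),...
         * The pairs within one phase do not overlap, so in hardware they
         * could all be evaluated at the same time. */
        for (size_t i = phase % 2; i + 1 < n; i += 2)
            compare_exchange(&data[i], &data[i + 1]);
    }
}
```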





Design exploration:



One aspect of AutoPilot is that its fast runtime allowed us to do in-depth

explorations of the design space.  For RankBoost's core computation logic,

we investigated different performance/area tradeoffs while doing quick

retargeting from one ANSI C source to 2 different FPGA tech libraries in an

Altera Stratix-II:



        Design and Lib          Mem      FP     Reg    Logic  Latency
                               (bits)  adders   FFs    ALUTs  (cycles)   MHz
        ---------------------  ------  ------  -----  -----  --------  ------
        AutoPilot, 8 PEs,       128K      8    7911   5886     19M     140.55
        XtremeData floating
        point lib

        AutoPilot, 8 PEs,       128K      8    6295   5295     19M     107.03
        Altera FPU lib

        AutoPilot, 12 PEs,      144K     12    9999   9706     14M     105.49
        Altera FPU lib



Hand-generated code vs. AutoPilot generated code:



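        Design                  Mem      FP     Reg    Logic  Latency
                               (bits)  adders   FFs    ALUTs  (cycles)   MHz
        ---------------------  ------  ------  -----  -----  --------  ------
        Hand-coded RTL          128K      8    5373   5523     19M     125.00

        AutoPilot (final        128K      9    5453   5316     19M     125.00
        design)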



AutoPilot's RTL code generation time for the core SW in RankBoost was only

about 1.5 minutes -- near-zero compared to our time to hand-code RTL.

Because of AutoPilot's fast synthesis time, we did additional design space

exploration to select the best configuration for our design.

We were able to get a QoR comparable to hand-coded RTL yet we still saw a

75% project time savings.



        Manual RTL creation time, including verification: 2 months

     AutoPilot RTL creation time, including verification: 2 weeks



The above time to create RTL with AutoPilot included 5 major revisions of

our C code for RankBoost.  We had cropped the initial code from RankBoost's

software implementation, and found the original coding style could be more

efficiently written for C synthesis implementation and optimization.  We

had two kinds of modifications on RankBoost:



  1. Modifying the ANSI C code for better C synthesis.  For example, the

     major body of our code was initially written in the main() function.

     For synthesis, we wrapped the code into a separate function called from main(),

     with this new function specified as the top module to be synthesized.



     We also made changes to the parameters of the function and assigned

     the interface type to the input and output as the following shows:



           void foo(float *mem_data,
                    volatile uint64 *input_dataport1,
                    volatile float *input_dataport2,
                    int size,
                    volatile float *output)
           {
           #pragma AUTOPILOT INTERFACE fifo port=input_dataport1
           #pragma AUTOPILOT INTERFACE fifo port=input_dataport2
           #pragma AUTOPILOT INTERFACE fifo port=output

           // major body of the code here

           }



     Note: The "volatile" pointer type is needed to specify a FIFO.  If a

     pointer is marked as volatile, the compiler won't optimize the number

     and order of its read and write accesses.



  2. Modifying C code for improving code optimization.  For example, in our

     initial code, we had



                      for (j = 0; j < 255; ++j)
                        {
                          k = 255 - j;
                          fHisto[k - 1] += fHisto[k];
                        }



     This piece of code was used to build an integral histogram from a

     256-bin histogram.  We had thousands of histograms to be processed.

     Each histogram is stored in an array declared as



                      float fHisto [256];



     Since the floating point adder in Altera's megafunction library needs

     7 to 8 cycles to output the result and there is a read-after-write

     dependency, the addition operation could not be fully pipelined in the

     above code.  To remove bubbles in the pipeline, we put 16 histograms

     together:

                      float fHisto [16][256];



     And then processed them in an interleaved manner:



                      for (j = 0; j < 255; ++j)
                        {
                          for (i = 0; i < 16; ++i)
                            {
                              #pragma AUTOPILOT pipeline II=1
                              k = 255 - j;
                              fHisto[i][k - 1] += fHisto[i][k];
                            }
                        }



     Notice that we used a pragma to specify the loop pipelining interval.

     To boost data-level parallelism, we implemented 8 more pipelines since

     the histograms are independent of each other:



            float fHisto[8][16][256];

            for (j = 0; j < 255; ++j)
              {
                for (i = 0; i < 16; ++i)
                  {
                    #pragma AUTOPILOT pipeline II=1
                    for (k = 0; k < 8; ++k)
                      {
                        #pragma AUTOPILOT unroll
                        fHisto[k][i][255 - j - 1] += fHisto[k][i][255 - j];
                      }
                  }
              }



     This code was then synthesized to an 8-way SIMD (Single Instruction,

     Multiple Data) engine.  Through these code changes, we avoided the

     bubbles in the RankBoost pipeline, reduced the latency, and fully

     utilized data parallelism with an 8-way SIMD architecture.



     I would like to mention that we could easily change it to a 16-way SIMD

     by simply adding and modifying a few lines in the RankBoost C code.
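
The restructuring in item 2 only changes the order in which independent
histograms are visited, so it should not change any result.  The sketch
below is a plain C software check of that claim, with the pragmas omitted.
The function names and the test pattern are hypothetical and not from our
code.  In the interleaved loop nest, two consecutive updates of the same
histogram are separated by the updates of all the other histograms, which
gives the floating point adder time to finish before its result is needed
again.

```c
#include <assert.h>

#define BINS  256
#define GROUP 16   /* histograms interleaved per pipeline, as in the text */
#define LANES 8    /* parallel pipelines, as in the text */

/* Reference: the original single-histogram loop. */
void integrate_one(float h[BINS])
{
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;
        h[k - 1] += h[k];
    }
}

/* Restructured loop nest from the text, without the synthesis pragmas. */
void integrate_interleaved(float h[LANES][GROUP][BINS])
{
    for (int j = 0; j < BINS - 1; ++j)
        for (int i = 0; i < GROUP; ++i)
            for (int k = 0; k < LANES; ++k)
                h[k][i][BINS - 1 - j - 1] += h[k][i][BINS - 1 - j];
}

/* Fill two copies with the same small integers, run both versions and
 * return how many bins differ. */
int count_mismatches(void)
{
    static float a[LANES][GROUP][BINS];
    static float b[LANES][GROUP][BINS];
    int mismatches = 0;

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int bin = 0; bin < BINS; ++bin)
                a[k][i][bin] = b[k][i][bin] = (float)((k + i + bin) % 7);

    integrate_interleaved(a);
    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            integrate_one(b[k][i]);

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int bin = 0; bin < BINS; ++bin)
                if (a[k][i][bin] != b[k][i][bin])
                    ++mismatches;
    return mismatches;
}
```

Because each histogram sees exactly the same sequence of additions in both
versions, the comparison can use exact equality even for floats.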



On our Sorting Algorithm, the generated logic from AutoPilot was so close to

our theoretically optimal results that we saw no reason to implement it

manually for comparison purposes.  We just used AutoPilot's RTL.  So I don't

have hand-coded RTL vs. AutoPilot RTL data for the Sorting Algorithm.





The set-up and learning curve for AutoPilot:



It took us less than 1 day to set up the AutoPilot environment for the first

time, and only several minutes for the follow-on designs.



In the early stages, our design methodology was an iterative loop between

constraining AutoPilot synthesis and results analysis with its built-in

Control Data Flow Graph (CDFG).  Later, we started with the targeted micro

architecture in mind and then we created the C/C++ code plus corresponding

synthesis directives.  So it was important to our implementation to be

familiar with AutoPilot's directives.  Here's our ramp-up for AutoESL:



  - 1 to 2 days for onsite training on AutoPilot: basics, methodologies,

    tool setup, hands-on tutorials.



  - 1 to 2 weeks to begin with your own design and learn by doing.  In

    our case, we did this with our RankBoost project.



  - 3 to 4 additional weeks to try out AutoPilot's other advanced features

    like: simulation, integration with SoPC, customized IP, floating point,

    advanced language optimization, etc.  This process may take some time.

    I prefer a "learning by doing" style here too, because some advanced

    features only apply in special cases.  We did this with our

    Sorting Algorithm project.



So, overall a hardware designer experienced in RTL simulation and synthesis

should expect to spend 6 to 7 weeks getting ramped up on AutoESL.  Much of

this depends on how deeply they want to learn its advanced features:



  - Controls.  Our users control results in several ways, including adding

    synthesis directives to control pipelining, interfaces, and memory

    using Tcl commands or pragmas or the GUI.



  - GUI.  AutoPilot has a GUI for users to understand the generated logic.

    For example, it has a schedule viewer to visualize the scheduling

    result and a report view so you can easily compare QoRs for different

    implementations.



  - Floating point synthesis.  We used single-precision float type and

    floating point adders for RankBoost; AutoPilot fully supports these

    standard single- and double-precision floating point data types for

    Altera platforms.  We could directly synthesize common floating-point

    math routines such as square root, exponentiation, logarithm, etc.



  - Loop and hierarchical function pipelining.  AutoESL's loop pipelining

    allows multiple successive iterations of a loop to execute in parallel

    by initiating one iteration before the previous one has completed.

    This can optimize the design for both loop throughput and latency.



  - Power reduction.  AutoPilot's optimization also includes various

    transformations for power reduction, including Operation Gating, MUX

    optimization and reduction, FSM coding, pipeline register gating, clock

    gating as well as using given Multi-Vdd assignment.  We don't pay much

    attention to power consumption with our current FPGAs so we didn't use

    this power functionality, but it's an important feature and we would

    like to try it out in the near future.



  - Interface Synthesis.  Our designers use AutoPilot's standard function

    parameters to infer the desired inputs and outputs to the environment

    rather than hand code any target-specific interface timing behaviors

    into our C/C++ source.  AutoPilot's interface synthesis converts the

    parameter reads and writes into the actual interface accesses.  The

    direction of the data transfer is inferred from the way a parameter is

    used in the function body.



    For example, based on the specified communication interfaces in the

    platform library, a store operation on a scalar pointer (e.g., *p = x)

    can be turned into a direct wire connection, or a FIFO write, or even

    a bus write transfer.  This helps tremendously to keep our designers

    away from the "devil-is-in-the-details" of the target platform and

    focus more on developing the functional/algorithmic part of the design.



    Currently, AutoPilot supports the following types of interfaces:



          - Wire interface,

          - Buffer interface,

          - Memory interface,

          - FIFO interface,

          - Bus interface



    The user can control the selection of interface with a few pragmas.
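
To put rough numbers on the loop pipelining item in the list above, here is
a simplified cycle model in plain C.  It is a back-of-the-envelope sketch,
not AutoPilot's scheduler or its reported latency.  The function names are
hypothetical, and the iteration depth is an assumed value.  The model says
that a pipelined loop pays the full iteration depth once and then completes
one iteration every II cycles.

```c
#include <assert.h>

/* Cycles when iterations do not overlap: each one waits for the last. */
unsigned long sequential_cycles(unsigned long iterations,
                                unsigned long depth)
{
    return iterations * depth;
}

/* Cycles when a new iteration starts every `ii` cycles (initiation
 * interval) and each iteration takes `depth` cycles from start to end. */
unsigned long pipelined_cycles(unsigned long iterations,
                               unsigned long depth,
                               unsigned long ii)
{
    assert(depth >= 1 && ii >= 1);
    if (iterations == 0)
        return 0;
    return depth + ii * (iterations - 1);
}
```

For example, with 255 * 16 = 4080 iterations, an assumed depth of 8 cycles
and II=1, the model gives 4087 cycles pipelined versus 32640 cycles without
overlap.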





AutoPilot's negatives:



  - It needs a better user interface for its CDFG.



  - Needs better 3rd party tool chain support.  It took us a while to

    set up the whole tool chain including our ModelSim RTL simulator and

    Altera Quartus II FPGA implementation tools.



  - AutoESL claims that AutoPilot supports Altera's SOPC builder tool

    and Avalon bus interconnects.  However, we did not test these.



  - It needs better Verilog support.  AutoPilot includes some libraries

    written only in VHDL, for example a few platform-specific bus

    interface adaptors are generated only in VHDL.  It would be better

    if the Verilog version were generated as well.



AutoESL's technical support was professional and they covered the product,

integration into the design flow, and language.  We gave AutoPilot a first

look in 2007 and it's been delivering major features for FPGA-based design

in its recent releases.  This tool can produce very acceptable results in

a very short time.



I give AutoPilot a score of 4 out of 5 and would strongly recommend

it to others.  



An AutoPilot customer in China:
Wow! Even Microsoft uses AutoESL's C synthesis - DeepChip Homepage
