Wow! Even Microsoft uses AutoESL's C synthesis to speed up its SW

We purchased AutoESL's AutoPilot in 2008 to move some of the time-consuming
cores in our software into FPGA hardware for runtime speed-up. We found
this can often accelerate our SW runtimes by 2-3 orders of magnitude. The
AutoESL C-to-RTL synthesis tool claims to support both Altera and Xilinx
FPGAs, as well as ASICs, but we only tried it on Altera Stratix II's. Our
software:

1. RankBoost - a machine-learning algorithm used in the dynamic ranking
of search engines. RankBoost is several thousand lines of ANSI C,
with the synthesizable time-consuming core being 149 lines.

We used AutoPilot to generate RankBoost's core computation logic and
integrate it with existing interface IP cores like the DDR2 controller
IP core. AutoESL utilized common megafunctions for the target devices
and automatically generated the Avalon bus interface. The final
implementation had about 12000 ALUTs on the Altera Stratix II FPGA.

2. Sorting Algorithm - also several thousand lines of OO C++ code with 138
lines that needed speeding up. AutoESL again utilized megafunctions
for our Stratix II's. Additionally, we used AutoPilot's "APInt", or
arbitrary precision integers. AutoPilot has a source-level simulation
utility for APInt, and the resource usage depends on the size of the
array in the processing engine used. For example, when the sorting
process engine array size is 128, the synthesis result shows a total
of 11346 ALUTs.

In general, AutoPilot takes a high level description of a design in ANSI C,
OO C++ or SystemC as input, and synthesizes it into Verilog/VHDL RTL code.
AutoPilot also automatically generates a vector-based testbench from the
C/C++ level testbench for users to use to verify the design.


ANSI C vs. OO C++ vs. SystemC

AutoPilot supported all the ANSI C and C++ language constructs we needed
to implement our algorithms in hardware. Our standard C/C++ function
parameters were synthesized into various handshaking, memory, streaming,
and bus interfaces. We didn't test all the language features that AutoESL
claimed to support, but I believe our two cases covered the language
features we are most likely to use as input to AutoPilot.

For our RankBoost, we used ANSI C. In the Sorting Algorithm, we used C++.

1. We wrote our RankBoost design in ANSI-C. (C is simple and compact).

2. We wanted to implement the Sorting Algorithm using an object oriented
style code; since ANSI-C is not object oriented, we used C++ for it.
We wrote the Sorting Algorithm to take advantage of a couple of C++
features, including classes and templates, so that the code itself
would be more generic and reusable. For example, our data elements
types could be easily configured with template parameters.

3. We used AutoESL's APInt data type (arbitrary precision integer data
type) for the Sorting Algorithm. APInt is supported in both C and C++,
but using APInt in C++ with templates was easier, since AutoESL's C++
APInt is also a templatized class.

We never tested the object-oriented C++ code in AutoESL; we had committed to
one particular Sorting Algorithm (odd-even sort) with fixed data type for
the implementation, and OO was not a must-have for this purpose.
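
For readers who don't know odd-even sort, here is a minimal plain-C sketch
of the odd-even transposition scheme. It is not our production code: the
names SORT_N, compare_swap and odd_even_sort are made up for illustration,
the int element type is an assumption, and no AutoPilot directives or APInt
types are shown.

```c
#include <stddef.h>

/* Hypothetical array size, chosen to match the 128-element engine array
 * mentioned earlier in the text. */
#define SORT_N 128

/* Compare two neighbours and swap them if they are out of order. */
static void compare_swap(int *a, int *b)
{
    if (*a > *b) {
        int tmp = *a;
        *a = *b;
        *b = tmp;
    }
}

/* Odd-even transposition sort: SORT_N phases that alternate between the
 * pairs (0,1), (2,3), ... and the pairs (1,2), (3,4), ...
 * The compare-swaps inside one phase touch disjoint pairs, which is what
 * makes the algorithm a natural fit for an array of parallel engines. */
void odd_even_sort(int data[SORT_N])
{
    for (size_t phase = 0; phase < SORT_N; ++phase) {
        for (size_t i = phase % 2; i + 1 < SORT_N; i += 2) {
            compare_swap(&data[i], &data[i + 1]);
        }
    }
}
```

Because every phase has a fixed trip count and fixed access pattern, the
inner loop is the kind of code one would expect a C synthesis tool to
unroll into parallel compare-swap units.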


Design exploration:

One aspect of AutoPilot is that its fast runtime allowed us to do in-depth
explorations of the design space. For RankBoost's core computation logic,
we investigated different performance/area tradeoffs while doing quick
retargeting from one ANSI C source to 2 different FPGA tech libraries in an
Altera Stratix-II:

  Design                 Mem     FP     Reg    Logic  Latency
  and Lib                (bits)  adder  FFs    ALUTs  cycles   MHz
  ---------------------  ------  -----  -----  -----  -------  ------
  AutoPilot              128K    8      7911   5886   19M      140.55
   (8 PEs, XtremeData
   floating point lib)

  AutoPilot              128K    8      6295   5295   19M      107.03
   (8 PEs, Altera
   FPU lib)

  AutoPilot              144K    12     9999   9706   14M      105.49
   (12 PEs, Altera
   FPU lib)

Hand-generated code vs. AutoPilot generated code:

  Design                 Mem     FP     Reg    Logic  Latency
                         (bits)  adder  FFs    ALUTs  cycles   MHz
  ---------------------  ------  -----  -----  -----  -------  ------
  Hand-coded RTL         128K    8      5373   5523   19M      125.00

  AutoPilot              128K    9      5453   5316   19M      125.00
   (final design)

AutoPilot's RTL code generation time for the core SW in RankBoost was only
about 1.5 minutes -- near-zero compared to our time to hand-code RTL.
Because of AutoPilot's fast synthesis time, we did additional design space
exploration to select the best configuration for our design.
We were able to get a QoR comparable to hand-coded RTL yet we still saw a
75% project time savings.

Manual RTL creation time, including verification: 2 months
AutoPilot RTL creation time, including verification: 2 weeks

The above time to create RTL with AutoPilot included 5 major revisions of
our C code for RankBoost. We had cropped the initial code from RankBoost's
software implementation, and found the original coding style could be more
efficiently written for C synthesis implementation and optimization. We
had two kinds of modifications on RankBoost:

1. Modifying the ANSI C code for better C synthesis. For example, the
major body of our code was initially written in the main() function.
For synthesis, we moved the code into a separate function called from
main(), with this new function specified as the top module to be
synthesized.

We also made changes to the parameters of the function and assigned
the interface type to the input and output as the following shows:

    void foo(float *mem_data,
             volatile uint64 *input_dataport1,
             volatile float *input_dataport2,
             int size,
             volatile float *output)
    {
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport1
    #pragma AUTOPILOT INTERFACE fifo port=input_dataport2
    #pragma AUTOPILOT INTERFACE fifo port=output

        // major body of the code here

    }

Note: The "volatile" pointer type is needed to specify a FIFO. If a
pointer is marked as volatile, the compiler won't optimize the number
and order of its read and write accesses.

2. Modifying C code for improving code optimization. For example, in our
initial code, we had

    for (j = 0; j < 255; ++j)
    {
        k = 255 - j;
        fHisto[k - 1] += fHisto[k];
    }

This piece of code was used to build an integral histogram from a
256-bin histogram. We had thousands of histograms to be processed.
Each histogram is stored in an array declared as

    float fHisto[256];

Since the floating point adder in Altera's megafunction library needs
7 to 8 cycles to output the result and there is a read-after-write
dependency, the addition operation could not be fully pipelined in the
above code. To remove bubbles in the pipeline, we put 16 histograms
together:

    float fHisto[16][256];

And then processed them in an interleaved manner:

    for (j = 0; j < 255; ++j)
    {
        for (i = 0; i < 16; ++i)
        {
    #pragma AUTOPILOT pipeline II=1
            k = 255 - j;
            fHisto[i][k - 1] += fHisto[i][k];
        }
    }

Notice that we used a pragma to specify the loop pipelining interval.
To boost data-level parallelism, we implemented 8 more pipelines since
the histograms are independent of each other:

    float fHisto[8][16][256];

    for (j = 0; j < 255; ++j)
    {
        for (i = 0; i < 16; ++i)
        {
    #pragma AUTOPILOT pipeline II=1
            for (k = 0; k < 8; ++k)
            {
    #pragma AUTOPILOT unroll
                fHisto[k][i][255 - j - 1] += fHisto[k][i][255 - j];
            }
        }
    }

This code was then synthesized to an 8-way SIMD (Single Instruction,
Multiple Data) engine. Through these code changes, we avoided the
bubbles in the RankBoost pipeline, reduced the latency, and fully
utilized data parallelism with an 8-way SIMD architecture.
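
To make the benefit of interleaving concrete, here is a rough
back-of-the-envelope cycle model in C. It is a sketch under loud
assumptions and not taken from any AutoPilot report: the adder latency is
fixed at 8 cycles, memory access latency is ignored, the loop nest is
treated as one flattened pipeline, and a pipelined loop with trip count n,
initiation interval ii and depth d is assumed to take (n - 1) * ii + d
cycles. All names in the code are hypothetical.

```c
/* Assumed constants for the model (only GROUP, LANES and the 255 adds per
 * histogram come from the text; ADD_LATENCY picks the upper end of the
 * "7 to 8 cycles" quoted for the adder). */
enum { ADDS_PER_HISTO = 255, ADD_LATENCY = 8, GROUP = 16, LANES = 8 };

/* Cycles for a pipelined loop: one new iteration every ii cycles, plus
 * depth cycles for the last iteration to finish. */
static long pipelined_cycles(long n, long ii, long depth)
{
    return (n - 1) * ii + depth;
}

/* Original loop: each add needs the previous result, so a new iteration
 * can only start every ADD_LATENCY cycles. */
long cycles_single_histogram(void)
{
    return pipelined_cycles(ADDS_PER_HISTO, ADD_LATENCY, ADD_LATENCY);
}

/* Interleaved loop: GROUP independent histograms share one adder at
 * ii = 1. This only holds while GROUP >= ADD_LATENCY, so that each
 * partial sum is ready before its histogram comes around again. */
long cycles_interleaved_group(void)
{
    return pipelined_cycles((long)ADDS_PER_HISTO * GROUP, 1, ADD_LATENCY);
}

/* Histograms finished per pass of the 8-way engine. */
long histograms_per_pass(void)
{
    return (long)GROUP * LANES;
}
```

Under these assumptions a single histogram takes 2040 cycles on its own,
while a group of 16 interleaved histograms takes 4087 cycles in total, or
roughly 255 cycles per histogram. With 8 unrolled lanes, 128 histograms
finish in about the same 4087 cycles.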

I would like to mention that we could easily change it to a 16-way SIMD
by simply adding and modifying a few lines in the RankBoost C code.
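
One way to picture that kind of change is to pull the sizes out into
macros, as in the sketch below. This is an illustration and not our actual
RankBoost source: BINS, GROUP, LANES and the function names are made up,
and the pragmas are copied from the loops above (an ordinary C compiler
ignores them, possibly with a warning). The check function compares the
restructured loop nest against the original single-histogram loop in
software.

```c
#define BINS  256
#define GROUP 16   /* histograms interleaved per pipeline, as in the text */
#define LANES 8    /* hypothetical knob: 16 would give a 16-way engine */

/* Reference: the original single-histogram loop. */
void integral_ref(float h[BINS])
{
    for (int j = 0; j < BINS - 1; ++j) {
        int k = BINS - 1 - j;
        h[k - 1] += h[k];
    }
}

/* Restructured loop nest with the sizes parameterized. */
void integral_simd(float h[LANES][GROUP][BINS])
{
    for (int j = 0; j < BINS - 1; ++j) {
        for (int i = 0; i < GROUP; ++i) {
#pragma AUTOPILOT pipeline II=1
            for (int k = 0; k < LANES; ++k) {
#pragma AUTOPILOT unroll
                h[k][i][BINS - 1 - j - 1] += h[k][i][BINS - 1 - j];
            }
        }
    }
}

/* Returns 1 when both versions agree on every bin of every histogram. */
int integral_versions_match(void)
{
    static float simd[LANES][GROUP][BINS];
    static float orig[LANES][GROUP][BINS];
    float ref[BINS];

    /* Fill with small integer-valued test data so all sums are exact. */
    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i)
            for (int b = 0; b < BINS; ++b) {
                float v = (float)((k * 31 + i * 7 + b) % 13);
                simd[k][i][b] = v;
                orig[k][i][b] = v;
            }

    integral_simd(simd);

    for (int k = 0; k < LANES; ++k)
        for (int i = 0; i < GROUP; ++i) {
            for (int b = 0; b < BINS; ++b)
                ref[b] = orig[k][i][b];
            integral_ref(ref);
            for (int b = 0; b < BINS; ++b)
                if (ref[b] != simd[k][i][b])
                    return 0;
        }
    return 1;
}
```

Each histogram sees exactly the same sequence of additions in both
versions, so the restructuring changes the schedule but not the result.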

On our Sorting Algorithm, the generated logic from AutoPilot was so close to
our theoretically optimal results that we saw no reason to implement it
manually for comparison purposes. We just used AutoPilot's RTL. So I don't
have hand-coded RTL vs. AutoPilot RTL data for the Sorting Algorithm.


The set-up and learning curve for AutoPilot:

It took us less than 1 day to set up the AutoPilot environment for the first
time, and only several minutes for the follow-on designs.

In the early stages, our design methodology was an iterative loop between
constraining AutoPilot synthesis and results analysis with its built-in
Control Data Flow Graph (CDFG). Later, we started with the targeted micro
architecture in mind and then we created the C/C++ code plus corresponding
synthesis directives. So it was important to our implementation to be
familiar with AutoPilot's directives. Here's our ramp-up for AutoESL:

- 1 to 2 days for onsite training on AutoPilot: basics, methodologies,
tool setup, hands-on tutorials.

- 1 to 2 weeks to begin with your own design and learn by doing. In
our case, we did this with our RankBoost project.

- 3 to 4 additional weeks to try out AutoPilot's other advanced features
like: simulation, integration with SoPC, customized IP, floating point,
advanced language optimization, etc. This process may take some time.
I prefer a "learning by doing" style here too, because some advanced
features are only needed in special cases. We did this with our
Sorting Algorithm project.

So, overall a hardware designer experienced in RTL simulation and synthesis
should expect to spend 6 to 7 weeks getting ramped up on AutoESL. Much of
this depends on how deeply they want to learn its advanced features:

- Controls. Our users control results in several ways, including adding
synthesis directives to control pipelining, interfaces, and memory
using Tcl commands or pragmas or the GUI.

- GUI. AutoPilot has a GUI for users to understand the generated logic.
For example, it has a schedule viewer to visualize the scheduling
result and a report view so you can easily compare QoRs for different
implementations.

- Floating point synthesis. We used single-precision float type and
floating point adders for RankBoost; AutoPilot fully supports these
standard single- and double-precision floating point data types for
Altera platforms. We could directly synthesize common floating-point
math routines such as square root, exponentiation, logarithm, etc.

- Loop and hierarchical function pipelining. AutoESL's loop pipelining
allows multiple successive iterations of a loop to execute in parallel
by initiating one iteration before the previous one has completed.
This can optimize the design for both loop throughput and latency.

- Power reduction. AutoPilot's optimization also includes various
transformations for power reduction, including Operation Gating, MUX
optimization and reduction, FSM coding, pipeline register gating, clock
gating as well as using given Multi-Vdd assignment. We don't pay much
attention to power consumption with our current FPGAs so we didn't use
this power functionality, but it's an important feature and we would
like to try it out in the near future.

- Interface Synthesis. Our designers use AutoPilot's standard function
parameters to infer the desired inputs and outputs to the environment
rather than hand code any target-specific interface timing behaviors
into our C/C++ source. AutoPilot's interface synthesis converts the
parameter reads and writes into the actual interface accesses. The
direction of the data transfer is inferred from the way a parameter is
used in the function body.

For example, based on the specified communication interfaces in the
platform library, a store operation on a scalar pointer (e.g., *p = x)
can be turned into a direct wire connection, or a FIFO write, or even
a bus write transfer. This helps tremendously to keep our designers
away from the "devil-is-in-the-details" of the target platform and
focus more on developing the functional/algorithmic part of the design.

Currently, AutoPilot supports the following types of interfaces:

- Wire interface,
- Buffer interface,
- Memory interface,
- FIFO interface,
- Bus interface

The user can control the selection of interface with a few pragmas.
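
As a small illustration of the interface synthesis idea, here is a sketch
of a top-level function whose port directions follow from how the
parameters are used. It is not from our design: scale_stream and its
testbench are hypothetical, the pragma spelling is copied from the FIFO
example earlier in this write-up, and treating in-order array accesses as
FIFO reads and writes is an assumption about the tool, not something we
verified. An ordinary C compiler ignores the pragmas (possibly with a
warning), so the same source also runs as a software model.

```c
#include <stdint.h>

/* Hypothetical top function: multiplies `size` samples by `gain`.
 * `in` is only read and `out` is only written, which is the usage
 * pattern from which the data direction would be inferred. */
void scale_stream(const volatile int32_t *in, int32_t gain, int size,
                  volatile int32_t *out)
{
#pragma AUTOPILOT INTERFACE fifo port=in
#pragma AUTOPILOT INTERFACE fifo port=out
    for (int n = 0; n < size; ++n) {
        /* Strictly in-order accesses: one read and one write per sample. */
        out[n] = in[n] * gain;
    }
}

/* C-level testbench for the function above; returns 0 on pass, 1 on fail. */
int scale_stream_testbench(void)
{
    int32_t in[4] = {1, -2, 3, 4};
    int32_t out[4] = {0, 0, 0, 0};

    scale_stream(in, 3, 4, out);

    for (int n = 0; n < 4; ++n) {
        if (out[n] != in[n] * 3)
            return 1;
    }
    return 0;
}
```

The function body contains no interface timing at all; only the two pragma
lines would need to change to request a different interface type.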


AutoPilot's negatives:

- It needs a better user interface for its CDFG.

- It needs better 3rd party tool chain support. It took us a while to
set up the whole tool chain including our ModelSim RTL simulator and
Altera Quartus II FPGA implementation tools.

- AutoESL claims that AutoPilot supports Altera's SOPC builder tool
and Avalon bus interconnects. However, we did not test these.

- It needs better Verilog support. AutoPilot includes some libraries
written only in VHDL, for example a few platform-specific bus
interface adaptors are generated only in VHDL. It would be better
if the Verilog version was generated as well.

AutoESL's technical support was professional and they covered the product,
integration into the design flow, and language. We gave AutoPilot a first
look in 2007 and it's been delivering major features for FPGA-based design
in its recent releases. This tool can produce very acceptable results in
a very short time.

I give AutoPilot a score of 4 out of 5 possible and would strongly recommend
it to others.

AutoPilot's customer in China:

Wow! Even Microsoft uses AutoESL's C synthesis - DeepChip Homepage

 
