Dan Joyce's 16 bug types only found with gate-level simulation

甲六乙

Original link: http://www.deepchip.com/items/0569-01.html

Original: Dan Joyce's 16 bug gate-level simulation

Subject: Dan Joyce's 16 bug types only found with gate-level simulation

The three takeaways I got from Wally Rhines' DVcon'16 keynote were:

  1. Static verification is the fastest growing category in EDA.  Emulation
     closely follows as number 3.  (#2 was not mentioned.)
  2. New kinds of focused verification will continue to be adopted, such as
     reset-domain crossing and constraints verification.  (At Real Intent, we
     see new requirements for focused static solutions at the gate-level also.)
...

Followed by Jim Hogan doing a panel at that DVcon asking:

"Q. Will formal and emulation replace SW simulation tools?"

Followed by Lauro Rizzatti saying:

"As my colleagues have said, complexity is killing simulation. There is good news in the emulation camp, which I have followed for 22 years. Today emulation does most of what simulation does with the exception of timing and multi-value logic. It works 4 to 6 orders of magnitude faster."

And Brian Hunter of Cavium:

"Anytime we find a bug with an emulation platform, we always go back and question what did we do wrong in simulation." ...

    - Prakash on DVcon'16, portable stimulus, & end of simulation

From: "Dan Joyce" <user=danj domain=correctdesigns not calm>

Hi, John,

First off, I want to thank Prakash for writing his DVcon'16 Trip Report.
I love the fact that these critical issues are being discussed and then 
disseminated in such a public way.  I couldn't make that DVcon, but I 
found it disturbing to see how many engineers were nonchalantly drinking
the Jonestown Kool-Aid that formal tools and much faster emulation boxes are
going to replace old school Verilog/VHDL RTL software simulators in chip
design -- and especially with that truly dangerous notion of not bothering
to do gate-level simulations at all -- or even considering just dropping
timing from their Gatesims altogether.

If I had been there at that DVcon, I would have been seriously fighting the
urge to shout down those engineers thinking of skipping gate-level sims.

Why should anyone listen to Dan Joyce about gate-level sims?  I have 25
years in chip design and verification.  I taped out 22 chips in that time,
and only one of those 22 chips, back in 1995, had to be respun.  So my
record is 21 out of 22 -- which is a 95.5% success rate -- something I'm
proud of.  Many of these were huge chips pushing the technology of the day.  
My most recent chip was a 1.25 billion instance 16nm TSMC FinFET design. 

        ----    ----    ----    ----    ----    ----    ----

WHY YOU MUST STILL DO GATE-LEVEL SIMULATIONS (GLS) TODAY

Using gate-level simulations (GLS), I've found both functional and timing
bugs in chips at every stage of chip design -- from the start of early
functional development -- all the way down to subtle yet chip-fatal timing
bugs just 2 days before final tapeout.

Going in, I have to issue a "heads-up" warning: GLS can be an extremely
expensive task that fails to find critical design flaws before chips are
released to manufacturing -- if done wrong.  However if done right I see
the Cost/Benefit of Gatesims getting better today.  Regardless, GLS finds
chip-killing bugs that formal, STA, ABV, lint, and emulation won't even
notice.  You are taking a big risk if you don't do GLS.

That said, let me set the stage...


THAT DAMNED ZERO-DELAY PROBLEM...

The goal of any chip development team is to have processes (lint, LEC, STA,
verification) so solid that GLS will never find any bugs and not be needed.

But GLS does find bugs; most very late in the design process and close to
tape-out.  GLS bugs are often very serious, and tend to cause problems with
no possible workaround.  Even when workarounds are possible, finding and
debugging GLS failures is much easier than debugging silicon in the lab.

It's not uncommon to chase bugs in the lab for months which could have
been found and debugged in a couple hours in GLS.  Although GLS is harder
to debug than RTL, it is much easier than silicon.

GLS bugs exist because practically all chip Verilog/SystemVerilog/VHDL
simulations are done with "ideal world" zero-delay tests that are run on
pre-synthesized Verilog/VHDL/SystemVerilog RTL code.  These sims are
tailored to speed-up simulation runtime performance -- but sacrifice their
ability to catch certain types of bugs.
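
A minimal Python sketch of the kind of bug a zero-delay run hides.  It
assumes a hypothetical circuit y = a AND (NOT a) in which only the inverter
has a propagation delay; the function name and the picosecond values are
illustrative assumptions, not taken from any real design or simulator.

```python
def glitch_width_ps(inverter_delay_ps, rise_time_ps=1000, end_ps=2000):
    """Total time (in ps) that y = a AND (NOT a) sits at 1.

    `a` rises from 0 to 1 at rise_time_ps.  The inverter output follows
    the value `a` had inverter_delay_ps earlier (hypothetical gate delay).
    """
    high_ps = 0
    for t in range(end_ps):
        a = 1 if t >= rise_time_ps else 0
        # Value of `a` as seen through the delayed inverter.
        a_past = 1 if (t - inverter_delay_ps) >= rise_time_ps else 0
        not_a = 1 - a_past
        high_ps += a & not_a
    return high_ps


# Zero-delay view (RTL style): the output never pulses.
print(glitch_width_ps(0))    # 0
# Same logic with a 120 ps inverter delay: a 120 ps glitch appears.
print(glitch_width_ps(120))  # 120
```

With zero delay the two AND inputs always disagree, so the output never
moves.  Once the inverter has a delay, the same logic produces a short
pulse that only a simulation with timing can show.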

Additionally, many steps are performed on the design after the RTL is
verified to produce the gate netlist that is used to manufacture the 
silicon.  Some of these post-RTL steps include synthesis to gates; place
and route; power insertion; adding logic for Built In Self Test (BIST) and
Built In Self Repair (BISR); and insertion of Design For Testability (DFT)
logic.  RTL tests won't find bugs from any of those steps since the logic
added during those steps wasn't in the original RTL to begin with!

In addition GLS finds chip timing issues missed by Primetime/Tempus STA
due to bad timing constraints.  The gate-level model is much closer to
the real silicon design -- testing at gates frequently finds timing bugs
in asynchronous logic that cannot be found in RTL.
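
A minimal sketch of how a bad constraint can hide a real violation from
STA.  It assumes a simplified setup-slack formula that ignores clock skew
and uncertainty, and a hypothetical path wrongly constrained as a 2-cycle
path.  All names and numbers are illustrative assumptions.

```python
def setup_slack_ps(period_ps, data_delay_ps, setup_ps, capture_cycles=1):
    """Simplified setup slack for a flop-to-flop path, in integer ps.

    capture_cycles is the number of clock periods the path is allowed:
    1 for a normal path, N for a path constrained as an N-cycle path.
    Clock skew and uncertainty are deliberately ignored in this sketch.
    """
    return capture_cycles * period_ps - setup_ps - data_delay_ps


# Hypothetical path: 2000 ps clock, 2600 ps of logic delay, 100 ps setup.
PERIOD_PS, DATA_DELAY_PS, SETUP_PS = 2000, 2600, 100

# What STA reports if the path was (wrongly) constrained as 2 cycles.
slack_as_constrained = setup_slack_ps(
    PERIOD_PS, DATA_DELAY_PS, SETUP_PS, capture_cycles=2)

# What happens if the destination flop really captures every cycle.
slack_in_silicon = setup_slack_ps(
    PERIOD_PS, DATA_DELAY_PS, SETUP_PS, capture_cycles=1)

print(slack_as_constrained)  # 1300
print(slack_in_silicon)      # -700
```

The STA report shows positive slack because it trusts the constraint.  A
gate-level run with back-annotated delays does not read the constraint, so
it can expose the negative slack as a setup failure.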

SPECIAL WARNING: GLS has gotten harder as today's chip designs have gotten
bigger; but at the same time the need for GLS is greater than ever before.
The amount of logic in your chip that does not exist in the RTL design has
increased.  On top of that, the complexity of 100's of unrelated clocks has
dramatically increased the risk of your final physical blocks not closing
timing in the last few years.  Finally, the man-hours and calendar time
needed to go from tapeout to silicon have increased -- making the cost of a
gate-level bug much higher -- and making it more important than ever to get
back good working silicon on the 1st pass.


DAN JOYCE'S 16 BUG TYPES ONLY FOUND BY GATE-LEVEL SIMULATION

The following is the list of chip design bugs that can only be found cheaply
by using GLS.  Keep in mind, I'm in gate-sims.  This is at the tail end of
the project where the design team tells me "this chip is ready!  We're good
to go!"; and then I've caught at least 1 of these chip-killer bugs after that
point.

  1. Timing Bugs.  Using incorrect constraints can actually cause your DC or
     Genus synthesis tool to create timing bugs -- and then those same
     bad constraints are used to run Primetime or Tempus STA.  So the
     same constraint error will cause both the bug and the bad check that
     will miss detecting that bug.

  2. Linting Bugs.  Lint tools like Real Intent or Spyglass look through your
     source Verilog/SystemVerilog/VHDL RTL for code that produces bad gates.
     The bad news is that some lint tools (I won't say which ones) have such
     a poor signal-to-noise ratio that they produce too many warnings that
     need to be reviewed and waived by hand.  The human error of waiving the
     wrong lint warning creates a difference between RTL and gate
     functionality.  And worse, LEC will not find these waived bugs since
     LEC starts with the same wrong gate functionality.

  3. BFM-masked Bugs.  RTL verification typically uses BFMs (Bus Functional
     Models) to simplify test generation and checking of results.  BFMs that
     incorrectly model part of your DUT can cause bugs to be missed.  Your
     GLS must do some tests driven by gate-level cores instead of just
     internal BFMs.

  4. IP Bugs.  You can have 3rd party IP that works perfectly in RTL, but
     quietly contains timing/functional/ifdef/pragma bugs that can only be
     caught in GLS.  These quiet IP bugs can kill a chip.

  5. Clocking Bugs.  Real-life glitches, over-max frequencies, and duty
     cycle bugs stay quiet in your RTL and are often only seen in GLS tests
     with full SDF timing.

  6. Reset Timing Bugs.  These are typically clock zones where the reset
     is released at different clock edges on their D-FF's.  These are also
     called initialization bugs.  They can only be detected in gate
     simulations with delays.

  7. `ifdef Bugs.  These come from `ifdefs in your code where RTL simulation
     uses one set of `ifdefs and synthesis uses a different set.  LEC does
     not catch this and you won't suspect anything until you run GLS.

  8. Dynamic Frequency Change Clock Bugs.  Often high performance, yet low
     power chips must be able to switch frequencies without quiescing their
     logic.  This logic can only be verified with GLS with full timing
     to detect these clock issues.

  9. Multi-Cycle Path (MCP) Bugs.  For example, you have a chip with a
     12-cycle MCP in it.

       a) your source signals must be held stable for the full 12-cycle
          period,

     and

       b) your destination flops must only capture the results at the
          12th cycle -- and not earlier nor later.

     If you fail to do "a" or "b" above, it will create an MCP set-up/hold
     issue that causes metastability ("X's") on your final output flop.

 10. Force/Release Bugs.  Often in testbenches to get past some bottleneck,
     code like this will be used:
     
                   force load_fifo_name_here = 1'b1;
                   force ecc_error = 1'b0;
                   force aix_bus = 32'hFFFFFFFF;

     What happens is the verification guys forget to remove or "release"
     all or some of these "force" commands -- causing tests to pass and
     their bugs to go undetected.  GLS throws up compile errors for most
     internal "forces" when signals are renamed through synthesis; with
     the few "forces" remaining to be reviewed and removed if possible.

 11. BIST/BISR Bugs.  If your design's original Verilog/VHDL source RTL
     does not include BISR or BIST logic, bugs involving the BIST/BISR
     logic can only be found in GLS.

 12. DFT Bugs.  Usually RTL does not include DFT logic so those bugs in
     the DFT logic can only be found in GLS.

 13. Power Insertion Bugs.  Usually your RTL is not power inserted.  UPF
     testing is an attempt to find these in RTL, but since most power
     logic is not included in RTL, the only true test of power logic can
     only be done with a "power aware" gate-level simulation.  This is
     where the simulation models of your gate-level library cells only work
     when your power-enabled netlist is connected and driven correctly
     by your clamp cells, voltage translators, and power islands.

 14. Delta-Delay Race Conditions.  Occasionally RTL is run with #0 or #1
     or blocking and non-blocking assignments that include an RTL delta delay
     race condition.  These are simulation artifacts.  If your source RTL
     simulates wrong, people will design their chip to "pass" wrong.  They
     are assuming everything is OK.  But their final Gates will work
     differently than their RTL -- and this is a rare case where only GLS
     will detect the real silicon behavior mismatch.

 15. LEC Holes.  All LEC tools work by doing a logical equivalence between
     two gate-level models.  If you're doing LEC between your RTL and your
     Gates of the same design, the LEC tool starts by doing a synthesis of
     your RTL to a simple gate implementation.  If your RTL does not 100%
     synthesize to "RTL" gates that 100% match your RTL's functionality, your
     LEC run will be comparing your Gate netlist with an "RTL" gate model
     that is already broken.  You can get incorrect LEC results from this.

 16. LEC Waivers.  Large designs are divided into pieces to allow LEC to
     handle them within a reasonable time.  Any tool mistake in this cutting
     process or any incorrect waivers can result in functional differences
     between RTL and Gates.  Only GLS detects this.
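
For the Force/Release case in item 10, leftover "force" commands can also
be flagged before GLS with a simple text scan.  The sketch below is only a
rough illustration: the helper name unreleased_forces and the sample
testbench are hypothetical, the match is purely textual (no scopes, no
macros, no block comments), and it is no substitute for the compile errors
GLS gives you.

```python
import re

# Textual patterns only: "force <name> =" and "release <name>;".
FORCE_RE = re.compile(r"\bforce\s+([A-Za-z_][\w.$\[\]:]*)\s*=")
RELEASE_RE = re.compile(r"\brelease\s+([A-Za-z_][\w.$\[\]:]*)\s*;")


def unreleased_forces(source):
    """Return names that are forced but never released in `source`.

    Hypothetical helper: strips // line comments, then compares the
    forced names against the released names, keeping first-seen order.
    """
    text = re.sub(r"//[^\n]*", "", source)
    released = set(RELEASE_RE.findall(text))
    leftover = []
    for name in FORCE_RE.findall(text):
        if name not in released and name not in leftover:
            leftover.append(name)
    return leftover


# Hypothetical testbench fragment used only to exercise the helper.
SAMPLE_TB = """
initial begin
  force load_fifo_name_here = 1'b1;
  force ecc_error = 1'b0;   // temporary workaround
  #100;
  release ecc_error;
  // force aix_bus = 32'hFFFFFFFF;
end
"""

print(unreleased_forces(SAMPLE_TB))  # ['load_fifo_name_here']
```

A scan like this only narrows the review list.  It cannot tell whether a
remaining "force" is intentional, so each hit still needs a human look.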

With all this warning, I keep expecting my next chip to be the one where GLS
does not find a chip-killing bug that formal, STA, ABV, lint, and emulation
didn't catch.  At 22 chips it still hasn't happened.

    - Dan Joyce
      Correct Designs, Inc.                      Austin, TX

P.S. What follows are details of how I found these 16 bug types using GLS.