Subject: Dan Joyce's 16 bug types only found with gate-level simulation
The three takeaways I got from Wally Rhines' DVcon'16 keynote were:
- Static verification is the fastest-growing category in EDA. Emulation closely follows as number 3. (#2 was not mentioned)
- New kinds of focused verification will continue to be adopted, such as reset-domain crossing and constraints verification. (At Real Intent, we see new requirements for focused static solutions at the gate-level also.)
- ...
Followed by Jim Hogan doing a panel at that DVcon asking:
"Q. Will formal and emulation replace SW simulation tools?"
Followed by Lauro Rizzatti saying:
"As my colleagues have said, complexity is killing simulation. There is good news in the emulation camp, which I have followed for 22 years. Today emulation does most of what simulation does with the exception of timing and multi-value logic. It works 4 to 6 orders of magnitude faster."
And Brian Hunter of Cavium:
"Anytime we find a bug with an emulation platform, we always go back and question what did we do wrong in simulation." ...
- Prakash on DVcon'16, portable stimulus, & end of simulation
From: "Dan Joyce" <user=danj domain=correctdesigns not calm>

Hi, John,

First off, I want to thank Prakash for writing his DVcon'16 Trip Report. I love the fact that these critical issues are being discussed and then disseminated in such a public way.

I couldn't make that DVcon, but I found it disturbing to see how many engineers were nonchalantly drinking
the Jonestown Kool-Aid that formal tools and much faster emulation boxes are going to replace old school Verilog/VHDL RTL software simulators in chip design -- and especially with that truly dangerous notion of not bothering to do gate-level simulations at all -- or even considering just dropping timing from their Gatesims altogether.

If I had been there at that DVcon, I would have been seriously fighting the urge to shout down those engineers thinking of skipping gate-level sims.

Why should anyone listen to Dan Joyce about gate-level sims?

I have 25 years in chip design and verification. I taped-out 22 chips in that time and only one of those 22 chips, back in 1995, had to be respun. So my record is 21 out of 22 -- which is a 95.5% success rate -- something I'm proud of. Many of these were huge chips pushing the technology of the day. My most recent chip was a 1.25 billion instance 16nm TSMC FinFET design.

---- ---- ---- ---- ---- ---- ----

WHY YOU MUST STILL DO GATE-LEVEL SIMULATIONS (GLS) TODAY

Using gate-level simulations (GLS), I've found both functional and timing bugs in chips at every stage of chip design -- from the start of early functional development -- all the way down to subtle yet chip-fatal timing bugs just 2 days before final tapeout.

Going in, I have to issue a "heads-up" warning: GLS can be an extremely expensive task that fails to find critical design flaws before chips are released to manufacturing -- if done wrong. However, if done right, I see the Cost/Benefit of Gatesims getting better today.

Regardless, GLS finds chip-killing bugs that formal, STA, ABV, lint, and emulation won't even notice. You are taking a big risk if you don't do GLS. That said, let me set the stage...

THAT DAMNED ZERO-DELAY PROBLEM...

The goal of any chip development team is to have processes (lint, LEC, STA, verification) so solid that GLS will never find any bugs and not be needed. But GLS does find bugs; most very late in the design process and close to tape-out.
GLS bugs are often very serious, and tend to cause problems with no possible workaround. Even when workarounds are possible, finding and debugging GLS failures is much easier than debugging silicon in the lab. It's not uncommon to chase bugs in the lab for months which could have been found and debugged in a couple hours in GLS. Although GLS is harder to debug than RTL, it is much easier than silicon.

GLS bugs exist because practically all chip Verilog/SystemVerilog/VHDL simulations are done with "ideal world" zero-delay tests that are run on pre-synthesized Verilog/VHDL/SystemVerilog RTL code. These sims are tailored to speed-up simulation runtime performance -- but sacrifice their ability to catch certain types of bugs.

Additionally, many steps are performed on the design after the RTL is verified to produce the gate netlist that is used to manufacture the silicon. Some of these post-RTL steps include synthesis to gates; place and route; power insertion; adding logic for Built In Self Test (BIST) and Built In Self Repair (BISR); and insertion of Design For Testability (DFT) logic. RTL tests won't find bugs from any of those steps since the logic added during those steps wasn't in the original RTL to begin with!

In addition, GLS finds chip timing issues missed by Primetime/Tempus STA due to bad timing constraints. The gate-level model is much closer to the real silicon design -- testing at gates frequently finds timing bugs in asynchronous logic that cannot be found in RTL.

SPECIAL WARNING: GLS has gotten harder as today's chip designs have gotten bigger; but at the same time the need for GLS is greater than ever before. The amount of logic in your chip that does not exist in the RTL design has increased. On top of that, the complexity of 100's of unrelated clocks has dramatically increased the risk of your final physical blocks not closing timing in the last few years.
Finally, the man-hours and calendar time needed to go from tapeout to silicon have increased -- making the cost of a gate-level bug much higher -- and making it even more important than ever to get back good working silicon on the 1st pass.

DAN JOYCE'S 16 BUG TYPES ONLY FOUND BY GATE-LEVEL SIMULATION

The following is the list of chip design bugs that can only be found cheaply by using GLS. Keep in mind, I'm in gate-sims. This is at the tail end of the project where the design team tells me "this chip is ready! We're good to go!"; and then I've caught at least 1 of these chip-killer bugs after that point.

1. Timing Bugs. Using incorrect constraints actually causes your DC or Genus synthesis tool to create timing bugs -- and then those same bad constraints are used to run Primetime or Tempus STA. So the same constraint error will cause both the bug and the bad check that will miss detecting that bug.

2. Linting Bugs. Lint tools like Real Intent or Spyglass look for source Verilog/SystemVerilog/VHDL RTL code that produces bad gates. The bad news is that some lint tools (I won't say which ones) have a poor signal-to-noise ratio that produces too many warnings that need to be reviewed and waived by hand. The human error of waiving the wrong lint warning creates a difference between RTL and gate functionality. And worse, LEC will not find these waivered bugs since LEC starts with the same wrong gate functionality.

3. BFM-masked Bugs. RTL verification typically uses BFMs (Bus Functional Models) to simplify test generation and checking of results. BFMs that incorrectly model part of your DUT can cause bugs to be missed. Your GLS must do some tests driven by gate-level cores instead of just internal BFMs.

4. IP Bugs. You can have 3rd party IP that works perfectly in RTL, but quietly contains timing/functional/ifdef/pragma bugs that can only be caught in GLS. These quiet IP bugs can kill a chip.

5. Clocking Bugs.
Quiet real-life glitches, over-max frequencies, or duty-cycle bugs in your RTL are often only seen in GLS tests with full SDF timing.

6. Reset Timing Bugs. These are typically clock zones where the reset is released at different clock edges on their D-FF's. These are also called initialization bugs. They can only be detected in gate simulations with delays.

7. `ifdef Bugs. These come from `ifdefs in your code where RTL simulation uses a different set of `ifdefs than the set synthesis used. LEC does not catch this and you won't suspect anything until you run GLS.

8. Dynamic Frequency Change Clock Bugs. Often high performance, yet low power chips must be able to switch frequencies without quiescing their logic. This logic can only be verified by GLS with full timing to detect these clock issues.

9. Multi-Cycle Path (MCP) Bugs. For example, you have a chip with a 12-cycle MCP in it.

      a) your source signals must be held stable for the full 12-cycle period, and
      b) your destination flops must only capture the results at the 12th cycle -- and not earlier nor later.

   If you fail to do "a" or "b" above, it will create an MCP set-up/hold issue that causes metastability ("X's") on your final output flop.

10. Force/Release Bugs. Often in testbenches to get past some bottleneck, code like this will be used:

      force load_fifo_name_here = 1'b1;
      force ecc_error = 1'b0;
      force aix_bus = 32'hFFFFFFFF;

   What happens is the verification guys forget to remove or "release" all or some of these "force" commands -- causing tests to pass and their bugs to go undetected. GLS throws up compile errors for most internal "forces" when signals are renamed through synthesis; the few "forces" that remain must then be reviewed and removed if possible.

11. BIST/BISR Bugs. If your design's original Verilog/VHDL source RTL does not include BISR or BIST logic, bugs involving the BIST/BISR logic can only be found in GLS.

12. DFT Bugs.
Usually RTL does not include DFT logic, so bugs in the DFT logic can only be found in GLS.

13. Power Insertion Bugs. Usually your RTL is not power inserted. UPF testing is an attempt to find these in RTL, but since most power logic is not included in RTL, the only true test of power logic is a "power aware" gate-level simulation. This is where the simulation models of your gate-level library cells only work when your power-enabled netlist is connected and driven correctly by your clamp cells, voltage translators, and power islands.

14. Delta-Delay Race Conditions. Occasionally RTL is run with #0 or #1 or blocking and non-blocking assignments that include an RTL delta delay race condition. These are simulation artifacts. If your source RTL simulates wrong, people will design their chip to "pass" wrong. They are assuming everything is OK. But their final Gates will work differently than their RTL -- and this is a rare case where only GLS will detect the real silicon behavior mismatch.

15. LEC Holes. All LEC tools work by doing a logical equivalence between two gate-level models. If you're doing LEC between your RTL and your Gates of the same design, the LEC tool starts by doing a synthesis of your RTL to a simple gate implementation. If your "RTL" gates do not 100% match your RTL's functionality, your LEC run will be comparing your Gate netlist with an "RTL" gate model that is already broken. You can get incorrect LEC results from this.

16. LEC Waivers. Large designs are divided into pieces to allow LEC to handle them within a reasonable time. Any tool mistake in this cutting process or any incorrect waivers can result in functional differences between RTL and Gates. Only GLS detects this.

With all this warning, I keep expecting my next chip to be the one where GLS does not find a chip-killing bug that formal, STA, ABV, lint, and emulation didn't catch. At 22 chips it still hasn't happened.
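
The force/release problem in item 10 can be illustrated with a minimal sketch. It is an assumption-laden toy, not code from any of the chips described: the module name and the ecc_error_src signal are hypothetical, and only the ecc_error name is borrowed from the example in the list.

```verilog
// Toy testbench: a forced net hides a real error until it is released.
// All names except ecc_error are hypothetical.
module force_release_demo;
  reg  ecc_error_src = 1'b1;        // stand-in for DUT logic reporting an ECC error
  wire ecc_error     = ecc_error_src;

  initial begin
    force ecc_error = 1'b0;         // testbench shortcut to get past a bottleneck
    #10 $display("forced:   ecc_error=%b", ecc_error);
    release ecc_error;              // if this line is forgotten, the error stays masked
    #1  $display("released: ecc_error=%b", ecc_error);
    $finish;
  end
endmodule
```

While the force is active the net reads 0 no matter what drives it; once released, the net should return to the value of its real driver. Leaving out the release keeps every later check on ecc_error blind to the underlying error, which is the failure mode item 10 describes.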
- Dan Joyce
  Correct Designs, Inc.
  Austin, TX

P.S. What follows are details of how I found these 16 bug types using GLS.
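
Before those details, the delta-delay race in item 14 can be shown with a small sketch. This is an illustrative toy under stated assumptions, not the author's code: the module and signal names are hypothetical, and it shows only the blocking versus non-blocking variant of the race.

```verilog
// Toy two-stage pipeline, written twice. All names are hypothetical.
module race_demo;
  reg clk = 1'b0;
  reg d   = 1'b0;
  reg q1_race = 1'b0, q2_race = 1'b0;   // blocking assignments: order-dependent
  reg q1_ok   = 1'b0, q2_ok   = 1'b0;   // non-blocking assignments: deterministic

  // RACE: both blocks wake on the same clock edge. Whether q2_race sees the
  // old or the new q1_race depends on the order the simulator runs them in.
  always @(posedge clk) q1_race = d;
  always @(posedge clk) q2_race = q1_race;

  // SAFE: right-hand sides are sampled before any update is applied,
  // so q2_ok always receives the previous value of q1_ok.
  always @(posedge clk) q1_ok <= d;
  always @(posedge clk) q2_ok <= q1_ok;

  always #5 clk = ~clk;

  initial begin
    #2  d = 1'b1;                       // change the input before the first rising edge
    #10 $display("race: q1=%b q2=%b   safe: q1=%b q2=%b",
                 q1_race, q2_race, q1_ok, q2_ok);
    $finish;
  end
endmodule
```

After the first rising edge the non-blocking pair behaves like two real flops, with q2_ok still holding its old value. The blocking pair may or may not match that, depending on the simulator's evaluation order, so a testbench can be tuned to "pass" against behavior the synthesized gates will not reproduce.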