Compiling an application for use in highly radioactive environments

p15097962069

Tags: c++ c gcc embedded fault-tolerance

Original link: https://oldbug.net/q/2UWYV/Compiling-an-application-for-use-in-highly-radioactive-environments

Translated from: Compiling an application for use in highly radioactive environments

We are compiling an embedded C/C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and crashes more often than we would like. The hardware is designed for this environment, and our application has run on this platform for several years.

Are there changes we can make to our code, or compile-time improvements that can be made, to identify/correct soft errors and memory corruption caused by single event upsets? Have any other developers had success in reducing the harmful effects of soft errors on a long-running application?

#1

Reference: https://stackoom.com/question/2UWYV/编译用于高放射性环境的应用程序

#2

NASA has a paper on radiation-hardened software. It describes three main tasks:

Regular monitoring of memory for errors, then scrubbing out those errors,
robust error recovery mechanisms, and
the ability to reconfigure if something no longer works.

Note that the memory scan rate should be frequent enough that multi-bit errors rarely occur, as most ECC memory can recover from single-bit errors, not multi-bit errors.
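The scrubbing idea can be imitated in software when no hardware ECC is available. The sketch below is not what the NASA paper prescribes; it assumes the critical data is kept in three copies (the names `scrub_tmr` and `majority3` are made up for illustration) and repairs any word that disagrees with the bitwise majority. Run often enough, it fixes a single upset before a second one can hit the same word in another copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise majority of three words: each result bit is the value held by
 * at least two of the three copies. */
static uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Walk three copies of the same buffer, repair any word that disagrees
 * with the majority, and return how many words needed repair.
 * volatile stops the compiler from assuming the copies are identical. */
size_t scrub_tmr(volatile uint32_t *a, volatile uint32_t *b,
                 volatile uint32_t *c, size_t words)
{
    size_t repaired = 0;
    for (size_t i = 0; i < words; i++) {
        uint32_t va = a[i], vb = b[i], vc = c[i];
        uint32_t good = majority3(va, vb, vc);
        if (va != good || vb != good || vc != good) {
            a[i] = good;
            b[i] = good;
            c[i] = good;
            repaired++;
        }
    }
    return repaired;
}
```

A nonzero return value is worth logging: a rising repair count suggests the scan interval is too long for the upset rate.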

Robust error recovery includes control flow transfer (typically restarting a process at a point before the error), resource release, and data restoration.

Their main recommendation for data restoration is to avoid the need for it, by treating intermediate data as temporary, so that restarting before the error also rolls back the data to a reliable state. This sounds similar to the concept of "transactions" in databases.
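One way to read the "temporary intermediate data" advice is a commit-on-success step, sketched below. The `state_t` fields, `run_step`, and the two example steps are all hypothetical; the point is only that the known-good state is never modified until a step has finished and reported success.

```c
#include <stdint.h>

/* Hypothetical application state; the fields are placeholders. */
typedef struct {
    int32_t position;
    int32_t velocity;
} state_t;

static state_t committed;   /* last known-good state */

/* A step works on a scratch copy and returns 0 on success, nonzero when
 * one of its internal checks fails. */
typedef int (*step_fn)(state_t *scratch);

/* Run one step on a scratch copy. The committed state is overwritten
 * only when the step reports success, so a failed step can simply be
 * retried from the same starting point with nothing to undo. */
int run_step(step_fn step)
{
    state_t scratch = committed;
    if (step(&scratch) != 0)
        return -1;
    committed = scratch;
    return 0;
}

/* Example steps, for illustration only. */
int step_ok(state_t *s)   { s->position += s->velocity; return 0; }
int step_fail(state_t *s) { s->position = -9999; return 1; }
```

After `run_step(step_fail)` returns -1, `committed` still holds the values it had before the call.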

They discuss techniques particularly suitable for object-oriented languages such as C++. For example:

Software-based ECCs for contiguous memory objects
Programming by Contract: verifying preconditions and postconditions, then checking the object to verify it is still in a valid state.
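Both ideas can be combined in plain C, as in the rough sketch below. The `counter_t` type and its checksum are invented for illustration, and the checksum only detects corruption; a real software ECC would also correct it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical object: a bounded counter protected by a checksum. */
typedef struct {
    uint32_t value;
    uint32_t limit;
    uint32_t check;   /* checksum over value and limit */
} counter_t;

/* Simple checksum (an assumption; a CRC or ECC would be stronger).
 * The rotation keeps it sensitive to value and limit being swapped. */
static uint32_t counter_sum(const counter_t *c)
{
    uint32_t v = c->value;
    return ((v << 7) | (v >> 25)) ^ c->limit ^ 0xA5A5A5A5u;
}

/* Invariant: checksum matches and value is within its limit. */
bool counter_valid(const counter_t *c)
{
    return c->check == counter_sum(c) && c->value <= c->limit;
}

void counter_init(counter_t *c, uint32_t limit)
{
    c->value = 0;
    c->limit = limit;
    c->check = counter_sum(c);
}

/* Returns false without modifying the object when the precondition
 * (valid object, room to increment) does not hold. */
bool counter_increment(counter_t *c)
{
    if (!counter_valid(c) || c->value == c->limit)   /* precondition */
        return false;
    c->value++;
    c->check = counter_sum(c);
    return counter_valid(c);                          /* postcondition */
}
```

Any single flipped bit in `value`, `limit`, or `check` makes `counter_valid` return false, so the caller can fall back to a recovery path instead of acting on corrupt data.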

And, as it happens, NASA has used C++ for major projects such as the Mars Rover.

C++ class abstraction and encapsulation enabled rapid development and testing among multiple projects and developers.

They avoided certain C++ features that could create problems:

Exceptions
Templates
Iostream (no console)
Multiple inheritance
Operator overloading (other than new and delete)
Dynamic allocation (used a dedicated memory pool and placement new to avoid the possibility of system heap corruption).
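The last item describes a C++ technique (a dedicated pool plus placement new). A rough C analogue, with made-up names and sizes, is a statically reserved fixed-block pool. It is not the allocator NASA used, only an illustration of keeping allocation away from the system heap.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS     8u
#define POOL_BLOCK_SIZE 32u

/* Statically reserved storage: the pool never touches the system heap.
 * The union forces alignment suitable for any ordinary object. */
typedef union {
    max_align_t align;
    uint8_t bytes[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t pool_storage[POOL_BLOCKS];
static uint8_t pool_used[POOL_BLOCKS];

/* Return a free block, or NULL when the request is too large or the
 * pool is exhausted. */
void *pool_alloc(size_t size)
{
    if (size > POOL_BLOCK_SIZE)
        return NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_used[i]) {
            pool_used[i] = 1;
            return pool_storage[i].bytes;
        }
    }
    return NULL;
}

/* Release a block. Pointers that do not address the start of a pool
 * block, and blocks that are already free, are rejected with -1. */
int pool_free(void *p)
{
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (p == (void *)pool_storage[i].bytes) {
            if (!pool_used[i])
                return -1;
            pool_used[i] = 0;
            return 0;
        }
    }
    return -1;
}
```

Because `pool_free` validates its argument against the known block addresses, a corrupted pointer is rejected rather than damaging the bookkeeping.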

#3

You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: write a sort that correctly sorts its input when a constant number of comparisons will fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).

A place to start reading is Huang and Abraham's 1984 paper "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphic encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
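The core of the checksum-matrix idea can be shown on a small integer example. In this sketch (fixed size `N`, made-up function names, values assumed small enough not to overflow), A gets an extra row of column sums and B an extra column of row sums, so their product carries its own row and column checksums.

```c
#include <stdbool.h>
#include <stdint.h>

#define N 3

/* C = A * B, computed as a full-checksum matrix: row N of the result
 * holds column sums and column N holds row sums. */
void abft_multiply(int32_t a[N][N], int32_t b[N][N],
                   int32_t c[N + 1][N + 1])
{
    int32_t ac[N + 1][N];   /* A with an extra row of column sums */
    int32_t br[N][N + 1];   /* B with an extra column of row sums */

    for (int j = 0; j < N; j++) {
        ac[N][j] = 0;
        for (int i = 0; i < N; i++) {
            ac[i][j] = a[i][j];
            ac[N][j] += a[i][j];
        }
    }
    for (int i = 0; i < N; i++) {
        br[i][N] = 0;
        for (int j = 0; j < N; j++) {
            br[i][j] = b[i][j];
            br[i][N] += b[i][j];
        }
    }
    for (int i = 0; i <= N; i++) {
        for (int j = 0; j <= N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += ac[i][k] * br[k][j];
        }
    }
}

/* Recompute every row and column sum of the N x N data part. Returns
 * true when all match. When one row and one column mismatch, their
 * intersection is the suspect element. */
bool abft_check(int32_t c[N + 1][N + 1], int *bad_row, int *bad_col)
{
    *bad_row = -1;
    *bad_col = -1;
    for (int i = 0; i < N; i++) {
        int32_t sum = 0;
        for (int j = 0; j < N; j++)
            sum += c[i][j];
        if (sum != c[i][N])
            *bad_row = i;
    }
    for (int j = 0; j < N; j++) {
        int32_t sum = 0;
        for (int i = 0; i < N; i++)
            sum += c[i][j];
        if (sum != c[N][j])
            *bad_col = j;
    }
    return *bad_row < 0 && *bad_col < 0;
}
```

Once a single bad element is located, it can be recomputed from its row checksum minus the other entries in that row.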

A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".

#4

Here are some thoughts and ideas:

Use ROM more creatively.

Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
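As a small illustration of a table that can live in read-only memory, the sketch below replaces a bit-counting loop with a `static const` look-up table. The names are made up. With GCC, a `const` object with static storage normally ends up in `.rodata`, but whether that section is mapped to ROM depends on the linker script, so the map file or `objdump` output is still worth checking.

```c
#include <stdint.h>

/* Parity (1 = odd number of set bits) of every 4-bit value. Declared
 * static const so the toolchain can place it in a read-only section. */
static const uint8_t nibble_parity[16] = {
    0, 1, 1, 0, 1, 0, 0, 1,
    1, 0, 0, 1, 0, 1, 1, 0
};

/* Parity of a byte from two table lookups instead of a bit loop. */
uint8_t byte_parity(uint8_t x)
{
    return (uint8_t)(nibble_parity[x & 0x0Fu] ^ nibble_parity[x >> 4]);
}
```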

Use your best RAM for the stack.

SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.

Implement timer-tick and watchdog timer routines.

You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.
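The progress-counter arrangement might look like the sketch below. The function names and the three-tick tolerance are assumptions, and kicking the actual hardware watchdog is left to the caller because that part is device specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Incremented by the main loop each time it completes a full pass. */
static volatile uint32_t progress_counter;

static uint32_t last_seen;      /* value observed at the previous check */
static uint32_t stalled_ticks;  /* consecutive ticks without progress */

#define MAX_STALLED_TICKS 3u    /* assumed tolerance before giving up */

/* Called from the main loop. */
void report_progress(void)
{
    progress_counter++;
}

/* Called from the timer tick. Returns true while the system looks
 * healthy; the caller should kick the hardware watchdog only then, so a
 * stalled main loop eventually leads to a watchdog reset. */
bool sanity_check_tick(void)
{
    uint32_t now = progress_counter;
    if (now != last_seen) {
        last_seen = now;
        stalled_ticks = 0;
        return true;
    }
    stalled_ticks++;
    return stalled_ticks < MAX_STALLED_TICKS;
}
```

Keeping the check this small also fits the later point about probabilities: the fewer bytes the checker occupies, the less likely it is to be hit itself.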

Implement error-correcting codes in software.

You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
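As one concrete example of such redundancy, here is a sketch of the classic Hamming(7,4) code, which stores 4 data bits in 7 and corrects any single flipped bit. The function names are made up, and the answer does not name a specific code; this is just a well-known small one. Two flipped bits in the same codeword would be miscorrected, which is the usual limit of this code.

```c
#include <stdint.h>

/* Encode 4 data bits into a 7-bit Hamming codeword.
 * Bit i of the result (0-based) is codeword position i + 1; positions
 * 1, 2 and 4 hold parity, positions 3, 5, 6 and 7 hold data. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d0 = nibble & 1u;
    uint8_t d1 = (nibble >> 1) & 1u;
    uint8_t d2 = (nibble >> 2) & 1u;
    uint8_t d3 = (nibble >> 3) & 1u;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a codeword, correcting at most one flipped bit.
 * *corrected is set to 1 when a bit had to be repaired, else 0. */
uint8_t hamming74_decode(uint8_t code, int *corrected)
{
    /* The XOR of the positions of all set bits is 0 for a valid
     * codeword and equals the position of a single flipped bit. */
    uint8_t syndrome = 0;
    for (uint8_t pos = 1; pos <= 7; pos++)
        if ((code >> (pos - 1)) & 1u)
            syndrome ^= pos;

    *corrected = (syndrome != 0);
    if (syndrome != 0)
        code ^= (uint8_t)(1u << (syndrome - 1));

    return (uint8_t)(((code >> 2) & 1u) |
                     (((code >> 4) & 1u) << 1) |
                     (((code >> 5) & 1u) << 2) |
                     (((code >> 6) & 1u) << 3));
}
```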

Remember the caches.

Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM then you could regularly read and re-write critical data to make sure it stays in cache and bring RAM back into line.

Use page-fault handlers cleverly.

If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)

Use assembly language for critical things (which could be everything).

With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.

Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.

If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.

Remember it is a game of probabilities.

A commenter said:

Every routine you write to catch errors will be subject to failing itself from the same cause.

While this is true, the chance of errors in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.

Use redundant hardware.

Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.
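A three-way vote is small enough to sketch directly. The `vote_t` type and its return codes are invented for illustration; how the three results reach the voter (shared bus, mailbox, and so on) is outside the sketch.

```c
#include <stdint.h>

/* Outcome of comparing three redundant results. */
typedef struct {
    uint32_t value;   /* agreed value; meaningful only when faulty != -2 */
    int faulty;       /* -1: all agree, 0..2: index of the odd one out,
                         -2: no two units agree */
} vote_t;

vote_t vote3(uint32_t r0, uint32_t r1, uint32_t r2)
{
    vote_t v = { r0, -1 };
    if (r0 == r1 && r1 == r2)
        return v;
    if (r0 == r1) { v.faulty = 2; return v; }
    if (r0 == r2) { v.faulty = 1; return v; }
    if (r1 == r2) { v.value = r1; v.faulty = 0; return v; }
    v.faulty = -2;   /* total disagreement: reset everything */
    return v;
}
```

A `faulty` index of 0 to 2 tells the supervisor which unit to reset while the other two carry on; -2 is the case where only a full reset makes sense.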

#5

Having worked for about 4-5 years on software/firmware development and environment testing of miniaturized satellites*, I would like to share my experience here.

*(Miniaturized satellites are a lot more prone to single event upsets than bigger satellites because of the relatively small, limited sizes of their electronic components.)

To be very concise and direct: there is no mechanism by which the software/firmware itself can recover from a detectable, erroneous situation without at least one copy of a minimum working version of the software/firmware somewhere for recovery purposes, and with the hardware supporting the recovery being functional.

Now, this situation is normally handled at both the hardware and the software level. Here, as you request, I will share what we can do at the software level.

...recovery purpose... Provide the ability to update/recompile/reflash your software/firmware in the real environment. This is an almost must-have feature for any software/firmware in a highly ionized environment. Without this, you could have as much redundant software/hardware as you want, but at some point they are all going to blow up. So, prepare this feature!
...minimum working version... Have multiple responsive copies of a minimum version of the software/firmware in your code. This is like Safe mode in Windows. Instead of having only one, fully functional version of your software, have multiple copies of the minimum version of your software/firmware. The minimum copy will usually be much smaller than the full copy and almost always has only the following two or three features:
1. capable of listening to commands from an external system,
2. capable of updating the current software/firmware,
3. capable of monitoring the basic operation's housekeeping data.
...copy... somewhere... Have redundant software/firmware somewhere.
1. You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical software/firmware images at separate addresses which send heartbeats to each other, but only one is active at a time. If one or more software/firmware images is known to be unresponsive, switch to another image. The benefit of this approach is that a functional replacement is available immediately after an error occurs, without any contact with whatever external system/party is responsible for detecting and repairing the error (in the satellite case, this is usually the Mission Control Centre (MCC)).
  Strictly speaking, without redundant hardware, the disadvantage of doing this is that you cannot eliminate every single point of failure. At the very least, you will still have one single point of failure, which is the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), reducing the single points of failure to one without additional hardware is still worth considering. Furthermore, the piece of code for the switching would certainly be much smaller than the code for the whole program, significantly reducing the risk of a single event hitting it.
2. But if you are not doing this, you should have at least one copy in your external system which can come in contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).
3. You could also have the copy in permanent memory storage in your device, which can be triggered to restore the running system's software/firmware.
...detectable erroneous situation... The error must be detectable, usually by the hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to keep such code small, replicated, and independent of the main software/firmware. Its main task is only checking/correcting. If the hardware circuit/firmware is reliable (for example, it is more radiation hardened than the rest, or has multiple circuits/logics), then you might consider doing error correction with it. But if it is not, it is better to use it for error detection only. The correction can be done by an external system/device. For error correction, you could consider a basic error correction algorithm like Hamming/Golay23, because these can be implemented more easily both in circuits and in software. But it ultimately depends on your team's capability. For error detection, normally CRC is used.
...hardware supporting the recovery... Now comes the most difficult aspect of this issue. Ultimately, the recovery requires the hardware which is responsible for the recovery to be at least functional. If the hardware is permanently broken (which normally happens after its total ionizing dose reaches a certain level), then there is (sadly) no way for the software to help in recovery. Thus, hardware is rightly the most important concern for a device exposed to a high radiation level (such as a satellite).
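The "redundant copies" and "CRC for detection" points can be combined in a small boot-time selector, sketched below. The `fw_slot_t` layout and `select_firmware` are hypothetical; only the CRC routine follows a published parameter set (CRC-16/CCITT-FALSE, polynomial 0x1021, initial value 0xFFFF). A real bootloader would read the slots from flash rather than from RAM arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hypothetical descriptor for one stored firmware copy. */
typedef struct {
    const uint8_t *image;
    size_t length;
    uint16_t expected_crc;   /* recorded when the image was written */
} fw_slot_t;

/* Return the index of the first slot whose CRC still matches, or -1
 * when every copy is damaged and outside help is needed. */
int select_firmware(const fw_slot_t *slots, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (crc16_ccitt(slots[i].image, slots[i].length)
                == slots[i].expected_crc)
            return (int)i;
    }
    return -1;
}
```

Ordering the slots with the full image first and the minimum "safe mode" images after it gives the fallback behaviour described above; a return of -1 is the case where only the external system can help.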

In addition to the suggestions above for anticipating firmware errors due to single event upsets, I would also suggest that you have:

Error detection and/or error correction in the inter-subsystem communication protocol. This is another almost must-have, in order to avoid acting on incomplete/wrong signals received from other systems.
Filtering on your ADC readings. Do not use an ADC reading directly. Filter it with a median filter, mean filter, or any other filter; never trust a single reading value. Sample more, not less, within reason.
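A median filter for the ADC case takes only a few lines. In this sketch the window of five samples and the name `median5` are arbitrary choices, and how the raw samples are collected is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MEDIAN_WINDOW 5u

/* Median of the last MEDIAN_WINDOW raw samples. A single wild reading
 * (for example from an upset during conversion) cannot reach the output. */
uint16_t median5(const uint16_t samples[MEDIAN_WINDOW])
{
    uint16_t s[MEDIAN_WINDOW];
    for (size_t i = 0; i < MEDIAN_WINDOW; i++)
        s[i] = samples[i];

    /* Insertion sort: tiny and free of library dependencies. */
    for (size_t i = 1; i < MEDIAN_WINDOW; i++) {
        uint16_t key = s[i];
        size_t j = i;
        while (j > 0 && s[j - 1] > key) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = key;
    }
    return s[MEDIAN_WINDOW / 2u];
}
```

With the samples 512, 510, 4095, 511, 513 the outlier 4095 is discarded and the result is 512, whereas a mean filter would have been pulled well above the true level.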

#6

It may be possible to use C to write programs that behave robustly in such environments, but only if most forms of compiler optimization are disabled. Optimizing compilers are designed to replace many seemingly redundant coding patterns with "more efficient" ones. When a programmer tests x==42 even though the compiler knows there is no way x could possibly hold anything else, the compiler may have no clue that the programmer wants to prevent the execution of certain code with x holding some other value, even in cases where the only way it could hold that value would be if the system received some kind of electrical glitch.

Declaring variables as volatile is often helpful, but may not be a panacea. Of particular importance, note that safe coding often requires that dangerous operations have hardware interlocks that require multiple steps to activate, and that code be written using the pattern:

/* ... code that checks system state */
if (system_state_favors_activation)
{
  prepare_for_activation();
  /* ... code that checks system state again */
  if (system_state_is_valid)
  {
    if (system_state_favors_activation)
      trigger_activation();
  }
  else
    perform_safety_shutdown_and_restart();
}
cancel_preparations();

If a compiler translates the code in relatively literal fashion, and if all the checks for system state are repeated after the prepare_for_activation(), the system may be robust against almost any plausible single glitch event, even those which would arbitrarily corrupt the program counter and stack. If a glitch occurs just after a call to prepare_for_activation(), that would imply that activation would have been appropriate (since there's no other reason prepare_for_activation() would have been called before the glitch). If the glitch causes code to reach prepare_for_activation() inappropriately, but there are no subsequent glitch events, there would be no way for code to subsequently reach trigger_activation() without having passed through the validation check or calling cancel_preparations() first [if the stack glitches, execution might proceed to a spot just before trigger_activation() after the context that called prepare_for_activation() returns, but the call to cancel_preparations() would have occurred between the calls to prepare_for_activation() and trigger_activation(), thus rendering the latter call harmless].

Such code may be safe in traditional C, but not with modern C compilers. Such compilers can be very dangerous in that sort of environment because they aggressively strive to include only code which will be relevant in situations that could come about via some well-defined mechanism and whose resulting consequences would also be well defined. Code whose purpose would be to detect and clean up after failures may, in some cases, end up making things worse. If the compiler determines that the attempted recovery would in some cases invoke undefined behavior, it may infer that the conditions that would necessitate such recovery in such cases cannot possibly occur, thus eliminating the code that would have checked for them.
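One partial mitigation, in line with the earlier remark about volatile, is to read the system state through volatile-qualified objects so that each check must actually be performed. The sketch below is a compilable version of the pattern under that assumption. The state flags and counters are stand-ins for real hardware, and, as the answer says, volatile is not a panacea, so the generated assembly should still be inspected.

```c
#include <stdbool.h>

/* Hypothetical state flags; on a real system these would come from
 * hardware registers or independently maintained status words. */
volatile bool state_favors_activation;
volatile bool state_is_valid;

/* Counters standing in for the real hardware actions (assumptions). */
volatile unsigned activations, shutdowns, armed;

static void prepare_for_activation(void) { armed = 1; }
static void cancel_preparations(void)    { armed = 0; }
static void trigger_activation(void)     { if (armed) activations++; }
static void perform_safety_shutdown_and_restart(void) { shutdowns++; }

/* Every read of a volatile object must actually be performed, so the
 * compiler cannot merge the second check into the first. */
void maybe_activate(void)
{
    if (state_favors_activation) {
        prepare_for_activation();
        if (state_is_valid) {
            if (state_favors_activation)
                trigger_activation();
        } else {
            perform_safety_shutdown_and_restart();
        }
    }
    cancel_preparations();
}
```

Note that `trigger_activation` does nothing unless `prepare_for_activation` has armed it, which mirrors the multi-step hardware interlock the answer describes.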