The 5 most destructive bugs in embedded software

By Hai Shalom

Original article: http://www.rt-embedded.com/blog/archives/5-most-destructive-bugs/

===============================================================================================================================

We all make mistakes; it is only natural. However, once a product is running in the field, coding mistakes that were not detected in time can cause:

  • Damage to customers,
  • Massive recalls,
  • Mandatory mass firmware updates,
  • Failures in field tests,
  • Loss of brand credibility,
  • Lots of wasted $$$,
  • Travel to the customer’s/OEM’s site,
  • You losing your position, or your job :)

The error types I am going to present are generic and not tied to a specific architecture. A bug of one of these types may be hidden, hard to find and hard to reproduce. The system may fail randomly or unexpectedly (referred to as the “Voodoo effect”).

We can never guarantee that the software we ship is error free. What can we do to minimize the chance of these errors popping up? Well, just keep reading.

1. Buffer overflow

Buffer overflow is a large family of errors, and could be the result of various bugs:

  1. Static buffer overflow: Writing, reading or copying more data to or from a buffer than it can hold. A buffer could be a string buffer, a structured variable, an array or a native type variable, either global or on the stack.
  2. Dynamic buffer overflow: Same as 1, except that the buffer is allocated dynamically.
  3. Array overflow: Confusion about the last valid array index. An array of size X has indexes 0 to X-1.

How to avoid and detect:

  1. When you write new code, pay attention to the size of each variable or buffer you access.
  2. Pay close attention to memory read/write functions such as memset, memcpy and memmove. Make sure that the destination buffer is large enough for the worst case in terms of data size (maximum name length, maximum number of bytes, largest data type).
  3. When working with string functions such as strcpy and strcat, make sure that the source string is terminated with a NUL character ('\0'); otherwise the function will keep processing memory past your buffer until it finds a NUL byte somewhere.
  4. The same applies to string formatting functions such as sprintf.
  5. When working with strings, prefer the bounded versions of the string functions: strncpy, strncat, snprintf. These take an extra size parameter that limits how many bytes they write, but they do not all treat it the same way. strncpy does not write a terminating NUL when the source is as long as the limit or longer, so pass the buffer length minus 1 and set the last byte to '\0' yourself (or clear the buffer with memset first). strncat always appends a terminator, and its limit is the number of characters to append rather than the buffer size, so pass the remaining space minus 1. snprintf takes the full buffer size and always terminates the output when that size is non-zero.
  6. When working with an array defined with ARRAY_SIZE elements:
    • Never access index ARRAY_SIZE. It is one past the last valid element.
    • In for loops, the loop condition must be index < ARRAY_SIZE, not index <= ARRAY_SIZE.
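
As an illustration of items 5 and 6, here is a minimal sketch of a bounded string copy and a correctly bounded loop. The names copy_name, sum_array, NAME_BUF_SIZE and ARRAY_SIZE are made up for this example and do not come from any particular code base.

```c
#include <stddef.h>
#include <string.h>

#define NAME_BUF_SIZE 8   /* hypothetical buffer size, terminator included */
#define ARRAY_SIZE    4   /* hypothetical array size */

/* Copy src into dst (capacity dst_size bytes) and always terminate dst.
 * Returns 0 on success, -1 on bad arguments or if src was truncated. */
int copy_name(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0)
        return -1;

    /* strncpy does not terminate dst when src is too long, so copy at
     * most dst_size - 1 bytes and write the terminator explicitly. */
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';

    return (strlen(src) < dst_size) ? 0 : -1;
}

/* Sum all entries. Valid indexes are 0 .. ARRAY_SIZE - 1. */
int sum_array(const int values[ARRAY_SIZE])
{
    int sum = 0;

    for (int i = 0; i < ARRAY_SIZE; i++)   /* '<', never '<=' */
        sum += values[i];

    return sum;
}
```

Reporting truncation to the caller is a design choice of this sketch; silently truncating may be acceptable for a display name but usually not for a file path or a command.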

A buffer overflow may result in a segmentation fault (in user space) or a kernel panic (for code running in kernel space). It may also silently corrupt neighboring data, which is even harder to trace.

2. Memory (or resource) leaks

Memory leaks usually kill your system slowly and painfully. After a fresh boot, everything looks OK and the system works fine. Depending on the total amount of RAM and the size of the leak, the system will continue to run and perform well for a certain amount of time. It could be hours, days, or weeks. During this time, the amount of free memory steadily decreases. In Linux-based systems, when memory is required and none is free, the kernel starts paging out programs and clearing the page cache. Further memory requests degrade performance because running tasks are being paged out. Eventually, the kernel triggers the Out-Of-Memory killer and starts killing processes. At this stage, the system may already be unusable. In the last stage the kernel panics and hangs or reboots. Many consumer electronics products in the field remain powered on for their whole life, so you can’t allow any memory leaks, or else the customer will have to power cycle the unit from time to time. Memory leaks are also hard to notice during unit tests or lab tests, because in these scenarios the unit is rebooted constantly as part of the development or testing work.

Memory leaks occur when resources are allocated continuously without being freed. A system can process huge amounts of data, but if we examine it in snapshots, its resource allocation at any given time is bounded. To keep the system functional, each resource must be freed when it is no longer required or when the work on it is done.

Memory leaks can result from any of the following:

  1. Allocating a memory buffer using functions similar to malloc (user) or kmalloc (kernel) and not freeing it.
  2. Opening files, sockets or pipes and not closing them.
  3. A kernel network driver which processes a packet in an skb structure and doesn’t free it when it is finished.

How to avoid and detect:

  1. When a resource is temporarily allocated inside a function, go through every code path and make sure the resource is freed before each return statement. The release is most often forgotten on error paths, where the function exits prematurely. For example, a resource is allocated, and then an if statement checks a condition that may make the function return an error. In the normal case the check passes and the resource is used and freed, but in the error case the function returns early and the resource is never released.
  2. Some functions allocate memory for you and return a pointer to the allocated buffer. Make sure you free it once your processing is complete.
  3. During long test runs, run “free” or “top” in the background and watch for a steady drop in free memory (with free) or a steady growth in a task’s memory usage (with top).
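
One common way to satisfy item 1 in C is to route every exit through a single cleanup label. The sketch below assumes a hypothetical helper, read_first_byte, that needs both a heap buffer and an open file; the function name and the 64-byte buffer size are illustrative only.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns the first byte of the file at path,
 * or -1 on any error. Every exit path goes through the 'out' label,
 * so the buffer and the file handle are always released. */
int read_first_byte(const char *path)
{
    int result = -1;
    FILE *fp = NULL;
    unsigned char *buf = NULL;

    buf = malloc(64);
    if (buf == NULL)
        goto out;

    fp = fopen(path, "rb");
    if (fp == NULL)
        goto out;               /* early exit still frees buf */

    if (fread(buf, 1, 1, fp) != 1)
        goto out;               /* early exit still closes fp, frees buf */

    result = buf[0];

out:
    if (fp != NULL)
        fclose(fp);
    free(buf);                  /* free(NULL) is a no-op */
    return result;
}
```

With a single exit point, adding a new error check later cannot introduce a leak as long as it jumps to the same label.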

3. Ignoring compiler warnings or fixing them without understanding

A common bad practice is to ignore compiler warnings. Indeed, some warnings are not important, but others indicate a real problem. A few examples:

  1. Missing return value.
  2. Incompatible variable assignment.
  3. Mismatch between pointers and variables.
  4. Using an uninitialized variable.

And many other examples.

An even worse practice is to fix warnings without really understanding them, just to make the compiler quiet. A nice story about a developer who “fixed” a warning reported by a code inspection without really understanding what he did can be found here.

How to avoid:

  1. Write simple, well-organized code. Writing “clever” code is not fashionable and makes it hard for everybody else to understand and fix.
  2. Enable the -Wall switch in gcc. Despite its name it does not enable every warning, but it turns on a large set of the most useful ones; -Wextra adds more.
  3. Never finish a module with warnings remaining. If you don’t fix the first one as it appears, soon there will be more and more until there are too many. If you coded the module, you know best, right now, how to fix it. Other people might not understand what your intention was, and even you may forget it after a while.
  4. Once the module is warning-free, it is recommended to configure gcc to treat all warnings as errors using the -Werror switch. This will keep the module warning-free for good.
  5. Never “fix” a warning that you have doubts about. Many warnings are straightforward, but some may require you to ask the person who wrote the module or to consult other people.
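
To make the difference between fixing a warning and silencing it concrete, here is a small sketch built around a hypothetical link_is_up function. With -Wall, gcc flags an assignment used as a condition and suggests adding parentheses around it. Adding the parentheses makes the warning go away and leaves the bug in place; the real fix is to compare instead of assign.

```c
#include <stdbool.h>

/* Hypothetical link states, for illustration only. */
enum link_state { LINK_DOWN = 0, LINK_UP = 1 };

/*
 * Buggy original: '=' assigns instead of comparing, so the condition
 * is true for every input and 'state' is overwritten:
 *
 *     if (state = LINK_UP)
 *         return true;
 *
 * "Quiet" fix that keeps the bug (the extra parentheses only tell the
 * compiler that the assignment is intentional):
 *
 *     if ((state = LINK_UP))
 *         return true;
 */
bool link_is_up(enum link_state state)
{
    if (state == LINK_UP)   /* real fix: compare, do not assign */
        return true;

    return false;
}
```

A build line such as gcc -Wall -Wextra -Werror would have refused to compile the buggy version in the first place.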

4. Race conditions

A race condition is a scenario in which two or more contexts use the same resource in parallel without coordinating with each other. A race condition can result from:

  1. Two or more contexts accessing the same data, variable or memory at the same time.
  2. Two or more contexts calling a non-reentrant function at the same time.

Contexts can be either kernel threads or user-space threads inside a process. Separate processes are better protected from each other, but they are exposed to the same issues when two or more processes use a shared memory region or call a non-reentrant function in a shared library.

Race conditions can cause random and unpredictable system behavior due to mutual data corruption (one context may corrupt the data or the flow of another), and such bugs can be hard to detect or reproduce.

How to avoid:

  1. Try to avoid creating and using global variables.
  2. Protect global variables and shared memory regions with semaphores or mutexes. Each context “locks” the resource until it finishes its processing (beware of deadlocks).
  3. Write your functions to be reentrant. If a function accesses a shared resource and can’t be made reentrant, guard it with a mutex so that only one context can use it at a time. A second context calling the same function will sleep until the mutex is released.
  4. Minimize the use of external non-reentrant functions (such as strtok); use the reentrant alternatives (such as strtok_r) instead.
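
Items 2 and 4 might look like the following in user space with POSIX threads. This is only a sketch: the shared counter packet_count and the function names are invented for the example, and kernel code would use the kernel’s own locking primitives instead.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r under strict -std modes */
#include <pthread.h>
#include <string.h>

/* Hypothetical shared state: a counter updated by several threads. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long packet_count;

/* Increment the shared counter while holding the lock. */
void count_packet(void)
{
    pthread_mutex_lock(&counter_lock);
    packet_count++;
    pthread_mutex_unlock(&counter_lock);
}

/* Read the shared counter while holding the same lock. */
unsigned long get_packet_count(void)
{
    unsigned long value;

    pthread_mutex_lock(&counter_lock);
    value = packet_count;
    pthread_mutex_unlock(&counter_lock);
    return value;
}

/* Count the tokens in text. strtok_r keeps its position in the
 * caller-provided 'save' pointer rather than in hidden static state,
 * so several threads can tokenize different strings at once.
 * Note that text is modified in place, as with strtok. */
int count_tokens(char *text, const char *delims)
{
    int count = 0;
    char *save = NULL;

    for (char *tok = strtok_r(text, delims, &save);
         tok != NULL;
         tok = strtok_r(NULL, delims, &save))
        count++;

    return count;
}
```

Keeping the lock and the data it protects next to each other, and touching the data only through small accessor functions, makes it harder to forget the lock on one path.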

5. Alignment traps

Alignment traps are crashes caused by misaligned memory accesses on 32-bit CPUs that require aligned access. They typically show up when porting code to a new architecture, or when raw network packets or variable-length stored data are accessed the wrong way. The error occurs when the CPU is asked to read a 32-bit variable from an odd address (valid only for 8-bit variables) or from an address that is only 2-byte aligned (valid for 16-bit variables).

For example, suppose you receive an IPv4 packet and you want to read the first 32 bits of the IP header (which contain information such as the version and the total length). The IP portion of the packet follows the Ethernet header, which is 14 bytes long (6 destination address, 6 source address, 2 type). If you access the IP header directly through a 32-bit pointer, you will get an alignment trap, because the IP data starts at offset 14, while the CPU can read a 32-bit variable only from offset 12 or 16 (4-byte alignment, assuming the frame buffer itself is 4-byte aligned).

As another example, suppose there is a function that retrieves data from a database. The data size could be anywhere from 1 byte to 1024 bytes, and the function returns a void pointer. If you cast this pointer to a 32-bit pointer (such as a pointer to unsigned int) in order to read the values, you will get an alignment trap when the function returns a pointer to 1 byte (char) or 2 bytes (short) of data that is not 4-byte aligned. While the first example is easy to track and fix (it always happens), the latter might be difficult, depending on how often the function really returns a pointer to such data.

How to avoid and detect:

  1. Be careful with pointer casting. If there is a chance that the pointer points to an array of bytes or shorts, or comes from a packed structure, you must not cast it to a pointer to a 32-bit type. Instead, copy the data it points to into a properly aligned local variable (for example with memcpy) and do the processing on that copy.
  2. Don’t ignore compiler warnings about pointer type mismatches. Make sure that the right pointer types are used.
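
For the Ethernet/IP example above, the copy-instead-of-cast advice could be sketched as follows. The helper names load_u32, load_be32 and ip_first_word are hypothetical; the 14-byte Ethernet header length is the one given in the text.

```c
#include <stdint.h>
#include <string.h>

#define ETH_HEADER_LEN 14   /* 6 destination + 6 source + 2 type */

/* Alignment-safe load: copy the bytes into an aligned local variable
 * instead of dereferencing a casted pointer such as
 *     *(const uint32_t *)p      <- may trap when p is misaligned
 * The result is in the byte order of the underlying data. */
uint32_t load_u32(const void *p)
{
    uint32_t value;

    memcpy(&value, p, sizeof(value));
    return value;
}

/* Alignment-safe load that also converts from network (big-endian)
 * byte order to a host integer, one byte at a time. */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* First 32 bits of the IPv4 header in a raw Ethernet frame:
 * version (4 bits), header length (4), type of service (8),
 * total length (16). */
uint32_t ip_first_word(const uint8_t *frame)
{
    return load_be32(frame + ETH_HEADER_LEN);
}
```

From the returned word, the version is the top 4 bits (word >> 28) and the total length is the low 16 bits (word & 0xFFFF). Compilers commonly turn a fixed-size memcpy like the one in load_u32 into a plain load on CPUs where that is safe, so the safe form usually costs little.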

Further actions to take

Understanding these errors and the ways to avoid them is the first step towards better software. The following actions will also help you avoid critical errors and increase the system’s stability and robustness:

  1. Create design documents. They will help you with the implementation and later with debugging.
  2. Hold code reviews. Sometimes, another pair of eyes will see things you missed.
  3. If you have a doubt, consult other people.
  4. Keep the code simple and organized. “Clever” code is not appreciated.
  5. There are code inspection utilities which you can use; some are open source (such as Valgrind, which checks a program while it runs), and some are commercial. These tools examine your code and provide a detailed report about potential bugs. Note that there are many false alarms, and some reported issues could be there on purpose.
  6. And last, but not least: read my blog :)
