What is a segmentation fault?
A segmentation fault is one of the most common error conditions. It occurs when your program tries to access either an invalid memory location, or a memory location it is not allowed to access.
A few examples for this could be:
- Dereferencing a NULL pointer.
- Dereferencing an uninitialized pointer.
- Accessing memory with a wrong alignment.
- Writing to a read-only area.
- Writing or reading beyond program allocated resources (buffer overflow).
- Memory corruption/overrun.
For our examples, let’s use the following function that simply writes the character ‘R’ to a given location provided by ptr:
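A minimal sketch of such a function. The name rte_test_ptr is the one used later in this article; the exact signature is an assumption.

```c
/* Writes the character 'R' to the location given by ptr.
 * The name rte_test_ptr is taken from later in the article;
 * the exact signature is an assumption. */
void rte_test_ptr(char *ptr)
{
    *ptr = 'R';
}
```

With a valid pointer the call simply stores one byte; the crashes in the examples come from the pointer that is passed in, not from the function itself.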
All the following code examples will cause the program to terminate and the message “Segmentation Fault” to appear on the console.
Example 1: Dereference a NULL pointer:
Example 2: Write to a read-only location:
Example 3: Dereference an uninitialized pointer:
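The first two examples can be sketched as follows, assuming a Linux system. The helper run_and_get_signal( ) is a hypothetical addition: it runs each example in a child process so the parent can report which signal killed it instead of crashing itself. Example 3 is left out of the sketch because reading an uninitialized pointer is undefined behavior; it usually crashes, but the outcome depends on whatever value happens to be on the stack.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same role as the article's helper: write 'R' through ptr. */
static void rte_test_ptr(char *ptr)
{
    *ptr = 'R';
}

/* Example 1: dereference a NULL pointer. */
void example_null(void)
{
    char *volatile ptr = NULL;  /* volatile keeps the compiler from folding the NULL */
    rte_test_ptr(ptr);
}

/* Example 2: write to a read-only location (a string literal). */
void example_read_only(void)
{
    char *volatile ptr = (char *)"read-only string";
    rte_test_ptr(ptr);
}

/* Hypothetical helper: runs fn in a child process and returns the number of
 * the signal that killed the child, 0 if it exited normally, -1 on error. */
int run_and_get_signal(void (*fn)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling run_and_get_signal(example_null) on Linux is expected to return SIGSEGV. The signal reported for the read-only write can differ between operating systems.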
What other crash types do we usually see?
Another common error is an alignment trap (also referred to as a bus error). A bus error occurs when the CPU tries to access a 32-bit or 16-bit variable at an unaligned memory address.
See the article about alignment traps for more details.
Another type of crash is caused by an illegal instruction. Normally, this cannot happen because the compiler generates only legal instructions. However, if your program uses callbacks, it can happen when the program jumps to an uninitialized callback address.
The last type of error covered in this article is the floating point exception. If your program performs an illegal arithmetic operation (such as division by zero), the program will terminate and the message “Floating point exception” will appear on the console. Although the words “Floating point” are used, this type of error also covers errors caused by arithmetic operations on integers.
What happens when the program performs an illegal operation?
When one of the above happens, the kernel sends a fault signal (or an exception) to the program. A fault signal is a special signal that tells the program that something bad has happened, and it needs to be terminated. There are four fault signals in the system:
- SIGSEGV: In case of Segmentation fault
- SIGBUS: In case of Alignment trap
- SIGILL: In case of Illegal instruction
- SIGFPE: In case of Illegal arithmetic operation.
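The list above can be expressed as a small lookup, which a handler could use to pick a message. This is only an illustrative sketch; the function name and the exact wording of the strings are assumptions.

```c
#include <signal.h>

/* Returns a short description of a fault signal, following the list above.
 * The function name and the strings are illustrative assumptions. */
const char *fault_signal_description(int signo)
{
    switch (signo) {
    case SIGSEGV: return "Segmentation fault";
    case SIGBUS:  return "Alignment trap (bus error)";
    case SIGILL:  return "Illegal instruction";
    case SIGFPE:  return "Illegal arithmetic operation";
    default:      return "Not a fault signal";
    }
}
```

The function only returns pointers to constant strings, so it does not call anything that is unsafe inside a signal handler.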
Each signal needs to be handled and acknowledged. There are also the signals SIGQUIT and SIGINT, which are not caused by an error but must also be handled because the program needs to be terminated. If the program did not install such handlers, the default handlers for the fault signals just write a short error message, such as “Segmentation fault”, and then terminate the program. Unfortunately, this is not enough information to find and fix the bug, and the system may become unstable or unusable once the program has been terminated. Furthermore, this short error message might get lost among the many other messages which appear on the console while the system is running. Note that this behavior is usually unacceptable in a project which is meant for mass production (the user must manually power cycle the unit in this case).
Enabling core dump
During debugging, it is possible to enable core dumps, which can be parsed offline on a host machine using gdb. The core dump contains useful information about the last known whereabouts of the program. Further information is available in the Enabling core dumps post.
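One common way to enable core dumps from inside the program is to raise the RLIMIT_CORE resource limit with setrlimit( ). The sketch below is an assumption about how this could be done, not the method from the linked post; the function name is hypothetical.

```c
#include <sys/resource.h>

/* Raises the soft core-file size limit to the hard limit for this process.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;  /* the soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit is already 0, core dumps stay disabled even after this call succeeds. Where the core file is written, and under which name, depends on the system configuration.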
Handling fault signals, and extracting information from them
In each system, it is crucial to install exception handlers for these fault signals, because you can never know when a program might crash (trust me, it usually happens on the customer’s premises or during some qualification tests). The PCD provides an easy and convenient way of registering its exception handlers, which provide a lot of useful debug information (see the next paragraph for more details). If you want to write your own exception handler, you must consider the following issues:
- The exception can occur at any time; therefore, the handler must be written carefully.
- Many ANSI-C functions are not signal-safe (including printf( )…); therefore, your signal handler must use only signal-safe functions. A list of signal-safe functions can be found here.
- Don’t try to call your main( ) function from your signal handler. It may appear that you’ve revived your program, but it is unclear what will be the consequences, because your program’s stack or heap could have been corrupted.
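- Your exception handler must terminate the program once it has completed its work. Prefer _exit( ) over exit( ): exit( ) is not a signal-safe function because it runs atexit( ) handlers and flushes stdio buffers.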
The next step is to write the exception handler. There are two types of exception handlers you can use: a standard exception handler, which receives only the signal number, or an enhanced exception handler, which also receives more information that may differ between architectures: a pointer to a siginfo_t and a pointer to a ucontext_t (cast to void *). Let’s take a look at the sigaction structure, which is defined in signal.h:
If we want a simple handler, we specify our handler in the sa_handler field. If we want the enhanced handler, we set the SA_SIGINFO flag in sa_flags and specify our handler in sa_sigaction. The following macros can be used for registering exception handlers:
The first macro is for the simple handler and the second macro is for the enhanced handler. The macros use the sigaction( ) function.
Once your handler has been activated, it means that an illegal operation has occurred. If you enabled the enhanced handler, it is possible to read the siginfo_t structure which contains information about the crash. This structure is large and contains a number of unions. I would like to mention only the most relevant members:
This structure provides information about the signal number and its code (See the siginfo.h header code for textual information about each signal code), the last known errno and the address which caused the error condition in the si_addr pointer. This address in this case is the address that the CPU was trying to access and not an address of the instruction (which is held by the Program Counter). The ucontext_t structure contains a list of the core register sets and their last known values (such as the Program Counter). It varies between architectures.
What do we do with this information?
We can now understand the nature and root cause of the error, and we know the fault address that caused it. From the ucontext_t structure, we can extract the Program Counter (the last known execution address) and the Link Register (the return address on ARM architectures). Theoretically speaking, we could print this information to the console using the printf( ) function. However, as mentioned above, this function is not signal-safe and therefore should not be used, although I have seen some signal handlers that do use printf to print it.
So how can we do it right? The solution is to have a crash daemon which listens on a socket. Our signal handler sends this information to the socket, and the daemon prints it for us. The socket API is signal-safe and can be used without a problem. There is an easier way to do it, and I’ll present it next.
Let’s say for now that we extracted the value of the Program Counter from the ucontext_t structure, and that the CPU was executing the instruction at address 0x8548. Assuming we compiled our program with debug symbols, we can use the objdump utility and ask it to show the mixed assembly and C code (using the -S option) around this address. Here’s a snippet from the output:
The line in red caused the Segmentation Fault crash. An object file with debug symbols is displayed in mixed mode, with C and assembly interleaved. If the object was compiled without symbols, the only information we can extract near the address is the function name.
It is also possible to extract the file name and line number from the address number using the addr2line utility. Here’s an example:
There are cases where the fault address is not inside the program’s code, but inside a shared library’s code. Shared library code is mapped at much higher addresses than the program code. We could use the maps file in the proc filesystem to determine where the last instruction came from. However, in order to print it, we’ll have to reboot the system, because once the program has crashed, its proc entry is already gone. After we have rebooted and figured out the new PID, we print its maps file with the command “cat /proc/<PID>/maps”. Here’s an example output:
Look for the “x” symbol in the permission column for code sections. In this example, possible Program Counter locations could theoretically reside in the ranges 0x8000-0x9000, 0x4000000-0x4005000, 0x400e000-0x4023000, 0x402c000-0x4067000, and 0x4075000-0x407f000. Matching the Program Counter and Link Register to one of these ranges identifies the faulty code section. We’ll see an example later.
In this example, we can see what the last executed instruction was, but not the cause of the problem. The signal info structure also contains the signal code and the registers, which could help us figure out what the root cause is. How can we extract this information easily? Well, just continue to read.
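The range matching described above can be sketched in code. The function and structure names are hypothetical, and the sketch assumes the usual Linux layout of a maps line (address range, permissions, offset, device, inode, optional path).

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<PID>/maps. */
struct map_entry {
    unsigned long start;   /* first address of the mapping */
    unsigned long end;     /* first address past the mapping */
    char perms[5];         /* for example "r-xp" */
    unsigned long offset;  /* offset into the mapped file */
    char path[256];        /* mapped file, empty for anonymous mappings */
};

/* Parses one maps line. Returns 0 on success, -1 if the line is malformed. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    int fields;

    e->path[0] = '\0';
    fields = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
                    &e->start, &e->end, e->perms, &e->offset, e->path);
    return fields >= 4 ? 0 : -1;  /* the path is optional */
}

/* Returns 1 if pc lies inside the mapping e and the mapping is executable. */
int pc_in_executable_mapping(const struct map_entry *e, unsigned long pc)
{
    return strchr(e->perms, 'x') != NULL && pc >= e->start && pc < e->end;
}
```

For example, given the made-up line "0402c000-04067000 r-xp 00000000 1f:02 345 /lib/libexample.so" (the range is one of those listed above; the offset, device, inode and path are hypothetical), an address such as 0x402c4a4 matches, while 0x8548 does not.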
How can PCD help debugging, resolving and preventing crashes?
When a program crashes due to an exception, it will be terminated once the error message is displayed. This crash will not trigger any recovery action, and your system will probably become unstable or unusable. The PCD can help here in two fields:
Enhanced debugging capabilities and system recovery. Once a program registers the PCD exception handlers, they provide more information about the crash, including the Program Counter, Link Register (return address), all the other registers, the last value of errno and the maps file of the process, captured right before it was terminated, without the need to reboot the system. The maps file will help you analyze the location of the error just by looking at the PCD’s error report. The PCD will also trigger a recovery action once the crash is detected, and return the system to functional mode. The crash information is also saved to non-volatile storage for later/offline analysis. Let’s take the piece of code from example 2, and instrument it with the PCD exception handlers (see how easily it is done):
Let’s configure a simple PCD rule to start and monitor a program:
Here is the output on the console once PCD has started this rule. Note that the selected recovery action here was “Reboot”, that’s why the system is rebooting right after the crash. Pay attention to the bolded red line:
As we can see, the details of this crash provided by the PCD can help you find and resolve this issue easily and quickly. Let’s extract the file name and line number from the address using the addr2line utility mentioned earlier:
Now let’s get back to the objdump’s output for a more detailed output:
We can see that the red line is the last instruction that the CPU performed before the crash, and the return address (given by the Link Register) is marked by the orange line. In many cases, a function is called from various locations; the Link Register value tells us which specific location it was, and is therefore also important. From the code, we can see that r4 is loaded with address 0x857c and that the value at this address is passed to our rte_test_ptr function. We can use objdump to look up the variable’s name by grepping for the value of the word:
Now we also know the variable name. This was possible only because the variable is global and not on the stack.
Suppose the faulty function is inside a shared library and not in our program. How can we debug this?
Let’s repeat the example, and place the rte_test_ptr inside a shared library. After we compile and link, we run the program again. We’ll reexamine the crash log:
Now we can see that, unlike the previous example, the Program Counter is at a high address (0x0402c4a4 according to the log). We understand that it resides outside of the program’s code, somewhere in one of the linked shared libraries. As we already saw, we can determine which library executed the bad instruction by matching the PC value to the executable address ranges in the maps file. In this example, the address is in the executable range of /lib/libsegv.so, which is the library made for the purpose of this example. In order to find the problem inside the library, we need to calculate the offset by subtracting the library’s base address from the PC value. In this example, we do: 0x0402c4a4 - 0x402c000 = 0x4a4, and this is the address of the problematic code inside the library. Let’s use the objdump utility again, but now we’ll specify the library name, and not the program’s name. We can truncate the output by using the grep utility to match the address we calculated and a few lines before and after the match (use the -A and -B options):
In red we see the instruction that caused the crash, as expected, inside the rte_test_ptr function we moved to a shared library.
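The offset calculation used for the shared library can be written as a one-line helper. The function name is hypothetical. The file_offset parameter is an addition not discussed in the text: it is the offset column of the matching maps line, and with a value of 0 the function reduces to the plain subtraction used in the example.

```c
/* Translates a runtime Program Counter into an offset inside the mapped
 * library. map_start is the start address of the matching maps line and
 * file_offset is its offset column (0 in the article's example). */
unsigned long pc_to_library_offset(unsigned long pc,
                                   unsigned long map_start,
                                   unsigned long file_offset)
{
    return pc - map_start + file_offset;
}
```

With the values from the crash log, pc_to_library_offset(0x0402c4a4, 0x402c000, 0) gives 0x4a4, the address to look for in the library's objdump output.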
In conclusion, we now know how to:
- Extract and understand the crash information provided to a fault signal handler.
- Find a bad instruction inside our program and inside a shared library.
Now we have all the required information to fix this crash. It can be done manually, and it can be done using the PCD.
Memory corruption
Crashes and segmentation faults may also be the result of memory corruption. There could be a case where some code unintentionally changes a memory region which it does not own, thus causing a mess for the rightful owner. Read here how to debug such errors.
Resources:
http://linux.die.net/man/2/signal
http://linux.die.net/man/2/sigaction
http://sourceforge.net/projects/pcd/