Resolving/Debugging user space crashes and segmentation faults

Latest recommended article published on 2023-03-24 20:32:49

咕噜咕噜斯基

By Hai Shalom

Original article: http://www.rt-embedded.com/blog/archives/resolving-crashes-and-segmentation-faults/

==============================================================================================================================

What is a segmentation fault?

A segmentation fault is one of the most common error conditions. It occurs when your program tries to access either an invalid memory location or a memory location it is not allowed to access.

A few examples for this could be:

Dereferencing a NULL pointer.
Dereferencing an uninitialized pointer.
Accessing memory with a wrong alignment.
Writing to a read-only area.
Writing or reading beyond program allocated resources (buffer overflow).
Memory corruption/overrun.

For our examples, let’s use the following function that simply writes the character ‘R’ to a given location provided by ptr:

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
}

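Each of the following code examples will typically cause the program to terminate and the message “Segmentation Fault” to appear on the console (the outcome of Example 3 depends on whatever value the uninitialized pointer happens to hold).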

Example 1: Dereference a NULL pointer:

int main( int argc , char *argv[] )
{
    rte_test_ptr(NULL);
    return 0;
}

Example 2: Write to a read-only location:

char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    rte_test_ptr(ro_ptr);
    return 0;
}

Example 3: Dereference an uninitialized pointer:

int main( int argc , char *argv[] )
{
    char *uninit_ptr;

    rte_test_ptr(uninit_ptr);
    return 0;
}

What other crash types do we usually see?

Another common error is the alignment trap (also referred to as a bus error). A bus error occurs when the CPU tries to access a 32-bit or 16-bit variable at an unaligned memory address.

See the article about alignment traps for more details.

Another type of crash is caused by an illegal instruction. Normally, this cannot happen because the compiler generates only legal instructions. However, if your program uses callbacks, it can happen when the program jumps to an uninitialized callback address.

The last type of error covered in this article is the floating point exception. If your program performs an illegal arithmetic operation (such as division by zero), the program will terminate and the message “Floating point exception” will appear on the console. Although the words “Floating point” are used, this type of error also covers errors caused by arithmetic operations on integers.

What happens when the program performs an illegal operation?

When one of the above happens, the kernel sends a fault signal (or an exception) to the program. A fault signal is a special signal that tells the program that something bad has happened, and it needs to be terminated. There are four fault signals in the system:

SIGSEGV: In case of Segmentation fault
SIGBUS: In case of Alignment trap
SIGILL: In case of Illegal instruction
SIGFPE: In case of Illegal arithmetic operation.

Each signal needs to be handled and acknowledged. The signals SIGQUIT and SIGINT are not caused by an error, but they must also be handled because the program needs to be terminated. If the program did not install its own handlers, the default handlers for the fault signals just write a short error message, such as “Segmentation fault”, and then terminate the program. Unfortunately, this is not enough information to find and fix the bug, and the system may become unstable or unusable once the program has been terminated. Furthermore, this short error message might get lost among the many other messages that appear on the console while the system is running. Note that this behavior is usually unacceptable in a project meant for mass production (the user would have to power cycle the unit manually in this case).
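The default behaviour described above can be observed from the outside. The following is a minimal sketch, not taken from the original article: a parent runs the faulty code in a child process and asks waitpid( ) which signal killed it. The function names rte_write_null and rte_fatal_signal_of are hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same idea as Example 1: write through a NULL pointer. The pointer is
 * volatile so the compiler cannot reason about its value and must emit
 * the store. */
void rte_write_null(void)
{
    char *volatile ptr = NULL;
    *ptr = 'R';
}

/* Runs fn() in a child process. Returns the number of the signal that
 * terminated the child, 0 if the child exited normally, -1 on error. */
int rte_fatal_signal_of(void (*fn)(void))
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: disable core files so the experiment leaves no debris. */
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);

        fn();
        _exit(0); /* Reached only if fn() did not crash. */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a typical Linux system, rte_fatal_signal_of(rte_write_null) is expected to report SIGSEGV, which is the same information a supervising process needs in order to detect the crash.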

Enabling core dump

During debugging, it is possible to enable core dumps, which can be analyzed offline on a host machine using gdb. The core dump contains useful information about the last known whereabouts of the program. Further information is available in the Enabling core dumps post.
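Core dumps are commonly enabled from the shell (for example with ulimit -c), but a program can also raise its own limit at startup. The sketch below assumes a Linux or other POSIX system; rte_enable_core_dumps is a hypothetical name, and the example only raises the soft limit up to whatever hard limit the system already allows.

```c
#include <sys/resource.h>

/* Raises the soft core-file size limit to the hard limit for the calling
 * process. Returns 0 on success, -1 on failure (errno is set). If the
 * hard limit itself is 0, no core file will be written. */
int rte_enable_core_dumps(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return -1;

    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit);
}
```

Calling this early in main( ) keeps the setting local to the program instead of depending on how the program was launched.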

Handling fault signals, and extracting information from them

In each system, it is crucial to install exception handlers for these fault signals, because you can never know when a program might crash (trust me; usually it happens in the customer’s premises or during some qualification tests). The PCD provides an easy and convenient way for registering its exception handlers, which provide a lot of useful debug information (See the next paragraph for more details). In case you want to write your own exception handler, you must consider the following issues:

The exception can occur anytime, therefore, the handler must be written carefully.
Many ANSI-C functions are not signal-safe (including printf( )…), therefore, your signal handler must use only signal safe functions. A list of signal safe functions can be found here.
Don’t try to call your main( ) function from your signal handler. It may appear that you’ve revived your program, but it is unclear what will be the consequences, because your program’s stack or heap could have been corrupted.
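Your exception handler must terminate the program once it has completed its work. Prefer _exit( ) over exit( ): exit( ) runs atexit handlers and flushes stdio buffers, so it is not signal-safe.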

The next step is to write the exception handler. There are two types of exception handlers you can use: a standard exception handler, which receives only the signal number, or an enhanced exception handler, which also receives a pointer to a siginfo_t and a pointer to a ucontext_t (cast to void *); the contents of these may differ between architectures. Let’s take a look at the sigaction structure, which is defined in signal.h:

/* Structure describing the action to be taken when a signal arrives.  */

struct sigaction
{
    /* Signal handler.  */

    union
    {
         /* Used if SA_SIGINFO is not set.  */
         void (*sa_handler) (int);

         /* Used if SA_SIGINFO is set.  */
         void (*sa_sigaction) (int, siginfo_t *, void *);

    }  __sigaction_handler;

    /* Additional set of signals to be blocked.  */
    __sigset_t sa_mask;

    /* Special flags.  */
    int sa_flags;

    /* Restore handler.  */
    void (*sa_restorer) (void);

};

If we want a simple handler, we specify it in the sa_handler field. If we want the enhanced handler, we set the SA_SIGINFO flag in sa_flags and specify our handler in sa_sigaction. The following macros can be used for registering exception handlers:

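/* Register a simple handler: func receives only the signal number. */
#define SETSIG(sa, sig, func) \
    do { memset( &(sa), 0, sizeof( struct sigaction ) ); \
         (sa).sa_handler = (func); \
         (sa).sa_flags = SA_RESTART; \
         sigaction( (sig), &(sa), NULL ); \
    } while (0)

/* Register an enhanced handler: func also receives siginfo_t and context. */
#define SETSIGINFO(sa, sig, func) \
    do { memset( &(sa), 0, sizeof( struct sigaction ) ); \
         (sa).sa_sigaction = (func); \
         (sa).sa_flags = SA_RESTART | SA_SIGINFO; \
         sigaction( (sig), &(sa), NULL ); \
    } while (0)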

The first macro is for the simple handler and the second macro is for the enhanced handler. The macros use the sigaction( ) function, so they need signal.h, and string.h for memset( ).

Once your handler has been activated, it means that an illegal operation has occurred. If you enabled the enhanced handler, it is possible to read the siginfo_t structure which contains information about the crash. This structure is large and contains a number of unions. I would like to mention only the most relevant members:

typedef struct siginfo
{
    int si_signo;   /* Signal number.  */
    int si_errno;   /* If non-zero, an errno value associated with this signal,
                       as defined in <errno.h>.  */
    int si_code;    /* Signal code.  */

    /* SIGILL, SIGFPE, SIGSEGV, SIGBUS.  */
    struct {
        void *si_addr;  /* Faulting insn/memory ref.  */
    } _sigfault;

} siginfo_t;

This structure provides the signal number and its code (see the siginfo.h header for a textual description of each signal code), the last known errno, and, in the si_addr pointer, the address that caused the error condition. For a segmentation fault, this is the address that the CPU was trying to access, not the address of the faulting instruction (which is held in the Program Counter). The ucontext_t structure contains the core register sets and their last known values (such as the Program Counter). It varies between architectures.
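As an illustration of reading these structures, the sketch below reports si_addr and the Program Counter without using printf( ). It is not the original author's handler: all rte_ names are hypothetical, the register field names are the ones glibc exposes for 32-bit ARM and x86-64, and on any other architecture the sketch simply reports 0 for the Program Counter.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* Exposes the register index names on glibc/x86-64. */
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <ucontext.h>
#include <unistd.h>

#define RTE_HEX_LEN (2 + 2 * sizeof(unsigned long))

/* Formats value as "0x" followed by fixed-width hex digits, without
 * printf. buf must hold RTE_HEX_LEN + 1 bytes. Returns RTE_HEX_LEN. */
size_t rte_format_hex(unsigned long value, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    const size_t ndigits = 2 * sizeof(unsigned long);
    size_t i;

    buf[0] = '0';
    buf[1] = 'x';
    for (i = 0; i < ndigits; i++) {
        buf[2 + ndigits - 1 - i] = digits[value & 0xf];
        value >>= 4;
    }
    buf[2 + ndigits] = '\0';
    return 2 + ndigits;
}

/* Returns the Program Counter saved in the context, or 0 when this
 * sketch does not know the layout for the target architecture. */
unsigned long rte_context_pc(const void *context)
{
    const ucontext_t *uc = (const ucontext_t *)context;
#if defined(__arm__)
    return (unsigned long)uc->uc_mcontext.arm_pc;
#elif defined(__x86_64__) && defined(REG_RIP)
    return (unsigned long)uc->uc_mcontext.gregs[REG_RIP];
#else
    (void)uc;
    return 0;
#endif
}

/* Writes "label" + hex value + newline to stderr using write() only. */
static void rte_write_field(const char *label, size_t label_len,
                            unsigned long value)
{
    char hex[RTE_HEX_LEN + 1];
    size_t hex_len = rte_format_hex(value, hex);

    if (write(STDERR_FILENO, label, label_len) < 0)
        return;
    if (write(STDERR_FILENO, hex, hex_len) < 0)
        return;
    if (write(STDERR_FILENO, "\n", 1) < 0)
        return;
}

/* Enhanced handler: reports the fault address and the PC, then exits. */
void rte_enhanced_fault_handler(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    rte_write_field("fault address: ", 15, (unsigned long)info->si_addr);
    rte_write_field("pc: ", 4, rte_context_pc(context));
    _exit(1);
}
```

The handler is meant to be registered with SA_SIGINFO set, so that the siginfo_t and context arguments are actually filled in.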

What do we do with this information?

We can now understand the nature and root cause of the error, and we know the fault address that caused it. From the ucontext_t structure, we can extract the Program Counter (the last known execution address) and the Link Register (the return address on the ARM architecture).

Theoretically speaking, we could print this information to the console using the printf( ) function. However, as mentioned above, this function is not signal-safe and therefore should not be used, although I have seen some signal handlers that do use printf( ) for this. So how can we do it right? One solution is to have a crash daemon that listens on a socket. The signal handler sends the information to the socket, and the daemon prints it for us. The socket API is signal-safe and can be used without a problem. There is an easier way to do it, and I’ll present it next.

Let’s say for now that we extracted the value of the Program Counter from the ucontext_t structure, and that the CPU was executing the instruction at address 0x8548. Assuming we compiled our program with debug symbols, we can use the objdump utility and ask it to show mixed assembly and C code (using the -S option) around this address. Here’s a snippet from the output:

# armeb-linux-uclibceabi-objdump -S segv

...
00008544 <rte_test_ptr>:
#include <stdio.h>
#include <pcdapi.h>

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
    8544:       e3a03052        mov     r3, #82 ; 0x52
    8548:       e5c03000        strb    r3, [r0]
}
    854c:       e12fff1e        bx      lr
...

The strb instruction at address 8548 caused the Segmentation Fault crash. An object file with debug symbols is presented in mixed mode, with C and assembly interleaved. If the object was compiled without symbols, the only information we can extract near the address is the function name.

It is also possible to extract the file name and line number from the address number using the addr2line utility. Here’s an example:

# armeb-linux-uclibceabi-addr2line -e segv -f 8548
rte_test_ptr
/home/hai/rte/segv.c:7

There are cases where the fault address is not inside the program’s code, but inside a shared library’s code. Shared library code is mapped at much higher addresses than the program code. We can use the maps file in the proc filesystem to determine where the last instruction came from. However, once the program has crashed, its proc entry is already gone, so to print it we have to reboot the system. After the reboot, we figure out the new PID and print its maps file with the command “cat /proc/<PID>/maps”. Here’s an example output:

# cat /proc/204/maps
00008000-00009000 r-xp 00000000 1f:07 59         /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:07 59         /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 231        /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 231        /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 231        /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175        /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175        /lib/libticc.so
0402c000-04067000 r-xp 00000000 1f:06 200        /lib/libuClibc-0.9.29.so
04067000-0406e000 ---p 04067000 00:00 0
0406e000-0406f000 r--p 0003a000 1f:06 200        /lib/libuClibc-0.9.29.so
0406f000-04070000 rw-p 0003b000 1f:06 200        /lib/libuClibc-0.9.29.so
04070000-04075000 rw-p 04070000 00:00 0
04075000-0407f000 r-xp 00000000 1f:06 137        /lib/libgcc_s.so.1
0407f000-04086000 ---p 0407f000 00:00 0
04086000-04087000 rw-p 00009000 1f:06 137        /lib/libgcc_s.so.1
0ece0000-0ecf5000 rwxp 0ece0000 00:00 0          [stack]

Look for the “x” flag in the permissions column to find code sections. In this example, the Program Counter could theoretically reside in the ranges 0x8000-0x9000, 0x4000000-0x4005000, 0x400e000-0x4023000, 0x402c000-0x4067000, and 0x4075000-0x407f000. Matching the Program Counter and Link Register against these ranges identifies the faulty code section. We’ll see an example later.
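Since the proc entry disappears together with the process, one option is to copy the maps file from inside the fault handler, before the process terminates. The following is a minimal sketch under that assumption; rte_dump_maps is a hypothetical name, and this is not necessarily how the PCD captures the file. It relies only on open( ), read( ), write( ) and close( ), which are signal-safe.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies /proc/self/maps to out_fd. Returns the number of bytes copied,
 * or -1 on error. Uses a small stack buffer and no heap allocation. */
long rte_dump_maps(int out_fd)
{
    char chunk[512];
    long total = 0;
    ssize_t got;
    int fd = open("/proc/self/maps", O_RDONLY);

    if (fd < 0)
        return -1;

    while ((got = read(fd, chunk, sizeof(chunk))) > 0) {
        ssize_t done = 0;

        /* write() may accept fewer bytes than requested, so loop. */
        while (done < got) {
            ssize_t put = write(out_fd, chunk + done, (size_t)(got - done));
            if (put < 0) {
                close(fd);
                return -1;
            }
            done += put;
        }
        total += got;
    }

    close(fd);
    return (got < 0) ? -1 : total;
}
```

A handler could call rte_dump_maps(STDERR_FILENO) right before it terminates, which removes the need to reproduce the crash just to see the memory layout.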

In this example, we can see what the last executed instruction was, but not the cause of the problem. The signal info structure also contains the signal code, and the registers can help us figure out the root cause. How can we extract this information easily? Well, just continue reading.

How can PCD help debugging, resolving and preventing crashes?

When a program crashes due to an exception, it is terminated once the error message is displayed. This crash does not trigger any recovery action, and your system will probably become unstable or unusable. The PCD can help here in two areas:

Enhanced debugging capabilities and system recovery. Once a process registers the PCD exception handlers, they provide more information about the crash, including the Program Counter, the Link Register (return address), all the other registers, the last value of errno, and the maps file of the process, captured right before it was terminated, without the need to reboot the system. The latter helps you analyze the location of the error just by looking at the PCD’s error report. The PCD also triggers a recovery action once the crash is detected, and returns the system to a functional state. The crash information is also saved in non-volatile storage for later offline analysis. Let’s take the code from Example 2 and instrument it with the PCD exception handlers (see how easily it is done):

#include <stdio.h>
#include <pcdapi.h>

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
}

char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();

    /* Crash test */
    rte_test_ptr(ro_ptr);

    printf(ro_ptr);

    return 0;
}

Let’s configure a simple PCD rule to start and monitor a program:

RULE = TEST_SIGSEGV
START_COND = NONE
COMMAND = /usr/sbin/segv
SCHED = NICE,0
DAEMON = YES
END_COND = NONE
END_COND_TIMEOUT = -1
FAILURE_ACTION = REBOOT
ACTIVE = YES

Here is the output on the console once PCD has started this rule. Note that the selected recovery action here was “Reboot”, which is why the system reboots right after the crash. Pay attention to the pc value in the register dump:

pcd: Starting process /usr/sbin/segv (Rule TEST_SIGSEGV).
pcd: Rule TEST_SIGSEGV: Success (Process /usr/sbin/segv (204)).

**************************************************************************
**************************** Exception Caught ****************************
**************************************************************************

Signal information:

Time: Thu Jan  1 00:00:12 1970
Process name: /usr/sbin/segv
PID: 204
Fault Address: 0x00008590
Signal: Segmentation fault
Signal Code: Invalid permissions for mapped object
Last error: Success (0)
Last error (by signal): 0

ARM registers:

trap_no=0x0000000e
error_code=0x0000081f
oldmask=0x00000000
r0=0x00008590
r1=0x0ecf4ba4
r2=0x00000000
r3=0x00000052
r4=0x00010690
r5=0x00000000
r6=0x0000846c
r7=0x00008418
r8=0x00000000
r9=0x00000000
r10=0x00000000
fp=0x00000000
ip=0x00000000
sp=0x0ecf4cf0
lr=0x0000856c
pc=0x00008548
cpsr=0x40000010
fault_address=0x00008590

Maps file:

00008000-00009000 r-xp 00000000 1f:07 59         /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:07 59         /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 231        /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 231        /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 231        /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175        /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175        /lib/libticc.so
0402c000-04067000 r-xp 00000000 1f:06 200        /lib/libuClibc-0.9.29.so
04067000-0406e000 ---p 04067000 00:00 0
0406e000-0406f000 r--p 0003a000 1f:06 200        /lib/libuClibc-0.9.29.so
0406f000-04070000 rw-p 0003b000 1f:06 200        /lib/libuClibc-0.9.29.so
04070000-04075000 rw-p 04070000 00:00 0
04075000-0407f000 r-xp 00000000 1f:06 137        /lib/libgcc_s.so.1
0407f000-04086000 ---p 0407f000 00:00 0
04086000-04087000 rw-p 00009000 1f:06 137        /lib/libgcc_s.so.1
0ece0000-0ecf5000 rwxp 0ece0000 00:00 0          [stack]

**************************************************************************
pcd: Error: Process /usr/sbin/segv (204) exited unexpectedly (Rule TEST_SIGSEGV).
pcd: Terminating PCD, rebooting system...
starting pid 205, tty '': '/bin/umount /var /sys'
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Restarting system.

As we can see, the crash details provided by the PCD can help you find and resolve this issue easily and quickly. Let’s extract the file name and line number from the address using the addr2line utility mentioned earlier:

# armeb-linux-uclibceabi-addr2line -e segv -f 8548
rte_test_ptr
/home/hai/rte/segv.c:7

Now let’s go back to the objdump output for a more detailed view:

00008544 <rte_test_ptr>:
#include <stdio.h>
#include <pcdapi.h>

void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
    8544:       e3a03052        mov     r3, #82 ; 0x52
    8548:       e5c03000        strb    r3, [r0]
}
    854c:       e12fff1e        bx      lr

00008550 <main>:

char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    8550:       e92d4010        push    {r4, lr}
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();

    /* Crash test */
    rte_test_ptr(ro_ptr);
    8554:       e59f4020        ldr     r4, [pc, #32]   ; 857c <main+0x2c>
char *ro_ptr = "RT-Embedded";

int main( int argc , char *argv[] )
{
    /* Register to PCD's exception handlers */
    PCD_API_REGISTER_EXCEPTION_HANDLERS();
    8558:       e5910000        ldr     r0, [r1]
    855c:       e3a01000        mov     r1, #0  ; 0x0
    8560:       ebffffb8        bl      8448 <_init+0x30>

    /* Crash test */
    rte_test_ptr(ro_ptr);
    8564:       e5940000        ldr     r0, [r4]
    8568:       ebfffff5        bl      8544 <rte_test_ptr>
    printf(ro_ptr);
    856c:       e5940000        ldr     r0, [r4]
    8570:       ebffffb1        bl      843c <_init+0x24>

    return 0;
}
    8574:       e3a00000        mov     r0, #0  ; 0x0
    8578:       e8bd8010        pop     {r4, pc}
    857c:       00010690        .word   0x00010690

We can see that the strb instruction at address 8548 (the pc value) is the last instruction the CPU executed before the crash, and that the return address held in the Link Register (lr=0x0000856c) is the instruction right after the call to rte_test_ptr in main. In many cases, a function is called from various locations; the Link Register value tells us which specific call site was used, and is therefore also important. From the code, we can see that r4 is loaded with the word stored at address 0x857c (0x00010690), and that the value stored at that address is passed to our rte_test_ptr function. We can use objdump to look up the variable’s name by grepping for the value of the word:

# armeb-linux-uclibceabi-objdump -x segv | grep 10690
00010690 g     O .data  00000004 ro_ptr

Now we also know the variable name, although this was possible only because the variable is global and not on the stack.

Suppose the faulty function is inside a shared library and not in our program. How can we debug this?
Let’s repeat the example, and place the rte_test_ptr inside a shared library. After we compile and link, we run the program again. We’ll reexamine the crash log:

**************************************************************************
**************************** Exception Caught ****************************
**************************************************************************

Signal information:

Time: Thu Jan  1 00:01:11 1970
Process name: /usr/sbin/segv
PID: 206
Fault Address: 0x000085f4
Signal: Segmentation fault
Signal Code: Invalid permissions for mapped object
Last error: Success (0)
Last error (by signal): 0

ARM registers:

trap_no=0x0000000e
error_code=0x0000081f
oldmask=0x00000000
r0=0x000085f4
r1=0x0eb72b74
r2=0x00000000
r3=0x00000052
r4=0x04078000
r5=0x00000000
r6=0x000084ac
r7=0x0000844c
r8=0x00000000
r9=0x00000000
r10=0x00000000
fp=0x0eb72cd4
ip=0x0402c4a0
sp=0x0eb72cc0
lr=0x000085c0
pc=0x0402c4a4
cpsr=0x00000010
fault_address=0x000085f4

Maps file:

00008000-00009000 r-xp 00000000 1f:06 315        /usr/sbin/segv
00010000-00011000 rw-p 00000000 1f:06 315        /usr/sbin/segv
04000000-04005000 r-xp 00000000 1f:06 232        /lib/ld-uClibc-0.9.29.so
04005000-04007000 rw-p 04005000 00:00 0
0400c000-0400d000 r--p 00004000 1f:06 232        /lib/ld-uClibc-0.9.29.so
0400d000-0400e000 rw-p 00005000 1f:06 232        /lib/ld-uClibc-0.9.29.so
0400e000-04023000 r-xp 00000000 1f:06 175        /lib/libticc.so
04023000-0402a000 ---p 04023000 00:00 0
0402a000-0402c000 rw-p 00014000 1f:06 175        /lib/libticc.so
0402c000-0402d000 r-xp 00000000 1f:06 212        /lib/libsegv.so
0402d000-04034000 ---p 0402d000 00:00 0
04034000-04035000 rw-p 00000000 1f:06 212        /lib/libsegv.so
04035000-04070000 r-xp 00000000 1f:06 200        /lib/libuClibc-0.9.29.so
04070000-04077000 ---p 04070000 00:00 0
04077000-04078000 r--p 0003a000 1f:06 200        /lib/libuClibc-0.9.29.so
04078000-04079000 rw-p 0003b000 1f:06 200        /lib/libuClibc-0.9.29.so
04079000-0407e000 rw-p 04079000 00:00 0
0407e000-04088000 r-xp 00000000 1f:06 137        /lib/libgcc_s.so.1
04088000-0408f000 ---p 04088000 00:00 0
0408f000-04090000 rw-p 00009000 1f:06 137        /lib/libgcc_s.so.1
0eb5e000-0eb73000 rwxp 0eb5e000 00:00 0          [stack]

**************************************************************************
pcd: Error: Process /usr/sbin/segv (206) exited unexpectedly (Rule TEST_SIGSEGV).
pcd: Terminating PCD, rebooting system...
starting pid 207, tty '': '/bin/umount -l /nvram /var /sys'
The system is going down NOW!
Sent SIGTERM to all processes
Requesting system reboot
Restarting system.

Now we can see that, unlike the previous example, the Program Counter is at a high address (0x0402c4a4 according to the log). We understand that it resides outside of the program’s code, somewhere in one of the linked shared libraries. As we already saw, we can determine which library executed the bad instruction by matching the PC value against the executable address ranges in the maps file. In this example, the address is in the executable range of /lib/libsegv.so, which is the library made for the purpose of this example. To find the problem inside the library, we need to calculate the offset by subtracting the library’s base address from the PC value. In this example: 0x0402c4a4 - 0x0402c000 = 0x4a4, and this is the address of the problematic code inside the library. Let’s use the objdump utility again, but now we’ll specify the library name and not the program’s name. We can truncate the output by using the grep utility to match the address we calculated plus a few lines before and after the match (use the -A and -B options):

# armeb-linux-uclibceabi-objdump -S libsegv.so | grep 4a4 -B 4 -A 4

000004a0 <rte_test_ptr>:
void rte_test_ptr( char *ptr )
{
    *ptr = 'R';
 4a0:   e3a03052        mov     r3, #82 ; 0x52
 4a4:   e5c03000        strb    r3, [r0]
}
 4a8:   e12fff1e        bx      lr

As expected, the instruction that caused the crash (at offset 4a4) is inside the rte_test_ptr function that we moved to a shared library.
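The matching and subtraction steps can also be automated. The sketch below parses a single line in the maps format and checks whether a given Program Counter falls inside an executable mapping; rte_match_maps_line is a hypothetical helper, and, like the manual calculation above, it assumes the mapping starts at file offset 0.

```c
#include <stdio.h>
#include <string.h>

/* Parses one line of a maps file. If the line describes an executable
 * mapping that contains pc, stores the offset of pc from the start of
 * the mapping in *offset, copies the mapped file name into path (empty
 * for anonymous mappings), and returns 1. Returns 0 otherwise. */
int rte_match_maps_line(const char *line, unsigned long pc,
                        unsigned long *offset, char path[256])
{
    unsigned long start = 0, end = 0;
    char perms[5] = "";
    int fields;

    path[0] = '\0';

    /* Fields: start-end perms file_offset device inode [path] */
    fields = sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
                    &start, &end, perms, path);
    if (fields < 3)
        return 0;
    if (strchr(perms, 'x') == NULL)
        return 0; /* Not a code section. */
    if (pc < start || pc >= end)
        return 0; /* pc is outside this mapping. */

    *offset = pc - start;
    return 1;
}
```

Feeding it the libsegv.so line from the crash log together with the pc value 0x0402c4a4 yields the path /lib/libsegv.so and the offset 0x4a4, which matches the manual calculation.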

In conclusion, we now know how to:

Extract and understand the crash information provided to a fault signal handler.
Find a bad instruction inside our program and inside a shared library.

Now we have all the required information to fix this crash. It can be done manually, and it can be done using the PCD.

Memory corruption

Crashes and segmentation faults may also be the result of memory corruption. Some code may unintentionally modify a memory region it does not own, causing a mess for the rightful owner. Read here how to debug such errors.