The 4 Stages of C Compilation
C is a compiled language, which means its source code is translated into machine code ahead of time, at “compile time”, rather than being interpreted while the program runs.
In summary, compilation can be split into 4 stages:
- preprocessing
- compilation
- assembly
- linking
To demonstrate these steps I’m going to use the gcc (GNU Compiler Collection) command. Every compiler goes through these stages, though the details of each one may vary slightly from compiler to compiler.
This post walks through each of the 4 stages by compiling the simplest “Hello World” C program:
/*
 * hello.c
 */
#include <stdio.h>
#define HI "Hello, World"
// This comment will be stripped after preprocessing
int main()
{
    printf("%s\n", HI);
}
1. Preprocessing
C provides certain language facilities by means of a preprocessor, which is conceptually a separate first step in compilation. In this stage, lines starting with a # character are interpreted by the preprocessor as preprocessor commands.
The most frequently used features are:
- #include, to include the contents of a file during compilation, and
- #define, to replace a token by an arbitrary sequence of characters.
Other features include conditional compilation and macros with arguments.
Before interpreting commands, the preprocessor does some initial processing. This includes joining continued lines (lines ending with a \) and stripping comments.
To perform this step using gcc, pass the -E option and use -o to write the result to the file hello.i:
gcc -E hello.c -o hello.i
Contents of hello.i:
// bunch of lines omitted for brevity
extern int __vsnprintf_chk (char * restrict, size_t, int, size_t,
const char * restrict, va_list);
# 408 "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/stdio.h" 2 3 4
# 19 "hello.c" 2
int main()
{
    printf("%s\n", "Hello, World");
}
As you can see, the preprocessor simply copy-pastes the contents of the file named by #include and replaces the #define macro HI with "Hello, World".
2. Compilation
The second stage of compilation is, confusingly enough, also called compilation. Here the preprocessed code is translated into human-readable assembly instructions.
Pass the -S option to perform this step:
gcc -S hello.i -o hello.s
Snippet of hello.s:
_main:                                  ## @main
    .cfi_startproc
## %bb.0:
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register %rbp
    subq $16, %rsp
    leaq L_.str(%rip), %rdi
    leaq L_.str.1(%rip), %rsi
    movb $0, %al
    callq _printf
    xorl %ecx, %ecx
    movl %eax, -4(%rbp)                 ## 4-byte Spill
    movl %ecx, %eax
    addq $16, %rsp
    popq %rbp
    retq
    .cfi_endproc
                                        ## -- End function
3. Assembly
During the third stage, the assembler translates the assembly instructions into object code (machine instructions), which is simply a sequence of 0s and 1s such as 0010111010010.
Pass the -c option to gcc:
gcc -c hello.s -o hello.o
Running the above command generates the object file hello.o. Its contents are in a binary format and can be inspected using hexdump:
hexdump hello.o
It will look like:
0000000 cf fa ed fe 07 00 00 01 03 00 00 00 01 00 00 00
0000010 04 00 00 00 08 02 00 00 00 20 00 00 00 00 00 00
0000020 19 00 00 00 88 01 00 00 00 00 00 00 00 00 00 00
0000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000040 a0 00 00 00 00 00 00 00 28 02 00 00 00 00 00 00
0000050 a0 00 00 00 00 00 00 00 07 00 00 00 07 00 00 00
0000060 04 00 00 00 00 00 00 00 5f 5f 74 65 78 74 00 00
0000070 00 00 00 00 00 00 00 00 5f 5f 54 45 58 54 00 00
...
4. Linking
The object file is composed of machine instructions that the processor understands but some pieces of the program are out of order or missing. To produce an executable program, the existing pieces have to be rearranged and the missing ones filled in. This process is called linking.
The linker will arrange the pieces of object code so that functions in some pieces can successfully call functions in other ones. It will also add pieces containing the instructions for library functions used by the program.
In the case of the “Hello, World” program, the linker will add the object code for the printf function.
The result of this stage is the final executable program. When run without the -o option, gcc names this file a.out. To name the program something else, pass the -o option:
gcc hello.o -o helloworld # link object files into an executable file
./helloworld # run the program
Last but not least
As you have learned, compilation can be explicitly separated into 4 stages. In practice, though, you usually compile source files directly into the final executable program by running gcc with no options other than -o to name the output. gcc will then automatically perform all 4 of the stages above in one go:
gcc hello.c -o helloworld
./helloworld