Chapter 7-01

Please credit the source when reposting: http://blog.csdn.net/gaoxiangnumber1

Linking is the process of collecting and combining various pieces of code and data into a single file that can be loaded (copied) into memory and executed.
Linking can be performed at compile time, when the source code is translated into machine code; at load time, when the program is loaded into memory and executed by the loader; and even at run time, by application programs.
On early computer systems, linking was performed manually. On modern systems, linking is performed automatically by programs called linkers.
Linkers play a crucial role in software development because they enable separate compilation. Instead of organizing a large application as one monolithic source file, we can decompose it into smaller, more manageable modules that can be modified and compiled separately. When we change one of these modules, we simply recompile it and relink the application, without having to recompile the other files.
7.1 Compiler Drivers

Most compilation systems provide a compiler driver that invokes the language preprocessor, compiler, assembler, and linker, as needed on behalf of the user. To build the example program using the GNU compilation system, we might invoke the gcc driver by typing the following command to the shell:
unix> gcc -O2 -g -o p main.c swap.c

The driver first runs the C preprocessor (cpp), which translates the C source file main.c into an ASCII intermediate file main.i:
cpp [other arguments] main.c /tmp/main.i
Next, the driver runs the C compiler (cc1), which translates main.i into an ASCII assembly language file main.s.
cc1 /tmp/main.i main.c -O2 [other arguments] -o /tmp/main.s
Then, the driver runs the assembler (as), which translates main.s into a relocatable object file main.o:
as [other arguments] -o /tmp/main.o /tmp/main.s
The driver goes through the same process to generate swap.o. Finally, it runs the linker program ld, which combines main.o and swap.o, along with the necessary system object files, to create the executable object file p:
ld -o p [system object files and args] /tmp/main.o /tmp/swap.o
To run the executable p, we type its name on the Unix shell’s command line:
unix> ./p
The shell invokes a function in the operating system called the loader, which copies the code and data in the executable file p into memory, and then transfers control to the beginning of the program.
7.2 Static Linking
Static linkers such as the Unix ld program take a collection of relocatable object files and command-line arguments as input and generate a fully linked executable object file that can be loaded and run as output. The input relocatable object files consist of various code and data sections. Instructions are in one section, initialized global variables are in another section, and uninitialized variables are in yet another section.
To build the executable, the linker must perform two main tasks:
1.Symbol resolution. Object files define and reference symbols. The purpose of symbol resolution is to associate each symbol reference with exactly one symbol definition.
2.Relocation. Compilers and assemblers generate code and data sections that start at address 0. The linker relocates these sections by associating a memory location with each symbol definition, and then modifying all of the references to those symbols so that they point to this memory location.
Object files are merely collections of blocks of bytes. Some of these blocks contain program code, others contain program data, and others contain data structures that guide the linker and loader.
A linker concatenates blocks together, decides on run-time locations for the concatenated blocks, and modifies various locations within the code and data blocks. Linkers have minimal understanding of the target machine.
The compilers and assemblers that generate the object files have already done most of the work.
7.3 Object Files
Object files come in three forms:
1.Relocatable object file.
Contains binary code and data in a form that can be combined with other relocatable object files at compile time to create an executable object file.
2.Executable object file.
Contains binary code and data in a form that can be copied directly into memory and executed.
3.Shared object file.
A special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run time.
Compilers and assemblers generate relocatable object files and shared object files. Linkers generate executable object files. Technically, an object module is a sequence of bytes, and an object file is an object module stored on disk in a file. However, we will use these terms interchangeably.
Object file formats vary from system to system. The first Unix systems from Bell Labs used the a.out format. (To this day, executables are still referred to as a.out files.) Modern Unix systems use the Unix Executable and Linkable Format (ELF). Although our discussion will focus on ELF, the basic concepts are similar, regardless of the particular format.
7.4 Relocatable Object Files

The ELF header begins with a 16-byte sequence that describes the word size and byte ordering of the system that generated the file. The rest of the ELF header contains information that allows a linker to interpret the object file. This includes the size of the ELF header, the object file type (e.g., relocatable, executable, or shared), the machine type (e.g., IA32), the file offset of the section header table, and the size and number of entries in the section header table.
The locations and sizes of the various sections are described by the section header table, which contains a fixed-size entry for each section in the object file.
Between the ELF header and the section header table are the sections themselves. A typical ELF relocatable object file contains the following sections:
.text: The machine code of the compiled program.
.rodata: Read-only data such as the format strings in printf statements, and jump tables for switch statements.
.data: Initialized global C variables. Local C variables are maintained at run time on the stack, and do not appear in either the .data or .bss sections.
.bss: Uninitialized global C variables. This section occupies no actual space in the object file; it is merely a place holder. Object file formats distinguish between initialized and uninitialized variables for space efficiency: uninitialized variables do not have to occupy any actual disk space in the object file.
.symtab: A symbol table with information about functions and global variables that are defined and referenced in the program. Unlike the symbol table inside a compiler, the .symtab symbol table does not contain entries for local variables.
.rel.text: A list of locations in the .text section that will need to be modified when the linker combines this object file with others. In general, any instruction that calls an external function or references a global variable will need to be modified. On the other hand, instructions that call local functions do not need to be modified. Relocation information is not needed in executable object files, and is usually omitted unless the user explicitly instructs the linker to include it.
.rel.data: Relocation information for any global variables that are referenced or defined by the module. In general, any initialized global variable whose initial value is the address of a global variable or externally defined function will need to be modified.
.debug: A debugging symbol table with entries for local variables and typedefs defined in the program, global variables defined and referenced in the program, and the original C source file. It is only present if the compiler driver is invoked with the -g option.
.line: A mapping between line numbers in the original C source program and machine code instructions in the .text section. It is only present if the compiler driver is invoked with the -g option.
.strtab: A string table for the symbol tables in the .symtab and .debug sections, and for the section names in the section headers. A string table is a sequence of null-terminated character strings.
Hiding variable and function names with static
Any global variable or function declared with the static attribute is private to its own module. Any global variable or function declared without the static attribute is public and can be accessed by any other module. It is good programming practice to protect your variables and functions with the static attribute wherever possible.
7.5 Symbols and Symbol Tables
Each relocatable object module, m, has a symbol table that contains information about the symbols that are defined and referenced by m. In the context of a linker, there are three different kinds of symbols:
1.Global symbols that are defined by module m and that can be referenced by other modules. Global linker symbols correspond to nonstatic C functions and global variables that are defined without the C static attribute.
2.Global symbols that are referenced by module m but defined by some other module. Such symbols are called externals and correspond to C functions and variables that are defined in other modules.
3.Local symbols that are defined and referenced exclusively by module m. Some local linker symbols correspond to C functions and global variables that are defined with the static attribute. These symbols are visible anywhere within module m, but cannot be referenced by other modules. The sections in an object file and the name of the source file that corresponds to module m also get local symbols.
Note that local linker symbols are not the same as local program variables. The symbol table in .symtab does not contain any symbols that correspond to local nonstatic program variables. These are managed at run time on the stack and are not of interest to the linker.
Local procedure variables that are defined with the C static attribute are not managed on the stack. Instead, the compiler allocates space in .data or .bss for each definition and creates a local linker symbol in the symbol table with a unique name. For example, suppose a pair of functions in the same module, f and g, each define a static local variable named x.
In this case, the compiler allocates space for two integers in .data and exports a pair of unique local linker symbols to the assembler. For example, it might use x.1 for the definition in function f and x.2 for the definition in function g.
Symbol tables are built by assemblers, using symbols exported by the compiler into the assembly-language .s file. An ELF symbol table is contained in the .symtab section. It contains an array of entries. Figure 7.4 shows the format of each entry.

The name is a byte offset into the string table that points to the null-terminated string name of the symbol.
The value is the symbol’s address. For relocatable modules, the value is an offset from the beginning of the section where the object is defined. For executable object files, the value is an absolute run-time address.
The size is the size (in bytes) of the object.
The type is usually either data or function. The symbol table can also contain entries for the individual sections and for the path name of the original source file. So there are distinct types for these objects as well.
The binding field indicates whether the symbol is local or global.
Each symbol is associated with some section of the object file, denoted by the section field, which is an index into the section header table. There are three special pseudo sections that don’t have entries in the section header table:
1.ABS is for symbols that should not be relocated.
2.UNDEF is for undefined symbols, that is, symbols that are referenced in this object module but defined elsewhere.
3.COMMON is for uninitialized data objects that are not yet allocated. For COMMON symbols, the value field gives the alignment requirement, and size gives the minimum size.
Here are the last three entries in the symbol table for main.o, as displayed by the GNU readelf tool. The first eight entries, which are not shown, are local symbols that the linker uses internally.

In this example, we see an entry for the definition of global symbol buf, an 8-byte object located at an offset (i.e., value) of zero in the .data section. This is followed by the definition of the global symbol main, a 17-byte function located at an offset of zero in the .text section. The last entry comes from the reference for the external symbol swap. Readelf identifies each section by an integer index. Ndx=1 denotes the .text section, and Ndx=3 denotes the .data section.
Symbol table entries for swap.o:
First, we see an entry for the definition of the global symbol bufp0, which is a 4-byte initialized object starting at offset 0 in .data. The next symbol comes from the reference to the external buf symbol in the initialization code for bufp0. This is followed by the global symbol swap, a 39-byte function at an offset of zero in .text. The last entry is the global symbol bufp1, a 4-byte uninitialized data object (with a 4-byte alignment requirement) that will eventually be allocated as a .bss object when this module is linked.

7.6 Symbol Resolution
The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files.
Symbol resolution is straightforward for references to local symbols that are defined in the same module as the reference. The compiler allows only one definition of each local symbol per module. The compiler also ensures that static local variables, which get local linker symbols, have unique names.
Resolving references to global symbols is complex. When the compiler encounters a symbol (either a variable or function name) that is not defined in the current module, it assumes that it is defined in some other module, generates a linker symbol table entry, and leaves it for the linker to handle. If the linker is unable to find a definition for the referenced symbol in any of its input modules, it prints an error message and terminates.
For example, suppose we try to compile and link a source file, linkerror.c, whose main function calls a function foo that is never defined in any input module. The compiler runs without errors, but the linker terminates when it cannot resolve the reference to foo:
unix> gcc -Wall -O2 -o linkerror linkerror.c
/tmp/ccSz5uti.o: In function ‘main’:
/tmp/ccSz5uti.o(.text+0x7): undefined reference to ‘foo’
collect2: ld returned 1 exit status
When the same symbol is defined by multiple object files, the linker must either flag an error or somehow choose one of the definitions and discard the rest. The approach adopted by Unix systems involves cooperation between the compiler, assembler, and linker, and can trap the unwary programmer with subtle bugs.
Both C++ and Java allow overloaded methods that have the same name in the source code but different parameter lists. Overloaded functions in C++ and Java work because the compiler encodes each unique method and parameter list combination into a unique name for the linker. This encoding process is called mangling, and the inverse process demangling.
C++ and Java use compatible mangling schemes. A mangled class name consists of the integer number of characters in the name followed by the original name. For example, the class Foo is encoded as 3Foo. A method is encoded as the original method name, followed by __, followed by the mangled class name, followed by single letter encodings of each argument. For example, Foo::bar(int, long) is encoded as bar__3Fooil. Similar schemes are used to mangle global variable and template names.
7.6.1 How Linkers Resolve Multiply Defined Global Symbols
At compile time, the compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information in the symbol table of the relocatable object file. Functions and initialized global variables get strong symbols. Uninitialized global variables get weak symbols.
Unix linkers use the following rules for dealing with multiply defined symbols:
Rule 1: Multiple strong symbols are not allowed.
Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol.
Rule 3: Given multiple weak symbols, choose any of the weak symbols.
For example, the linker will generate an error message when two modules each define x as an initialized global variable, because the strong symbol x is then defined twice (rule 1).
However, if x is uninitialized in one module, then the linker will quietly choose the strong symbol defined in the other (rule 2). Suppose one module initializes x to 15213, another declares x without an initializer, and a function f assigns 15212 to x.
At run time, function f changes the value of x from 15213 to 15212. Notice that the linker normally gives no indication that it has detected multiple definitions of x:
unix> gcc -o foobar3 foo3.c bar3.c
unix> ./foobar3
x = 15212
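The same thing can happen if there are two weak definitions of x (rule 3).
Rules 2 and 3 can introduce run-time bugs that are hard to diagnose, especially when the duplicate definitions have different types. Consider an example in which foo5.c defines x and y as adjacent int globals, while bar5.c defines x as a double and assigns to it.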
On an IA32/Linux machine, doubles are 8 bytes and ints are 4 bytes. Thus, the assignment x = -0.0 in line 6 of bar5.c will overwrite the memory locations for x and y (lines 5 and 6 in foo5.c) with the double-precision floating-point representation of negative zero!
linux> gcc -o foobar5 foo5.c bar5.c
linux> ./foobar5
x = 0x0 y = 0x80000000
