Symbol resolution & relocation

In this article I'll explore the two basic concepts behind static linking: symbol resolution and relocation.

Computer Systems - A Programmer's Perspective | Chapter 7: Linking

7.2: Static Linking
7.3: Object Files
7.5: Symbols and Symbol Tables
7.6: Symbol Resolution
7.7: Relocation

Static Linking

Static linkers such as the Linux ld program take as input a collection of relocatable object files and command-line arguments and generate as output a fully linked executable object file that can be loaded and run. The input relocatable object files consist of various code and data sections, where each section is a contiguous sequence of bytes. Instructions are in one section (e.g., .text), initialized global variables are in another section (e.g., .data), and uninitialized variables are in yet another section (e.g., .bss).

To build the executable, the linker must perform two main tasks:

Symbol resolution. Object files define and reference symbols, where each symbol corresponds to a function, a global variable, or a static variable (i.e., any C variable declared with the static attribute). The purpose of symbol resolution is to associate each symbol reference with exactly one symbol definition.
Relocation. Compilers and assemblers generate code and data sections that start at address 0. The linker relocates these sections by associating a memory location with each symbol definition, and then modifying all of the references to those symbols so that they point to this memory location. The linker blindly performs these relocations using detailed instructions, generated by the assembler, called relocation entries.

The sections that follow describe these tasks in more detail. As you read, keep in mind some basic facts about linkers: Object files are merely collections of blocks of bytes. Some of these blocks contain program code, others contain program data, and others contain data structures that guide the linker and loader. A linker concatenates blocks together, decides on run-time locations for the concatenated blocks, and modifies various locations within the code and data blocks. Linkers have minimal understanding of the target machine. The compilers and assemblers that generate the object files have already done most of the work.

Object Files

Object files come in three forms:

Relocatable object file. Contains binary code and data in a form that can be combined with other relocatable object files at compile time to create an executable object file.
Executable object file. Contains binary code and data in a form that can be copied directly into memory and executed.
Shared object file. A special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run time.

Compilers and assemblers generate relocatable object files. Linkers generate executable object files and shared object files. Technically, an object module is a sequence of bytes, and an object file is an object module stored on disk in a file. However, we will use these terms interchangeably.

Object files are organized according to specific object file formats, which vary from system to system. The first Unix systems from Bell Labs used the a.out format. (To this day, executables are still referred to as a.out files.) Windows uses the Portable Executable (PE) format. Mac OS X uses the Mach-O format. Modern x86-64 Linux and Unix systems use the Executable and Linkable Format (ELF). Although our discussion will focus on ELF, the basic concepts are similar, regardless of the particular format.

Type	Generator	Loadable	Suffixes
Relocatable	assembler	No	`.o`
Executable	linker	Yes	(none by convention)
Shared	linker	Yes (by the dynamic linker)	`.so`

Technically, an object file on Linux can have any suffix, or none at all; the usual suffixes exist only to make the file types easy to tell apart. The actual type is recorded in a field of the ELF header (e_type in Elf32_Ehdr/Elf64_Ehdr), and the system determines the type of an object file from the value of this field, not from the file extension.
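As a rough illustration of the point about e_type, the sketch below reads the first bytes of a file and reports the type stored in the ELF header. It assumes a Linux system with the glibc `<elf.h>` header and a file whose byte order matches the host; the function names `elf_type_name` and `read_elf_type` are made up for this example.

```c
#include <elf.h>      /* EI_NIDENT, ELFMAG, SELFMAG, ET_* (Linux/glibc header) */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Map an e_type value to a human-readable name. */
const char *elf_type_name(unsigned e_type)
{
    switch (e_type) {
    case ET_REL:  return "relocatable";
    case ET_EXEC: return "executable";
    case ET_DYN:  return "shared object";  /* also used by position-independent executables */
    case ET_CORE: return "core";
    default:      return "unknown";
    }
}

/* Return the e_type of the file at `path`, or -1 if the file cannot be
 * read or does not start with the ELF magic bytes. e_type sits right
 * after the 16-byte e_ident array in both Elf32_Ehdr and Elf64_Ehdr.
 * Assumption: the file uses the same byte order as the host. */
int read_elf_type(const char *path)
{
    unsigned char ident[EI_NIDENT];
    uint16_t e_type;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    if (fread(ident, 1, EI_NIDENT, fp) != EI_NIDENT ||
        fread(&e_type, sizeof e_type, 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    if (memcmp(ident, ELFMAG, SELFMAG) != 0)
        return -1;                         /* not an ELF file */
    return e_type;
}
```

Renaming a `.o` file to any other suffix should not change what `read_elf_type` returns for it, since only the header field is consulted.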

The generator column in the table above names the tool that directly produces the object file. For example, the generator of an executable object file is the linker; the compiler and assembler also take part in the process, but only indirectly. Likewise, in practice it is the linker that creates shared object files, so the table lists the linker as their generator rather than the compiler or assembler.

Symbols

Each relocatable object module, m, has a symbol table that contains information about the symbols that are defined and referenced by m. In the context of a linker, there are three different kinds of symbols:

Global symbols that are defined by module m and that can be referenced by other modules. Global linker symbols correspond to nonstatic C functions and global variables.
Global symbols that are referenced by module m but defined by some other module. Such symbols are called externals and correspond to nonstatic C functions and global variables that are defined in other modules.
Local symbols that are defined and referenced exclusively by module m. These correspond to static C functions and global variables that are defined with the static attribute. These symbols are visible anywhere within module m, but cannot be referenced by other modules.
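A small module can make the three kinds of symbols concrete. This is only a sketch; every identifier in it (`request_count`, `error_count`, `note_error`, `handle_request`) is hypothetical, and `puts` stands in for a symbol defined in another module (libc).

```c
#include <stdio.h>   /* declares puts; this module references it but does not define it */

/* Global symbol: defined here and visible to other modules. */
int request_count = 0;

/* Local symbol: the static attribute hides it from other modules. */
static int error_count = 0;

/* Local symbol: a static function can only be called from this module. */
static void note_error(void)
{
    error_count++;
}

/* Global symbol: a nonstatic function defined by this module. */
int handle_request(int ok)
{
    request_count++;
    if (!ok) {
        note_error();
        puts("request failed");   /* external: defined by some other module */
    }
    return error_count;
}
```

Compiling this with `gcc -c` at the default optimization level and inspecting the result with `readelf -s` should show `request_count` and `handle_request` with GLOBAL binding, `error_count` and `note_error` with LOCAL binding, and `puts` as an undefined (UND) entry left for the linker.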

local linker symbols vs. local program variables

It is important to realize that local linker symbols are not the same as local program variables. The symbol table in .symtab does not contain any symbols that correspond to local nonstatic program variables. These are managed at run time on the stack and are not of interest to the linker.

Interestingly, local procedure variables that are defined with the C static attribute are not managed on the stack. Instead, the compiler allocates space in .data or .bss for each definition and creates a local linker symbol in the symbol table with a unique name.
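To see why unique names are needed, consider two functions that each declare a static local variable called `x`. The function names below are hypothetical, and the exact names the compiler gives the two linker symbols (something like `x.1` and `x.2`) depend on the compiler.

```c
/* Each function owns a separate static variable named x. Both outlive
 * the calls that use them, so neither can live on the stack. */
int next_even(void)
{
    static int x = 0;   /* one object, kept across calls */
    x += 2;
    return x;
}

int next_odd(void)
{
    static int x = 1;   /* a different object with the same source-level name */
    x += 2;
    return x;
}
```

Because both variables share one source-level name within one module, the compiler has to emit two distinct local linker symbols so that the symbol table can tell them apart.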

Symbol table

Learning Linux Binary Analysis | Chapter 2: The ELF Binary Format - ELF symbols

Symbols are symbolic references to some type of data or code, such as a global variable or function. For instance, the puts function has a symbol entry that points to it in the dynamic symbol table .dynsym. Most shared libraries and dynamically linked executables contain two symbol tables, which show up in readelf -S output as two sections: .dynsym and .symtab.

The .dynsym section contains global symbols that reference symbols from an external source, such as libc functions like puts, whereas .symtab contains all of the symbols in .dynsym plus the executable's own symbols, such as global variables and local functions that you have defined in your code. So .symtab contains all of the symbols, whereas .dynsym contains just the dynamic/global symbols.

So the question is: why have two symbol tables if .symtab already contains everything that's in .dynsym? If you check the readelf -S output of an executable, you will see that some sections are marked A (ALLOC), WA (WRITE/ALLOC), or AX (ALLOC/EXEC). If you look at .dynsym, you will see that it is marked A (ALLOC), whereas .symtab has no flags.

ALLOC means that the section will be allocated at runtime and loaded into memory, and .symtab is not loaded into memory because it is not necessary for runtime.

The .dynsym contains symbols that can only be resolved at runtime, and therefore they are the only symbols needed at runtime by the dynamic linker. So, while the .dynsym symbol table is necessary for the execution of dynamically linked executables, the .symtab symbol table exists only for debugging and linking purposes and is often stripped (removed) from production binaries to save space.

Section name	Contents	Loaded into memory?	Strippable?
.symtab	all symbols	No	Yes
.dynsym	dynamic linking symbols	Yes	No
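The distinction in the table comes down to one flag bit in each section header. The sketch below expresses that check with the `Elf64_Shdr` type and the `SHT_*`/`SHF_ALLOC` constants from the Linux `<elf.h>` header; the three helper functions are hypothetical names chosen for this example, and reading the section headers out of a file is left out.

```c
#include <elf.h>      /* Elf64_Shdr, SHT_SYMTAB, SHT_DYNSYM, SHF_ALLOC (Linux/glibc header) */
#include <stdbool.h>

/* True if the section header describes a symbol table (.symtab or .dynsym). */
bool is_symbol_table(const Elf64_Shdr *sh)
{
    return sh->sh_type == SHT_SYMTAB || sh->sh_type == SHT_DYNSYM;
}

/* True if the section is loaded into memory at run time
 * (the "A" flag in readelf -S output). */
bool is_loaded_at_runtime(const Elf64_Shdr *sh)
{
    return (sh->sh_flags & SHF_ALLOC) != 0;
}

/* A symbol table that is never loaded is not needed for execution,
 * so removing it does not affect how the program runs. */
bool is_strippable_symbol_table(const Elf64_Shdr *sh)
{
    return is_symbol_table(sh) && !is_loaded_at_runtime(sh);
}
```

Applied to the headers of a typical dynamically linked executable, the last helper would be expected to accept .symtab and reject .dynsym.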

Symbol resolution

Most modern programs are not fully isolated units but rather depend on functions imported from system and other libraries. The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files. Symbol resolution is straightforward for references to local symbols that are defined in the same module as the reference. The compiler allows only one definition of each local symbol per module. The compiler also ensures that static local variables, which get local linker symbols, have unique names.

Resolving references to global symbols, however, is trickier. When the compiler encounters a symbol (either a variable or function name) that is not defined in the current module, it assumes that it is defined in some other module, generates a linker symbol table entry, and leaves it for the linker to handle. If the linker is unable to find a definition for the referenced symbol in any of its input modules, it prints an (often cryptic) error message and terminates.

Symbol resolution for global symbols is also tricky because multiple object modules might define global symbols with the same name. In this case, the linker must either flag an error or somehow choose one of the definitions and discard the rest. The approach adopted by Linux systems involves cooperation between the compiler, assembler, and linker and can introduce some baffling bugs to the unwary programmer.
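The text above does not spell out how one definition is chosen. As a toy model, the sketch below assumes the strong/weak scheme that CS:APP describes for Linux linkers: two strong definitions of the same name are an error, a strong definition beats weak ones, and among weak definitions any one may be picked (here, the first). The types, the `resolve` function, and the module names are all hypothetical; a real linker works on symbol table entries, not on a flat array like this.

```c
#include <stddef.h>
#include <string.h>

enum strength { WEAK, STRONG };

struct definition {
    const char   *name;      /* symbol name */
    const char   *module;    /* hypothetical name of the defining object file */
    enum strength strength;
};

enum resolve_status { RESOLVED, UNDEFINED_REFERENCE, MULTIPLE_DEFINITION };

/* Resolve a reference to `name` against all definitions in `defs`.
 * On RESOLVED, *chosen points at the selected definition. */
enum resolve_status resolve(const char *name,
                            const struct definition *defs, size_t ndefs,
                            const struct definition **chosen)
{
    const struct definition *strong = NULL;
    const struct definition *weak = NULL;

    for (size_t i = 0; i < ndefs; i++) {
        if (strcmp(defs[i].name, name) != 0)
            continue;
        if (defs[i].strength == STRONG) {
            if (strong != NULL)
                return MULTIPLE_DEFINITION;  /* two strong definitions: error */
            strong = &defs[i];
        } else if (weak == NULL) {
            weak = &defs[i];                 /* several weak ones: keep the first */
        }
    }

    if (strong == NULL && weak == NULL)
        return UNDEFINED_REFERENCE;          /* no module defines the symbol */

    *chosen = (strong != NULL) ? strong : weak;  /* strong beats weak */
    return RESOLVED;
}
```

The silent "strong beats weak" case is where the baffling bugs come from: two modules can end up sharing one variable without either author intending it, and nothing is reported at link time.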

Symbol relocation

Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. In other words, relocatable files must have information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process's program image.

Relocations fall into three broad categories:

Static relocations, which update pointers and dynamically rewrite instructions inside the program binary if the program has to be loaded at a nondefault address.
Dynamic relocations, which reference external symbols in a shared library dependency.
Thread-local relocations, which store the offset into the thread-local storage area for each thread that a given thread-local variable will use.

For now, our focus is on static relocations.

Once the linker has completed the symbol resolution step, it has associated each symbol reference in the code with exactly one symbol definition (i.e., a symbol table entry in one of its input object modules). At this point, the linker knows the exact sizes of the code and data sections in its input object modules. It is now ready to begin the relocation step, where it merges the input modules and assigns run-time addresses to each symbol. Relocation consists of two steps:

Relocating sections and symbol definitions. In this step, the linker merges all sections of the same type into a new aggregate section of the same type. For example, the .data sections from the input modules are all merged into one section that will become the .data section for the output executable object file. The linker then assigns run-time memory addresses to the new aggregate sections, to each section defined by the input modules, and to each symbol defined by the input modules. When this step is complete, each instruction and global variable in the program has a unique run-time memory address.
Relocating symbol references within sections. In this step, the linker modifies every symbol reference in the bodies of the code and data sections so that they point to the correct run-time addresses. To perform this step, the linker relies on data structures in the relocatable object modules known as relocation entries, which we describe next.
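The first step can be pictured as laying out same-typed input sections one after another and recording where each one lands. The following is a simplified sketch, not how ld is implemented: the struct, the function names, and the addresses used with it are illustrative assumptions, and real linkers follow a linker script and handle many more constraints.

```c
#include <stddef.h>
#include <stdint.h>

struct input_section {
    const char *module;   /* hypothetical name of the input object file */
    uint64_t    size;     /* size of the section in bytes */
    uint64_t    align;    /* required alignment, assumed to be a power of two */
    uint64_t    addr;     /* output: run-time address assigned by layout_sections */
};

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Merge n input sections of the same type into one aggregate section
 * that starts at `base`. Each input section receives its own run-time
 * address; the return value is the end address of the aggregate. */
uint64_t layout_sections(struct input_section *secs, size_t n, uint64_t base)
{
    uint64_t cursor = base;

    for (size_t i = 0; i < n; i++) {
        cursor = align_up(cursor, secs[i].align);
        secs[i].addr = cursor;
        cursor += secs[i].size;
    }
    return cursor;
}

/* Once a section has an address, a symbol defined in it is located at
 * that address plus the symbol's offset within the section. */
uint64_t symbol_address(const struct input_section *sec, uint64_t offset)
{
    return sec->addr + offset;
}
```

For instance, laying out two .text sections of 0x18 and 0x45 bytes at a base of 0x4004d0 with 8-byte alignment places the second one at 0x4004e8, and a function at offset 0 of that section gets the same address.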

When an assembler generates an object module, it does not know where the code and data will ultimately be stored in memory. Nor does it know the locations of any externally defined functions or global variables that are referenced by the module. So whenever the assembler encounters a reference to an object whose ultimate location is unknown, it generates a relocation entry that tells the linker how to modify the reference when it merges the object file into an executable. Relocation entries for code are placed in .rel.text (.rela.plt for DYN objects, i.e., files whose e_type is ET_DYN). Relocation entries for data are placed in .rel.data (.rela.dyn for DYN objects).
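To make the role of a relocation entry concrete, here is a minimal sketch of how a linker might patch a 4-byte reference. It models two common x86-64 cases, a PC-relative reference (value = symbol + addend - address of the reference) and an absolute one (value = symbol + addend). The struct is a simplification invented for this example: a real Elf64_Rela entry stores a symbol index and a type code in r_info rather than a resolved address.

```c
#include <stdint.h>
#include <string.h>

/* Modeled loosely on R_X86_64_PC32 and R_X86_64_32. */
enum reloc_type { RELOC_PC32, RELOC_ABS32 };

struct reloc_entry {
    uint64_t        offset;       /* where the reference sits inside the section */
    enum reloc_type type;         /* how to compute the patched value */
    uint64_t        symbol_addr;  /* run-time address of the resolved symbol */
    int64_t         addend;       /* constant folded into the computation */
};

/* Patch one 4-byte reference in `section`, given the run-time address
 * at which the section will be loaded. */
void apply_relocation(uint8_t *section, uint64_t section_addr,
                      const struct reloc_entry *r)
{
    uint64_t ref_addr = section_addr + r->offset;  /* run-time address of the reference */
    uint32_t value;

    if (r->type == RELOC_PC32)
        value = (uint32_t)(r->symbol_addr + (uint64_t)r->addend - ref_addr);
    else
        value = (uint32_t)(r->symbol_addr + (uint64_t)r->addend);

    /* Stored in host byte order, which is little-endian on x86-64. */
    memcpy(section + r->offset, &value, sizeof value);
}
```

With illustrative numbers: a call whose 4-byte operand sits at offset 0xf of a section loaded at 0x4004d0, targeting a function at 0x4004e8 with an addend of -4, is patched to 0x4004e8 - 4 - 0x4004df = 0x5. At run time the CPU adds that displacement to the address of the next instruction (0x4004e3) and arrives at 0x4004e8.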

References

Arm Assembly Internals and Reverse Engineering | Chapter 2 ELF File Format Internals

The Dynamic Section and Dynamic Loading - Program Relocations(Static, Dynamic, GOT, PLT).

Practical Binary Analysis | Chapter 2: The ELF Format

2.3.4 Lazy Binding and the .plt, .got, and .got.plt Sections
2.3.5 The .rel.* and .rela.* Sections
2.3.6 The .dynamic Section