Compilation Process

Compilation is the process of converting source code into object code. It is carried out by the compiler, which checks the source code for syntactic and structural errors and produces object code only if the source is error-free.

Last update: 2022-06-04


A program goes through the following steps before it becomes an executable:

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking


Source code used in this guide:

compilation.zip

mylib.h
// declaration
int min(int a, int b);
mylib.c
// implementation
int min(int a, int b) {
    return (a < b) ? a : b;
}
header.h
// to look for min() function
#include "mylib.h"

// macros
#define SPEED_MAX   10  /* comment */
#define SPEED_INC   1   /* will be removed */
#define SPEED_UP(x) min((x) + SPEED_INC, SPEED_MAX)
source.c
#include <stdbool.h>
#include "header.h"

#ifndef SPEED_INIT
#define SPEED_INIT 0
#endif

int spd = SPEED_INIT;

void main() {
    while(true) {
        spd = SPEED_UP(spd);
    }
}

Overview of compilation in this example

Preprocessing

The preprocessor performs the following tasks:

  1. Expand included files
  2. Substitute macros
  3. Remove disabled code and comments

The preprocessor works on one source file at a time. It also inserts special markers (linemarkers) that tell the compiler which file and line each piece of text came from, so that the compiler can produce sensible error messages.

Some errors can be produced at this stage with clever use of the #if and #error directives.


Find include files:

gcc -M source.c
/usr/include/stdc-predef.h /usr/lib/gcc/x86_64-linux-gnu/7/include/stdbool.h header.h mylib.h

stdc-predef.h is included implicitly by GCC before any other file and predefines a few macros that describe the C environment.


Output of Preprocessor:

  1. The content of mylib.h is copied into header.h, and the resulting content of header.h is then copied into source.c. The content of stdbool.h is copied into source.c as well.

  2. Macros are expanded to their final definitions. Macros defined on the command line (with -D) are added to the source code in this step. If SPEED_INIT is not defined on the command line, the #ifndef branch is taken and SPEED_INIT is defined as 0.

  3. All comments are removed.

gcc -E source.c -DSPEED_INIT=5
# 1 "source.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "source.c"
# 1 "/usr/lib/gcc/x86_64-linux-gnu/7/include/stdbool.h" 1 3 4
# 2 "source.c" 2
# 1 "header.h" 1

# 1 "mylib.h" 1

int min(int a, int b);
# 3 "header.h" 2
# 3 "source.c" 2

int spd = 5;

void main() {
    while(
# 7 "source.c" 3 4
         1
# 7 "source.c"
             ) {
        spd = min((spd) + 1, 10);
    }
}


To see the macros used in source.c, run with the -E -dU options:

With SPEED_INIT defined on the command line (-DSPEED_INIT=5):

gcc -E -dU source.c -DSPEED_INIT=5
# 3 "source.c" 2

int spd = 5;
#define SPEED_INIT 5

With SPEED_INIT not defined on the command line:

gcc -E -dU source.c
# 3 "source.c" 2
#undef SPEED_INIT
int spd = 0;
#define SPEED_INIT 0

Compilation

The compilation step is performed on each output of the preprocessor. The compiler parses the pure source code (now without any preprocessor directives) and converts it into assembly code.

The generated assembly contains directives that define the object spd and the function main(), and a call to the function min(). Note that there is no implementation of the function min() here.

gcc -S source.c -DSPEED_INIT=5
        .file   "source.c"
        .text
        .globl  spd                    # symbol spd
        .data
        .align 4
        .type   spd, @object           # is an object
        .size   spd, 4                 # size of int = 4
spd:                                   # definition of spd
        .long   5                      # initial value is 5
        .text
        .globl  main                   # symbol main
        .type   main, @function        # is a function
main:                                  # definition of main
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
.L2:
        movl    spd(%rip), %eax        # load spd to register
        addl    $1, %eax               # add 1 to spd in the register
        movl    $10, %esi              # load 10 to register
        movl    %eax, %edi             # load calculated value of (spd+1)
        call    min@PLT                # call to min()
        movl    %eax, spd(%rip)        # save value back to spd
        jmp     .L2                    # loop back to L2 (while(true))
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"
        .section        .note.GNU-stack,"",@progbits

Assembly

The assembler creates an object file that contains machine code in a structured format (ELF, COFF, etc.). This object file contains the compiled code (in binary form) of the symbols defined in the input. Symbols in object files are referred to by name.

Object files can refer to symbols that are not defined. This is the case when you use a declaration, and don’t provide a definition for it.

All symbols and their definitions are listed, but they are not yet assigned final addresses in memory. In other words, an object file does not say where a symbol will finally be located.

The produced object files can be put in special archives called static libraries, for easier reusing later on.

Note that “regular” compiler errors, such as syntax errors or type errors, are reported earlier, during the compilation step; by the time the assembler runs, the source code has already been checked.

Object files are usually saved after this point. This is very useful because each source file can be compiled separately, so you don’t need to recompile everything when only a single file changes.

Try this command:

gcc -c source.c mylib.c -DSPEED_INIT=5

You cannot read the object files with a normal text editor anymore, because their content is binary. Use the objdump tool instead.


Symbol Table:

objdump -t source.o

You can see that the spd object is located in the .data section, the main function is in the .text section, and the function min is undefined (*UND*). None of them has been assigned a final address yet.

source.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000  source.c
0000000000000000 l    d  .text  0000000000000000  .text
0000000000000000 l    d  .data  0000000000000000  .data
0000000000000000 l    d  .bss   0000000000000000  .bss
0000000000000000 g     O .data  0000000000000004  spd
0000000000000000 g     F .text  0000000000000021  main
0000000000000000         *UND*  0000000000000000  _GLOBAL_OFFSET_TABLE_
0000000000000000         *UND*  0000000000000000  min

Disassembly:

objdump -D source.o
The operands that will hold the addresses of spd and min are still zero placeholders (00 00 00 00):

source.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                   push   %rbp
   1:   48 89 e5             mov    %rsp,%rbp
   4:   8b 05 00 00 00 00    mov    0x0(%rip),%eax  # load spd
   a:   83 c0 01             add    $0x1,%eax       # add 1 to spd
   d:   be 0a 00 00 00       mov    $0xa,%esi       # load 10
  12:   89 c7                mov    %eax,%edi       # load (spd+1)
  14:   e8 00 00 00 00       callq  19 <main+0x19>  # call to min()
  19:   89 05 00 00 00 00    mov    %eax,0x0(%rip)  # save to spd
  1f:   eb e3                jmp    4 <main+0x4>    # loop back

Disassembly of section .data:

0000000000000000 <spd>:
   0:   05                   .byte 0x5              # spd = 5
   1:   00 00                 add    %al,(%rax)

Linking

The linker produces the final output from the object files that the previous steps produced. This output can be either an executable or a shared (dynamic) library; despite the similar name, shared libraries have little in common with the static libraries mentioned earlier.

It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, you need to tell the linker about them.

At this stage the most common errors are missing definitions or duplicate definitions. The former means that either the definitions don’t exist (i.e. they are not written), or that the object files or libraries where they reside were not given to the linker. The latter is obvious: the same symbol was defined in two different object files or libraries.

Run this command:

gcc source.o mylib.o

The output binary (a.out by default) can also be inspected with objdump.


Symbol Table:

objdump -t a.out

You will see that addresses are now assigned to the spd object and to the main and min functions:

a.out:     file format elf64-x86-64

SYMBOL TABLE:
...
0000000000000000 l    df *ABS*  0000000000000000  source.c
0000000000000000 l    df *ABS*  0000000000000000  mylib.c
0000000000000000 l    df *ABS*  0000000000000000  crtstuff.c
0000000000201010 g     O .data  0000000000000004  spd
000000000000061b g     F .text  0000000000000016  min
00000000000005fa g     F .text  0000000000000021  main
...

Disassembly:

objdump -D a.out

All functions and objects now have assigned addresses, so the call to min() is completed with the address of the min() function.

00000000000005fa <main>:
 5fa:   55                   push   %rbp
 5fb:   48 89 e5             mov    %rsp,%rbp
 5fe:   8b 05 0c 0a 20 00    mov    0x200a0c(%rip),%eax  # load spd
 604:   83 c0 01             add    $0x1,%eax            # add 1 to spd
 607:   be 0a 00 00 00       mov    $0xa,%esi            # load 10
 60c:   89 c7                mov    %eax,%edi            # load (spd+1)
 60e:   e8 08 00 00 00       callq  61b <min>            # call min(): ok
 613:   89 05 f7 09 20 00    mov    %eax,0x2009f7(%rip)  # save to spd
 619:   eb e3                jmp    5fe <main+0x4>       # loop back

000000000000061b <min>:
 61b:   55                   push   %rbp
 61c:   48 89 e5             mov    %rsp,%rbp
 61f:   89 7d fc             mov    %edi,-0x4(%rbp)      # store a (spd+1)
 622:   89 75 f8             mov    %esi,-0x8(%rbp)      # store b (10)
 625:   8b 45 fc             mov    -0x4(%rbp),%eax      # load a
 628:   39 45 f8             cmp    %eax,-0x8(%rbp)      # compare b with a
 62b:   0f 4e 45 f8          cmovle -0x8(%rbp),%eax      # if b <= a, use b
 62f:   5d                   pop    %rbp
 630:   c3                   retq
 631:   66 2e 0f 1f 84 00 00 nopw   %cs:0x0(%rax,%rax,1)
 638:   00 00 00 
 63b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000201010 <spd>:
 201010:    05               .byte 0x5
 201011:    00 00            add    %al,(%rax)

Exercise

The guide above shows a case of static linking, where mylib.o is linked with source.o.

How about the dynamic linking case?

Consider the simple program below: how is the printf() function linked?

main.c
#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Further reading

The output executable file is written in the Executable and Linkable Format (ELF), a common standard file format for executable files, object code, shared libraries, and core dumps.

By design, the ELF format is flexible, extensible, and cross-platform. For instance, it supports different endiannesses and address sizes, so it does not exclude any particular central processing unit (CPU) or instruction set architecture. This has allowed it to be adopted by many operating systems on different hardware platforms.

Use objdump to inspect an executable file, and compare the result with the ELF format.
