The basics of compilers and a bit of their history

Read the following and the linked pages.

The history of compiling

The first real multi-purpose, the ENIAC, was programmed by flipping switches and moving wires. Computers have come a long way since then. They can now read programs directly off of a hard disk or other media and those programs can be written and compiled on the machine itself. Much of of this is thanks to the work of Eckert, Mauchley, and von Neumann. This idea of writing an extremely general computer and simply storing the program in memory is called the von Neumann architecture.

The first method of writing programs for these general-purpose machines was in assembly. An assembly language simply associates a human-readable name to each of the computer-specific instructions. While not as convenient as many modern languages, this is significantly easier than trying to work with the machine codes directly. Also, because the correspondence between assembly instructions and machine instructions is usually one-to-one, assembling a program (converting from assembly to the computer’s binary instruction format) is a very simple process and can be done by hand if needed.

The 1950’s saw the birth of the “modern” programming languages. This started with languages such as FORTRAN, COBOL, and LISP and many others followed. For more information read the Wikipedia article “History of programming languages.”

The basics of the compiling process

Compiling a program written in a language such as FORTRAN or C is a multi-step process. The exact details depend on the language or the compiler. In this discussion, I will assume the C language; however the process is similar for FORTRAN and other compiled languages. When a C program is compiled, it goes through the following four stages:

We will provide a brief discussion of each of these below. The preprocessing and linking stages will be discussed in more detail later.

Preprocessing

The first stage of compiling a C program is the preprocessing stage. The preprocessor takes your C code and transforms the text based on directives. A directive is a piece of code that tells the preprocessor what operations it should perform on the following lines. The preprocessor knows very little about the C language and the operations it performs are usually no more complicated than an advanced search-and-replace operation. We have already seen one preprocessor directive, the {{{#include}}} directive which we will discuss more in the next section.

Compiling

The compiling stage is by far the most complicated stage of the process. This is where the compiler takes the C code (as outputted by the preprocessor) and turns it into assembly. Translating from C to assembly is usually not a one-to-one process. For example, instead of having if, for, and while statements, the processor usually has a series of branch statements that say “If X condition holds, go to instruction Y”. The compiler then has to take the higher-level concepts of if, for, and while and write them in terms of branch instructions.

The compiling stage is also where the exact data-type details are handled. This includes computing sizes of variables and laying out structures and stack frames in memory.

Assembling

Assembling is the fairly simple process of translating assembly instructions into binary code. Because most assembly instructions correspond directly to a binary instruction the computer can understand, much of the assembly process is a direct translation. Some assembly languages allow for macros which allow you to pack multiple instructions into one. However, even there, these macros are usually very simple to expand.

The other task accomplished by the assembler is to actually compute the addresses of things. Once a program is compiled, the computer has basically no knowledge of variable or function names, only their memory addresses. Instead of doing everything in terms of hard addresses, the compiler will insert labels for variables or functions. After the assembly is expanded to binary code, the assembler goes back through and computes the exact addresses corresponding to these labels and puts them in the code where needed.

Linking

The final stage of compiling is called linking. Once the code has been compiled and assembled into binary, it is linked together with other pieces of complied code as well as external libraries. A library is a piece of code that is designed to be re-used in multiple programs. For example, every C program you compile is linked against the standard C library. Let’s take another look at our very first example:

#include <stdio.h>

int main(int argc, char **argv)
{
    /* Print a message to the screen */
    printf("Hello World!");

    return 0;
}

Where did that printf function come from? We never defined it. At the time, I waved my hands and said it had something to do with stdio.h. While stdio.h is an important part of it, the function is actually defined and its code resides in the C standard library. In order to be able to execute the printf function, our code has to be linked against this library. We will talk about libraries more in the future.