The LLVM Compiler Driver (llvmc)

NOTE: This document is a work in progress!

  1. Abstract
  2. Introduction
    1. Purpose
    2. Operation
    3. Phases
    4. Actions
  3. Details
  4. Configuration
  5. Glossary

Written by Reid Spencer

Abstract

This document describes the requirements, design, and configuration of the LLVM compiler driver, llvmc. The compiler driver knows about LLVM's tool set and can be configured to know about a variety of compilers for source languages. It uses this knowledge to execute the tools necessary to accomplish general compilation, optimization, and linking tasks. The main purpose of llvmc is to provide a simple and consistent interface to all compilation tasks. This reduces the burden on the end user, who can learn to use llvmc instead of the entire LLVM tool set and every source language compiler compatible with LLVM.

Introduction

The llvmc tool is a configurable compiler driver. As such, it isn't the compiler, optimizer, or linker itself; it drives (invokes) other software that performs those tasks. If you are familiar with the GNU Compiler Collection's gcc tool, llvmc is very similar.

The following introductory sections will help you understand why this tool is necessary and what it does.

Purpose

llvmc was invented to make compilation with LLVM based compilers easier. To accomplish this, llvmc strives to:

Additionally, llvmc makes it easier to write a compiler for use with LLVM, because it:

Operation

At a high level, llvmc operation is very simple. The basic action taken by llvmc is to invoke some tool or set of tools to fulfill the user's request for compilation. Every execution of llvmc takes the following sequence of steps:

Collect Command Line Options
The command line options provide the marching orders to llvmc on what actions it should perform. This is the request the user is making of llvmc and it is interpreted first. See the llvmc manual page for details on the options.
Read Configuration Files
Based on the options and the suffixes of the filenames presented, a set of configuration files is read to configure the actions llvmc will take. Configuration files are provided by either LLVM or the front end compiler tools that llvmc invokes. These files determine what actions llvmc will take in response to the user's request. See the section on configuration for more details.
Determine Phases To Execute
Based on the command line options and configuration files, llvmc determines the compilation phases that must be executed to satisfy the user's request. This is the primary work of llvmc.
Determine Actions To Execute
Each phase to be executed can result in the invocation of one or more actions. An action is either a whole program or a function in a dynamically linked shared library. In this step, llvmc determines the sequence of actions that must be executed. Actions will always be executed in a deterministic order.
Execute Actions
The actions necessary to support the user's original request are executed sequentially and deterministically. All actions result in either the invocation of a whole program to perform the action or the loading of a dynamically linkable shared library and invocation of a standard interface function within that library.
Termination
If any action fails (returns a non-zero result code), llvmc also fails and returns the result code from the failing action. If everything succeeds, llvmc will return a zero result code.
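
The execution and termination steps above can be sketched in a few lines of Python. This is only an illustration, not llvmc's implementation: run_actions is a hypothetical name, and every action is assumed to be a whole program given as an argument list (the shared library form of an action is not modelled).

```python
import subprocess
import sys

def run_actions(actions):
    """Run each action in order and stop at the first failure.

    `actions` is a list of argument lists, one per program to invoke.
    Returns 0 if every action succeeds, otherwise the result code of
    the first failing action.
    """
    for argv in actions:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Propagate the failing action's result code unchanged.
            return result.returncode
    return 0

if __name__ == "__main__":
    # Stand-in actions: the Python interpreter exiting with a chosen code.
    ok = [sys.executable, "-c", "pass"]
    bad = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_actions([ok, bad, ok]))
```

Because the loop returns as soon as an action fails, later actions never run, which matches the sequential, deterministic ordering described above.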

llvmc's operation must be simple, regular and predictable. Developers need to be able to rely on it to take a consistent approach to compilation. For example, the invocation:

   llvmc -O2 x.c y.c z.c -o xyz

must produce exactly the same results as:

   llvmc -O2 x.c
   llvmc -O2 y.c
   llvmc -O2 z.c
   llvmc -O2 x.o y.o z.o -o xyz

To accomplish this, llvmc uses a very simple goal-oriented procedure to do its work. The overall goal is to produce a functioning executable. To that end, llvmc always attempts to execute a series of compilation phases in the same sequence. However, the user's options to llvmc can cause the sequence of phases to start in the middle or finish early.
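
One way to picture "start in the middle or finish early" is as a contiguous slice of a fixed phase list. The sketch below assumes the five phase names given in the Glossary; phases_to_run and its parameters are hypothetical names used only for illustration.

```python
# Phase names in execution order, as listed in the Glossary.
PHASES = ["preprocessing", "translation", "optimization", "assembly", "linking"]

def phases_to_run(first="preprocessing", last="linking"):
    """Return the contiguous run of phases from `first` through `last`."""
    start = PHASES.index(first)
    stop = PHASES.index(last)
    if start > stop:
        raise ValueError(f"{first} comes after {last}")
    return PHASES[start:stop + 1]

if __name__ == "__main__":
    # Finishing early: stop once preprocessing is done.
    print(phases_to_run(last="preprocessing"))
    # Starting in the middle: begin at optimization.
    print(phases_to_run(first="optimization"))
```

Since the order of PHASES never changes, any request is answered with the same phases in the same sequence; only the end points move.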

Phases

llvmc breaks every compilation task into the following five distinct phases:

Preprocessing
Not all languages support preprocessing, but for those that do, this phase can be invoked. This phase is for languages that provide combining, filtering, or otherwise altering of the source language input before the translator parses it. Although C and C++ are the most common users of this phase, other languages may provide their own preprocessor (whether it's the C pre-processor or not).
Translation
The translation phase converts the source language input into something that LLVM can interpret and use for downstream phases. The translation is essentially from "non-LLVM form" to "LLVM form".
Optimization
Once an LLVM Module has been obtained from the translation phase, the program enters the optimization phase. This phase attempts to optimize all of the input provided on the command line according to the options provided.
Assembly
LLVM bytecode or LLVM assembly code is assembled to a native code format (either target specific assembly language or the platform's native object file format).
Linking
The inputs are combined to form a complete program.

The following table shows the inputs, outputs, and command line options applicable to each phase.

Phase: Preprocessing
  Inputs:
    • Source Language File
  Outputs:
    • Source Language File
  Options:
    -E  Stops the compilation after preprocessing.

Phase: Translation
  Inputs:
    • Source Language File
  Outputs:
    • LLVM Assembly
    • LLVM Bytecode
    • LLVM C++ IR
  Options:
    -c  Stops the compilation after translation so that optimization and linking are not done.
    -S  Stops the compilation before object code is written so that only assembly code remains.

Phase: Optimization
  Inputs:
    • LLVM Assembly
    • LLVM Bytecode
  Outputs:
    • LLVM Bytecode
  Options:
    -Ox  This group of options affects the amount of optimization performed.

Phase: Linking
  Inputs:
    • LLVM Bytecode
    • Native Object Code
    • LLVM Library
    • Native Library
  Outputs:
    • LLVM Bytecode Executable
    • Native Executable
  Options:
    -L  Specifies a path for library search.
    -l  Specifies a library to link in.

Actions

An action, with regard to llvmc, is a basic operation that it takes in order to fulfill the user's request. Each phase of compilation will invoke zero or more actions in order to accomplish that phase.

Actions come in two forms:

  1. Invokable Executables
  2. Functions in a shared library
Details
Configuration

This section of the document describes the configuration files used by llvmc. Configuration information is relatively static for a given release of LLVM and a front end compiler. However, the details may change from release to release of either. Users are encouraged to simply use the various options of the llvmc command and ignore the configuration of the tool. These configuration files are for compiler writers and LLVM developers. Those wishing to simply use llvmc don't need to understand this section, but it may be instructive on how the tool works.

Overview

llvmc is highly configurable both on the command line and in configuration files. The options it understands are generic, consistent and simple by design. Furthermore, the llvmc options apply to the compilation of any LLVM enabled programming language. To be enabled as a supported source language compiler, a compiler writer must provide a configuration file that tells llvmc how to invoke the compiler and what its capabilities are. The purpose of the configuration files, then, is to allow compiler writers to specify to llvmc how the compiler should be invoked. Users may alter the compiler's llvmc configuration, but are advised not to.

Because llvmc just invokes other programs, it must deal with the available command line options for those programs regardless of whether they were written for LLVM or not. Furthermore, not all compilation front ends will have the same capabilities. Some front ends will simply generate LLVM assembly code, others will be able to generate fully optimized byte code. In general, llvmc doesn't make any assumptions about the capabilities or command line options of a sub-tool. It simply uses the details found in the configuration files and leaves it to the compiler writer to specify the configuration correctly.

This approach means that new compiler front ends can be up and working very quickly. As a first cut, a front end can simply compile its source to raw (unoptimized) bytecode or LLVM assembly and llvmc can be configured to pick up the slack (translate LLVM assembly to bytecode, optimize the bytecode, generate native assembly, link, etc.). In fact, the front end need not use any LLVM libraries, and it could be written in any language (instead of C++). The configuration data will allow the full range of optimization, assembly, and linking capabilities that LLVM provides to be added to these kinds of tools. Enabling the rapid development of front-ends is one of the primary goals of llvmc.

As a compiler front end matures, it may utilize the LLVM libraries and tools to more efficiently produce optimized bytecode directly in a single compilation and optimization program. In these cases, multiple tools would not be needed and the configuration data for the compiler would change.

Configuring llvmc to the needs and capabilities of a source language compiler is relatively straightforward. A compiler writer must provide a definition of what to do for each of the five compilation phases for each of the optimization levels. The specification consists simply of prototypical command lines into which llvmc can substitute command line arguments and file names. Note that any given phase can be completely blank if the source language's compiler combines multiple phases into a single program. For example, quite often pre-processing, translation, and optimization are combined into a single program. The specification for such a compiler would have blank entries for pre-processing and translation but a full command line for optimization.

Configuration Files

Types of Files

There are two types of configuration files: the master configuration file and the language specific configuration file. The master configuration file contains the general configuration of llvmc itself and is supplied with the tool. It contains information that is source language agnostic. Language specific configuration files tell llvmc how to invoke the language's compiler for a variety of different tasks and what other tools are needed to backfill the compiler's missing features (e.g. optimization).

Directory Search

llvmc always looks for files of a specific name. It uses the first file with the name it's looking for by searching directories in the following order:

  1. Any directory specified by the --config-dir option will be checked first.
  2. If the environment variable LLVM_CONFIG_DIR is set, and it contains the name of a valid directory, that directory will be searched next.
  3. If the user's home directory (typically /home/user) contains a sub-directory named .llvm and that directory contains a sub-directory named etc, then that directory will be tried next.
  4. If the LLVM installation directory (typically /usr/local/llvm) contains a sub-directory named etc, then that directory will be tried last.
  5. If the configuration file sought still can't be found, llvmc will print an error message and exit.
The first file found in this search will be used. Other files with the same name will be ignored even if they exist in one of the subsequent search locations.
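
The search order above might be sketched as follows. This is an illustration under stated assumptions rather than llvmc's own code: find_config_file and its parameters are hypothetical names, the home and installation directories are passed in explicitly, and a missing file is reported with a Python exception instead of an error message and exit.

```python
import os

def find_config_file(name, config_dir=None, environ=None, home=None,
                     install_dir="/usr/local/llvm"):
    """Return the path of the first configuration file called `name`."""
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home

    candidates = []
    if config_dir:
        # 1. Directory given with the --config-dir option.
        candidates.append(config_dir)
    env_dir = environ.get("LLVM_CONFIG_DIR")
    if env_dir and os.path.isdir(env_dir):
        # 2. LLVM_CONFIG_DIR, only if it names a valid directory.
        candidates.append(env_dir)
    # 3. The .llvm/etc sub-directory of the user's home directory.
    candidates.append(os.path.join(home, ".llvm", "etc"))
    # 4. The etc sub-directory of the LLVM installation directory.
    candidates.append(os.path.join(install_dir, "etc"))

    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # The first match wins; later directories are never consulted.
            return path
    # 5. Nothing found anywhere in the search path.
    raise FileNotFoundError(f"configuration file {name!r} not found")
```

Returning on the first match is what makes a file in an earlier directory shadow any file of the same name in a later one.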

File Names

In the directories searched, a file named master will be recognized as the master configuration file for llvmc. Note that users may override the master file with a copy in their home directory but they are advised not to. This capability is only useful for compiler implementers needing to alter the master configuration while developing their compiler front end. When reading the configuration files, the master files are always read first.

Language specific configuration files are given specific names to foster faster lookup. The name of a given language specific configuration file is the same as the suffix used to identify files containing source in that language. For example, a configuration file for C++ source might be named cpp, C, or cxx.

What Gets Read

The master configuration file is always read. Which language specific configuration files are read depends on the command line options and the suffixes of the file names provided on llvmc's command line. Note that the --x LANGUAGE option alters the language that llvmc uses for the subsequent files on the command line. Only the language specific configuration files actually needed to complete llvmc's task are read. Other language specific files will be ignored.
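
As a rough sketch of the selection just described, the function below derives the configuration file names from the suffixes of the input files, with the master file always first. The name config_files_to_read is hypothetical, the language override option is not modelled, and no attempt is made to tell source suffixes apart from other suffixes such as object files.

```python
import os

def config_files_to_read(filenames):
    """Return configuration file names in read order.

    The master file always comes first, followed by one language
    specific file per distinct suffix, in order of first appearance.
    """
    names = ["master"]
    for filename in filenames:
        # "x.cpp" has the suffix ".cpp"; the config file is named "cpp".
        suffix = os.path.splitext(filename)[1].lstrip(".")
        if suffix and suffix not in names:
            names.append(suffix)
    return names

if __name__ == "__main__":
    print(config_files_to_read(["x.c", "y.c", "z.cpp"]))
```

Files for languages that do not appear on the command line are never named, so they are never read.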

Syntax

The syntax of the configuration files is yet to be determined. There are two viable options remaining:

Master Configuration Items

Section: [lang=LANGUAGE]

This section provides the master configuration data for a given language. The language specific data will be found in a file named LANGUAGE.

suffix=SUFFIX
This adds the SUFFIX specified to the list of recognized suffixes for the LANGUAGE identified in the section. As many suffixes as are commonly used for source files for the LANGUAGE should be specified.

For example, the following might appear for C++:

[lang=C++]
suffix=.cpp
suffix=.cxx
suffix=.C

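
Since the configuration syntax is not yet settled, the following is only a sketch that assumes the bracketed-section form used in the C++ example above. The function parse_master is a hypothetical name; it collects the repeated suffix entries of each language section into a list.

```python
def parse_master(text):
    """Map each language named in a [lang=...] section to its suffixes."""
    languages = {}
    current = None  # Suffix list of the section being read, if any.
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            header = line[1:-1]
            if header.startswith("lang="):
                current = languages.setdefault(header[len("lang="):], [])
            else:
                # Some other kind of section; ignore its items.
                current = None
        elif line.startswith("suffix=") and current is not None:
            current.append(line[len("suffix="):])
    return languages

if __name__ == "__main__":
    example = "[lang=C++]\nsuffix=.cpp\nsuffix=.cxx\nsuffix=.C\n"
    print(parse_master(example))
```

A hand-written loop is used here because the same key appears several times within one section, so the entries must accumulate rather than overwrite each other.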
Language Specific Configuration Items

Section: [general]

Pre-processing support
This item specifies whether the language has a pre-processing phase or not. This controls whether the -E option works for the language or not.

Output kind
This item specifies the kind of output the language's compiler generates. The choices are either bytecode or LLVM assembly.

Section: [-O0]

Pre-processing command
This item specifies the command to use for pre-processing the input. Valid substitutions for this item are:

  %in%       The input source file.
  %out%      The output file.
  %options%  Any pre-processing specific options (e.g. -I).

Translation command
This item specifies the command to use for translating the source language input into the output format given by the output kind item in the [general] section.

Optimization command
This item specifies the command for optimizing the translator's output.

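
The %in%, %out%, and %options% substitutions described above could be expanded along these lines. This is a sketch only: expand_command is a hypothetical name, the template is assumed to be split into words the way a shell would split it, and mycc in the usage example is a made-up compiler name.

```python
import shlex

def expand_command(template, in_file, out_file, options=()):
    """Expand a command template into an argument list.

    %in% and %out% are replaced by the given file names, and a word
    consisting of %options% is replaced by zero or more option words.
    """
    argv = []
    for word in shlex.split(template):
        if word == "%options%":
            # One placeholder may expand to any number of arguments.
            argv.extend(options)
        else:
            argv.append(word.replace("%in%", in_file).replace("%out%", out_file))
    return argv

if __name__ == "__main__":
    print(expand_command("mycc %options% %in% -o %out%",
                         "x.c", "x.bc", ["-I", "include"]))
```

Building an argument list rather than a single string avoids quoting problems when a file name contains spaces.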
Glossary

This document uses precise terms in reference to the various artifacts and concepts related to compilation. The terms used throughout this document are defined below.

assembly
A compilation phase in which LLVM bytecode or LLVM assembly code is assembled to a native code format (either target specific assembly language or the platform's native object file format).
compiler
Refers to any program that can be invoked by llvmc to accomplish the work of one or more compilation phases.
driver
Refers to llvmc itself.
linking
A compilation phase in which LLVM bytecode files and (optionally) native system libraries are combined to form a complete executable program.
optimization
A compilation phase in which LLVM bytecode is optimized.
phase
Refers to any one of the five compilation phases that llvmc supports. The five phases are: preprocessing, translation, optimization, assembly, and linking.
source language
Any common programming language (e.g. C, C++, Java, Stacker, ML, FORTRAN). These languages are distinguished from any of the lower level languages (such as LLVM or native assembly), by the fact that a translation phase is required before LLVM can be applied.
tool
Refers to any program in the LLVM tool set.
translation
A compilation phase in which source language code is translated into either LLVM assembly language or LLVM bytecode.
