NOTE: This document is a work in progress!
This document describes the requirements, design, and configuration of the LLVM compiler driver, llvmc. The compiler driver knows about LLVM's tool set and can be configured to know about a variety of compilers for source languages. It uses this knowledge to execute the tools necessary to accomplish general compilation, optimization, and linking tasks. The main purpose of llvmc is to provide a simple and consistent interface to all compilation tasks. This reduces the burden on the end user, who can learn to use llvmc instead of the entire LLVM tool set and every source language compiler compatible with LLVM.
The llvmc tool is a configurable compiler driver. As such, it isn't a compiler, optimizer, or linker itself; it drives (invokes) other software that performs those tasks. If you are familiar with the GNU Compiler Collection's gcc tool, llvmc is very similar.
The following introductory sections will help you understand why this tool is necessary and what it does.
llvmc was invented to make compilation of user programs with LLVM-based tools easier. To accomplish this, llvmc strives to:
Additionally, llvmc makes it easier to write a compiler for use with LLVM, because it:
At a high level, llvmc operation is very simple. The basic action taken by llvmc is to invoke some tool or set of tools to fill the user's request for compilation. Every execution of llvmc takes the following sequence of steps:
llvmc's operation must be simple, regular and predictable. Developers need to be able to rely on it to take a consistent approach to compilation. For example, the invocation:

    llvmc -O2 x.c y.c z.c -o xyz

must produce exactly the same results as:

    llvmc -O2 x.c -o x.o
    llvmc -O2 y.c -o y.o
    llvmc -O2 z.c -o z.o
    llvmc -O2 x.o y.o z.o -o xyz

To accomplish this, llvmc uses a very simple goal oriented procedure to do its work. The overall goal is to produce a functioning executable. To accomplish this, llvmc always attempts to execute a series of compilation phases in the same sequence. However, the user's options to llvmc can cause the sequence of phases to start in the middle or finish early.
llvmc breaks every compilation task into the following five distinct phases:
The following table shows the inputs, outputs, and command line options applicable to each phase.
Phase | Inputs | Outputs | Options |
---|---|---|---|
Preprocessing |
|
|
|
Translation |
|
|
|
Optimization |
|
|
|
Linking |
|
|
|
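
The goal-oriented procedure described above, where phases always run in the same order but may start in the middle or finish early, can be sketched as follows. This is only an illustration: the phase names are taken from the five tools named in the configuration section of this document (pre-processor, translator, optimizer, assembler, linker), and the `start` and `final` parameters are hypothetical stand-ins for whatever llvmc derives from its input files and options.

```python
# Phase names are an assumption based on the five tools this document names:
# pre-processor, translator, optimizer, assembler and linker.
PHASES = ["preprocessing", "translation", "optimization", "assembly", "linking"]


def phases_to_run(start="preprocessing", final="linking"):
    """Return the contiguous run of phases from `start` through `final`.

    `start` and `final` are hypothetical parameters, not llvmc options.
    """
    first = PHASES.index(start)
    last = PHASES.index(final)
    if first > last:
        raise ValueError(f"cannot start at {start!r} and finish at {final!r}")
    return PHASES[first:last + 1]


if __name__ == "__main__":
    print(phases_to_run())                      # the full sequence
    print(phases_to_run(final="optimization"))  # finish early
    print(phases_to_run(start="linking"))       # start in the middle
```

The point of the sketch is that the order never changes; only the endpoints move.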
An action, with regard to llvmc, is a basic operation that it takes in order to fulfill the user's request. Each phase of compilation will invoke zero or more actions in order to accomplish that phase.
Actions come in two forms:
This section of the document describes the configuration files used by llvmc. Configuration information is relatively static for a given release of LLVM and a compiler tool. However, the details may change from release to release of either. Users are encouraged to simply use the various options of the llvmc command and ignore the configuration of the tool. These configuration files are for compiler writers and LLVM developers. Those wishing to simply use llvmc don't need to understand this section but it may be instructive on how the tool works.
llvmc is highly configurable both on the command line and in configuration files. The options it understands are generic, consistent and simple by design. Furthermore, the llvmc options apply to the compilation of any LLVM enabled programming language. For a compiler to be enabled as a supported source language compiler, its writer must provide a configuration file that tells llvmc how to invoke the compiler and what its capabilities are. Users may alter the compiler's llvmc configuration, but they are advised not to.
Because llvmc just invokes other programs, it must deal with the available command line options for those programs regardless of whether they were written for LLVM or not. Furthermore, not all compiler tools will have the same capabilities. Some compiler tools will simply generate LLVM assembly code, others will be able to generate fully optimized byte code. In general, llvmc doesn't make any assumptions about the capabilities or command line options of a sub-tool. It simply uses the details found in the configuration files and leaves it to the compiler writer to specify the configuration correctly.
This approach means that new compiler tools can be up and working very quickly. As a first cut, a tool can simply compile its source to raw (unoptimized) bytecode or LLVM assembly and llvmc can be configured to pick up the slack (translate LLVM assembly to bytecode, optimize the bytecode, generate native assembly, link, etc.). In fact, a compiler tool need not use any LLVM libraries, and it can be written in any language (not only C++). The configuration data will allow the full range of optimization, assembly, and linking capabilities that LLVM provides to be added to these kinds of tools. Enabling the rapid development of front-ends is one of the primary goals of llvmc.
As a compiler tool matures, it may utilize the LLVM libraries and tools to more efficiently produce optimized bytecode directly in a single compilation and optimization program. In these cases, multiple tools would not be needed and the configuration data for the compiler would change.
Configuring llvmc to the needs and capabilities of a source language compiler is relatively straightforward. A compiler writer must provide a definition of what to do for each of the five compilation phases for each of the optimization levels. The specification consists simply of prototypical command lines into which llvmc can substitute command line arguments and file names. Note that any given phase can be completely blank if the source language's compiler combines multiple phases into a single program. For example, quite often pre-processing, translation, and optimization are combined into a single program. The specification for such a compiler would have blank entries for pre-processing and translation but a full command line for optimization.
Each configuration file provides the details for a single source language that is to be compiled. This configuration information tells llvmc how to invoke the language's pre-processor, translator, optimizer, assembler and linker. Note that a given source language needn't provide all these tools, as many of them already exist in LLVM.
llvmc always looks for files of a specific name. It uses the first file with the name it's looking for by searching directories in the following order:
The first file found in this search will be used. Other files with the same name will be ignored even if they exist in one of the subsequent search locations.
In the directories searched, each configuration file is given a specific name to foster faster lookup (so llvmc doesn't have to do directory searches). The name of a given language specific configuration file is simply the same as the suffix used to identify files containing source in that language. For example, a configuration file for C++ source might be named cpp, C, or cxx. For languages that support multiple file suffixes, multiple (probably identical) files (or symbolic links) will need to be provided.
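
The lookup rule described above (the file is named after the source suffix, and the first match in the search order wins) can be sketched as follows. This is a minimal illustration, not llvmc's implementation: `find_config` and the directories passed to it are hypothetical, since the actual search locations are whatever llvmc defines.

```python
from pathlib import Path


def find_config(suffix, search_dirs):
    """Return the first file named `suffix` found in `search_dirs`, or None.

    `search_dirs` is a caller-supplied, ordered list of directories; the
    real search locations are defined by llvmc, not by this sketch.
    """
    for directory in search_dirs:
        candidate = Path(directory) / suffix
        if candidate.is_file():
            return candidate  # later directories are never consulted
    return None
```

Because each lookup is a direct path test for one known file name, no directory needs to be listed, which is the speed benefit the naming scheme is meant to provide.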
Which configuration files are read depends on the command line options and the suffixes of the file names provided on llvmc's command line. Note that the -x LANGUAGE option alters the language that llvmc uses for the subsequent files on the command line. Only the configuration files actually needed to complete llvmc's task are read. Other language specific files will be ignored.
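
As a rough sketch of how the set of needed configuration files might be derived from a command line, the function below tracks the most recent `-x LANGUAGE` and otherwise falls back to each file's suffix. It makes several simplifying assumptions that the text above does not state: `-x` and its language are separate arguments, the configuration file for a `-x` language is named after that language, and every other argument beginning with `-` is an option that takes no value. The function name `configs_needed` is hypothetical.

```python
import os


def configs_needed(args):
    """Return the configuration file names needed for the inputs in `args`.

    Simplified on purpose: see the assumptions listed in the text above.
    """
    needed = []
    forced = None  # language set by the most recent -x, if any
    remaining = iter(args)
    for arg in remaining:
        if arg == "-x":
            forced = next(remaining, None)  # applies to all later files
            continue
        if arg.startswith("-"):
            continue  # assumed to be an option without a value
        name = forced or os.path.splitext(arg)[1].lstrip(".")
        if name and name not in needed:
            needed.append(name)
    return needed


if __name__ == "__main__":
    print(configs_needed(["a.c", "-x", "st", "b.txt", "c.c"]))
```

In the usage line, `a.c` selects the `c` configuration by suffix, while `b.txt` and `c.c` both follow the earlier `-x st`, so only two configuration files would be read.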
The syntax of the configuration files is very simple and somewhat compatible with Java's property files. Here are the syntax rules:
The table below provides definitions of the allowed configuration items that may appear in a configuration file. Every item has a default value and does not need to appear in the configuration file. Missing items will have the default value. Each identifier may appear as all lower case, first letter capitalized or all upper case.
Name | Value Type | Description | Default | |
---|---|---|---|---|
LLVMC ITEMS | ||||
version | string | Provides the version string for the contents of this configuration file. What is accepted as a legal configuration file will change over time and this item tells llvmc which version should be expected. | b | |
LANG ITEMS | ||||
lang.name | string | Provides the common name for a language definition. For example "C++", "Pascal", "FORTRAN", etc. | blank | |
lang.opt1 | string | Specifies the parameters to give the optimizer when -O1 is specified on the llvmc command line. | -simplifycfg -instcombine -mem2reg | |
lang.opt2 | string | Specifies the parameters to give the optimizer when -O2 is specified on the llvmc command line. | TBD | |
lang.opt3 | string | Specifies the parameters to give the optimizer when -O3 is specified on the llvmc command line. | TBD | |
lang.opt4 | string | Specifies the parameters to give the optimizer when -O4 is specified on the llvmc command line. | TBD | |
lang.opt5 | string | Specifies the parameters to give the optimizer when -O5 is specified on the llvmc command line. | TBD | |
PREPROCESSOR ITEMS | ||||
preprocessor.command | command | This provides the command prototype that will be used to run the preprocessor. This is generally only used with the -E option. | <blank> | |
preprocessor.required | boolean | This item specifies whether the pre-processing phase is required by the language. If the value is true, then the preprocessor.command value must not be blank. With this option, llvmc will always run the preprocessor as it assumes that the translation and optimization phases don't know how to pre-process their input. | false | |
TRANSLATOR ITEMS | ||||
translator.command | command | This provides the command prototype that will be used to run the translator. Valid substitutions are %in% for the input file and %out% for the output file. | <blank> | |
translator.output | bytecode or assembly | This item specifies the kind of output the language's translator generates. | bytecode | |
translator.preprocesses | boolean | Indicates that the translator also preprocesses. If this is true, then llvmc will skip the pre-processing phase whenever the final phase is not pre-processing. | false | |
OPTIMIZER ITEMS | ||||
optimizer.command | command | This provides the command prototype that will be used to run the optimizer. Valid substitutions are %in% for the input file and %out% for the output file. | <blank> | |
optimizer.output | bytecode or assembly | This item specifies the kind of output the language's optimizer generates. Valid values are "assembly" and "bytecode" | bytecode | |
optimizer.preprocesses | boolean | Indicates that the optimizer also preprocesses. If this is true, then llvmc will skip the pre-processing phase whenever the final phase is optimization or later. | false | |
optimizer.translates | boolean | Indicates that the optimizer also translates. If this is true, then llvmc will skip the translation phase whenever the final phase is optimization or later. | false | |
ASSEMBLER ITEMS | ||||
assembler.command | command | This provides the command prototype that will be used to run the assembler. Valid substitutions are %in% for the input file and %out% for the output file. | <blank> |
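
The interaction between `translator.preprocesses`, `optimizer.preprocesses`, `optimizer.translates`, and `preprocessor.required` can be sketched as a small planning function. This only applies the skip rules stated in the table; it does not model what llvmc does when a phase's command is blank. The phase names, the `planned_phases` function, and the choice to let `preprocessor.required` override the skip rules are assumptions made for illustration.

```python
# Phase names are an assumption based on the tools named in this document.
PHASES = ["preprocessing", "translation", "optimization", "assembly", "linking"]


def planned_phases(config, final="linking"):
    """Return the phases to run up to `final`, applying the table's skip rules.

    `config` maps item names (as in the table) to already-parsed booleans.
    """
    last = PHASES.index(final)
    plan = PHASES[:last + 1]
    opt_or_later = last >= PHASES.index("optimization")

    skip_pre = False
    if config.get("translator.preprocesses") and final != "preprocessing":
        skip_pre = True
    if config.get("optimizer.preprocesses") and opt_or_later:
        skip_pre = True
    if config.get("preprocessor.required"):
        skip_pre = False  # assumption: "required" wins over the skip rules

    skip_trans = bool(config.get("optimizer.translates")) and opt_or_later

    return [
        phase for phase in plan
        if not (phase == "preprocessing" and skip_pre)
        and not (phase == "translation" and skip_trans)
    ]
```

For example, a configuration with `translator.preprocesses` set to true still yields the pre-processing phase when that phase is the final one, which matches the table's wording for that item.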
On any configuration item that ends in command, you must specify substitution tokens. Substitution tokens begin and end with a percent sign (%) and are replaced by the corresponding text. Any substitution token may be given on any command line but some are more useful than others. In particular, each command should have both an %in% and an %out% substitution. The table below provides definitions of each of the allowed substitution tokens.
Substitution Token | Replacement Description |
---|---|
%args% | Replaced with all the tool-specific arguments given to llvmc via the -T set of options. This just allows you to place these arguments in the correct place on the command line. If the %args% token does not appear on your command line, then you are explicitly disallowing the -T option for your tool. |
%force% | Replaced with the -f option if it was specified on the llvmc command line. This is intended to tell the compiler tool to force the overwrite of output files. |
%in% | Replaced with the full path of the input file. You needn't worry about the cascading of file names. llvmc will create temporary files and ensure that the output of one phase is the input to the next phase. |
%opt% | Replaced with the optimization options for the tool. If the tool understands the -O options then that will be passed. Otherwise, the lang.optN series of configuration items will specify which arguments are to be given. |
%out% | Replaced with the full path of the output file. Note that this is not necessarily the output file specified with the -o option on llvmc's command line. It might be a temporary file that will be passed to a subsequent phase's input. |
%stats% | If your command accepts the -stats option, use this substitution token. If the user requested -stats from the llvmc command line then this token will be replaced with -stats, otherwise it will be ignored. |
%target% | Replaced with the name of the target "machine" for which code should be generated. The value used here is taken from the llvmc option -march. |
%time% | If your command accepts the -time-passes option, use this substitution token. If the user requested -time-passes from the llvmc command line then this token will be replaced with -time-passes, otherwise it will be ignored. |
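
Token substitution as described in the table can be sketched in a few lines. This is an illustrative sketch rather than llvmc's code: `expand` and the `values` dictionary are hypothetical, and the file paths in the usage example are made up. Tokens with no value (for example %stats% when -stats was not requested) expand to nothing.

```python
import re

# A substitution token is a name wrapped in percent signs, e.g. %in%.
TOKEN = re.compile(r"%(\w+)%")


def expand(prototype, values):
    """Replace each %name% token in `prototype` and return an argument list.

    `values` maps token names (without the percent signs) to replacement
    text; missing names expand to the empty string.
    """
    expanded = TOKEN.sub(lambda m: values.get(m.group(1), ""), prototype)
    # Splitting on whitespace also removes the gaps left by empty tokens.
    # It would break paths containing spaces, which this sketch ignores.
    return expanded.split()


if __name__ == "__main__":
    prototype = "opt %in% -o %out% %opt% %time% %stats% %force% %args%"
    values = {
        "in": "/tmp/x.bc",        # made-up temporary input path
        "out": "/tmp/x.opt.bc",   # made-up temporary output path
        "opt": "-simplifycfg -instcombine -mem2reg",
        "force": "-f",
    }
    print(expand(prototype, values))
```

Here %opt% is given the lang.opt1 default from the configuration item table, standing in for the case where the tool does not understand the -O options itself.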
Since an example is always instructive, here's how the Stacker language configuration file looks.

    # Stacker Configuration File For llvmc

    ##########################################################
    # Language definitions
    ##########################################################
    lang.name=Stacker
    lang.opt1=-simplifycfg -instcombine -mem2reg
    lang.opt2=-simplifycfg -instcombine -mem2reg -load-vn \
        -gcse -dse -scalarrepl -sccp
    lang.opt3=-simplifycfg -instcombine -mem2reg -load-vn \
        -gcse -dse -scalarrepl -sccp -branch-combine -adce \
        -globaldce -inline -licm
    lang.opt4=-simplifycfg -instcombine -mem2reg -load-vn \
        -gcse -dse -scalarrepl -sccp -ipconstprop \
        -branch-combine -adce -globaldce -inline -licm
    lang.opt5=-simplifycfg -instcombine -mem2reg -load-vn \
        -gcse -dse -scalarrepl -sccp -ipconstprop \
        -branch-combine -adce -globaldce -inline -licm \
        -block-placement

    ##########################################################
    # Pre-processor definitions
    ##########################################################

    # Stacker doesn't have a preprocessor but the following
    # allows the -E option to be supported
    preprocessor.command=cp %in% %out%
    preprocessor.required=false

    ##########################################################
    # Translator definitions
    ##########################################################

    # To compile stacker source, we just run the stacker
    # compiler with a default stack size of 2048 entries.
    translator.command=stkrc -s 2048 %in% -o %out% %time% \
        %stats% %force% %args%

    # stkrc doesn't preprocess but we set this to true so
    # that we don't run the cp command by default.
    translator.preprocesses=true

    # The translator is required to run.
    translator.required=true

    # stkrc doesn't handle the -On options
    translator.output=bytecode

    ##########################################################
    # Optimizer definitions
    ##########################################################

    # For optimization, we use the LLVM "opt" program
    optimizer.command=opt %in% -o %out% %opt% %time% %stats% \
        %force% %args%

    optimizer.required = true

    # opt doesn't translate
    optimizer.translates = no

    # opt doesn't preprocess
    optimizer.preprocesses=no

    # opt produces bytecode
    optimizer.output = bytecode

    ##########################################################
    # Assembler definitions
    ##########################################################
    assembler.command=llc %in% -o %out% %target% %time% %stats%

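
A reader for files like the Stacker example can be sketched as follows. It handles only the conventions visible in that example: `#` comment lines, blank lines, `name=value` pairs with optional spaces around `=`, and a trailing backslash that continues a value on the next line. Names are lowercased because the document says identifiers may be written in any of three casings. The function `parse_config` is hypothetical, and the full syntax rules may allow more than this sketch accepts.

```python
def parse_config(text):
    """Parse `name=value` lines from `text` into a dict of strings.

    Covers only the conventions visible in the Stacker example; it is not
    a complete implementation of the configuration file syntax.
    """
    items = {}
    pending = ""  # accumulated text of a backslash-continued line
    for raw in text.splitlines():
        line = raw.strip()
        if pending:
            line = pending + " " + line
            pending = ""
        elif not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("\\"):
            pending = line[:-1].rstrip()  # drop the backslash, keep going
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {raw!r}")
        items[name.strip().lower()] = value.strip()
    return items
```

Values are returned as plain strings; interpreting `true`, `false`, or `no` as booleans is left out because the accepted spellings are not specified in this section.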
This document uses precise terms in reference to the various artifacts and concepts related to compilation. The terms used throughout this document are defined below.