diff --git a/docs/BytecodeFormat.html b/docs/BytecodeFormat.html index fd02a480173..81a7eb24908 100644 --- a/docs/BytecodeFormat.html +++ b/docs/BytecodeFormat.html @@ -46,8 +46,8 @@ and Chris Lattner

Abstract
-

This document is an (after the fact) specification of the LLVM bytecode -file format. It documents the binary encoding rules of the bytecode file format +

This document describes the LLVM bytecode +file format. It specifies the binary encoding rules of the bytecode file format so that equivalent systems can encode bytecode files correctly. The LLVM bytecode representation is used to store the intermediate representation on disk in compacted form. @@ -58,7 +58,10 @@ disk in compacted form.

This section describes the general concepts of the bytecode file format -without getting into bit and byte level specifics.

+without getting into bit and byte level specifics. Note that the LLVM bytecode +format may change in the future, but will always be backwards compatible with +older formats. This document only describes the most current version of the +bytecode format.

Blocks
@@ -83,19 +86,20 @@ next in the file.

  • InstructionList (0x32).
  • CompactionTable (0x33).
  • -

    All blocks are variable length. They consume just enough bytes to express -their contents. Each block begins with an integer identifier and the length -of the block.

    +

    All blocks are variable length, and the block header specifies the size of +the block. All blocks are aligned to 32-bit boundaries, so they +always start and end on such a boundary. Each block begins with an integer +identifier and the length of the block, which does not include the padding +bytes needed for alignment.
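    The alignment rule can be illustrated with a little arithmetic. The following is a minimal sketch, not code from the LLVM sources: the helper names `paddingBytes` and `alignedSize` are hypothetical, and it assumes the block contents themselves begin on a 32-bit boundary, so the padding depends only on the length recorded in the block header.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of padding bytes that follow a block whose
// header records `length` content bytes, assuming the contents start on a
// 32-bit (4-byte) boundary.
inline uint32_t paddingBytes(uint32_t length) {
  return (4 - (length % 4)) % 4;
}

// Hypothetical helper: bytes a reader skips to reach the next block, that is,
// the recorded length plus the padding that the length does not include.
// The sum is computed in 64 bits so it cannot wrap for lengths near 2^32.
inline uint64_t alignedSize(uint32_t length) {
  assert(paddingBytes(length) < 4);
  return static_cast<uint64_t>(length) + paddingBytes(length);
}
```

    For example, a block whose header records a length of 5 bytes would be followed by 3 padding bytes, so the next block starts 8 bytes after the start of the contents.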

    Lists

    Most blocks are constructed of lists of information. Lists can be constructed of other lists, etc. This decomposition of information follows the containment -hierarchy of the LLVM Intermediate Representation. For example, a function is -composed of a list of basic blocks. Each basic block is composed of a set of -instructions. This list of list nesting and hierarchy is maintained in the -bytecode file.

    +hierarchy of the LLVM Intermediate Representation. For example, a function +contains a list of instructions (the terminator instructions implicitly define +the end of the basic blocks).

    A list is encoded into the file simply by encoding the number of entries as an integer followed by each of the entries. The reader knows when the list is done because it will have filled the list with the required numbe of entries. @@ -106,7 +110,7 @@ done because it will have filled the list with the required numbe of entries.

    Fields are units of information that LLVM knows how to write atomically. Most fields have a uniform length or some kind of length indication built into -their encoding. For example, a constant string (array of SByte or UByte) is +their encoding. For example, a constant string (array of bytes) is written simply as the length followed by the characters. Although this is similar to a list, constant strings are treated atomically and are thus fields.

    @@ -121,7 +125,8 @@ written and how the bits are to be interpreted.

    Each field that can be put out is encoded into the file using a small set of primitives. The rules for these primitives are described below.

    Variable Bit Rate Encoding

    -

    To minimize the number of bytes written for small quantities, an encoding +

    Most of the values written to LLVM bytecode files are small integers. To +minimize the number of bytes written for these quantities, an encoding scheme similar to UTF-8 is used to write integer data. The scheme is known as variable bit rate (vbr) encoding. In this encoding, the high bit of each byte is used to indicate if more bytes follow. If (byte & 0x80) is non-zero @@ -148,8 +153,15 @@ as follows:

    9  56-62  9,223,372,036,854,775,807
    10  63-69  1,180,591,620,717,411,303,423 -

    Note that in practice, the tenth byte could only encode bits 63 and 64 +

    Note that in practice, the tenth byte could only encode bit 63 since the maximum quantity to use this encoding is a 64-bit integer.
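    The variable bit rate scheme described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used by the LLVM bytecode writer or reader: the function names `encodeVbr` and `decodeVbr` are hypothetical, and the sketch assumes each byte carries seven data bits, least significant group first, with the high bit set when more bytes follow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoder: emits 7 bits per byte, low-order group first.
// The high bit (0x80) of a byte is set when another byte follows.
inline std::vector<uint8_t> encodeVbr(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));  // final byte, high bit clear
  return out;
}

// Hypothetical decoder: reads bytes starting at `pos` and advances `pos`
// past the value. `bytes.at` throws if the input ends prematurely.
inline uint64_t decodeVbr(const std::vector<uint8_t>& bytes, std::size_t& pos) {
  uint64_t result = 0;
  unsigned shift = 0;
  while (true) {
    // A 64-bit quantity never needs more than ten bytes (shift 0..63).
    assert(shift < 64 && "vbr value longer than ten bytes");
    const uint8_t byte = bytes.at(pos++);
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0)
      break;
    shift += 7;
  }
  return result;
}
```

    Under these assumptions, values up to 127 occupy one byte, 300 is written as the two bytes 0xAC 0x02, and the largest 64-bit value occupies ten bytes.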

    + +

    Signed VBR values are encoded with the standard vbr encoding, but +with the sign bit as the low order bit instead of the high order bit. This +allows small negative quantities to be encoded efficiently. For example, -3 +is encoded as "((3 << 1) | 1)" and 3 is encoded as "((3 << 1) | +0)", emitted with the standard vbr encoding above.
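    The sign-bit placement for signed values can be sketched as a pair of conversions between a signed integer and the unsigned quantity that is then written with the standard vbr encoding. The names `toSignedVbrValue` and `fromSignedVbrValue` are hypothetical and are not taken from the LLVM sources. The sketch also assumes the most negative 64-bit value is never written, since its magnitude does not fit once it is shifted left by one bit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps a signed value to the unsigned quantity that is
// emitted with the standard vbr encoding. The magnitude is shifted left by
// one and the sign is stored in the low-order bit.
inline uint64_t toSignedVbrValue(int64_t value) {
  // Assumption of this sketch: INT64_MIN is out of range for the scheme.
  assert(value != INT64_MIN);
  const uint64_t magnitude =
      value < 0 ? static_cast<uint64_t>(-value) : static_cast<uint64_t>(value);
  return (magnitude << 1) | (value < 0 ? 1u : 0u);
}

// Hypothetical helper: the inverse mapping, applied after the unsigned
// quantity has been read back with the standard vbr decoding.
inline int64_t fromSignedVbrValue(uint64_t raw) {
  const int64_t magnitude = static_cast<int64_t>(raw >> 1);
  return (raw & 1) ? -magnitude : magnitude;
}
```

    With this mapping, -3 becomes 7 and 3 becomes 6, which matches the example in the text, and both fit in a single vbr byte.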

    +

    The table below defines the encoding rules for type names used in the descriptions of blocks and fields in the next section. Any type name with the suffix _vbr indicate a quantity that is encoded using @@ -176,7 +188,7 @@ variable bit rate encoding as described above.

    int64_vbr A 64-bit signed integer that occupies from one to ten - bytes using variable bit rate encoding. + bytes using the signed variable bit rate encoding. char A single unsigned character encoded into one byte @@ -187,8 +199,7 @@ variable bit rate encoding as described above.

    string A uint_vbr indicating the length of the character string immediately followed by the characters of the string. There is no - terminating null byte in the string. Characters are interpreted as unsigned - char and are generally US-ASCII encoded. + terminating null byte in the string. data An arbitrarily long segment of data to which no @@ -219,18 +230,18 @@ bit and byte level specifics.

    fields in detail. These descriptions are provided in tabular form. Each table has four columns that specify:

      -
    1. Byte(s). The offset in bytes of the field from the start of +
    2. Byte(s): The offset in bytes of the field from the start of its container (block, list, other field).
    3. -
    4. Bit(s). The offset in bits of the field from the start of +
    5. Bit(s): The offset in bits of the field from the start of the byte field. Bits are always little endian. That is, bit addresses with smaller values have smaller address (i.e. 20 is at bit 0, 21 at 1, etc.)
    6. -
    7. Align? Indicates if this field is aligned to 32 bits or not. +
    8. Align?: Indicates if this field is aligned to 32 bits or not. This indicates where the next field starts, always on a 32 bit boundary.
    9. -
    10. Type. The basic type of information contained in the field.
    11. -
    12. Description. Descripts the contents of the field.
    13. +
    14. Type: The basic type of information contained in the field.
    15. +
    16. Description: Describes the contents of the field.
    @@ -240,20 +251,21 @@ bit and byte level specifics.

    of bytes known as blocks. The blocks are written sequentially to the file in the following order:

      -
    1. Signature. This block contains the file signature - (magic number) that identifies the file as LLVM bytecode.
    2. -
    3. Module Block. This is the top level block in a +
    4. Signature: This contains the file signature + (magic number) that identifies the file as LLVM bytecode and the bytecode + version number.
    5. +
    6. Module Block: This is the top level block in a bytecode file. It contains all the other blocks.
    7. -
    8. Global Type Pool. This block contains all the +
    9. Global Type Pool: This block contains all the global (module) level types.
    10. -
    11. Module Info. This block contains the types of the +
    12. Module Info: This block contains the types of the global variables and functions in the module as well as the constant initializers for the global variables
    13. -
    14. Constants. This block contains all the global +
    15. Constants: This block contains all the global constants except function arguments, global values and constant strings.
    16. -
    17. Functions. One function block is written for +
    18. Functions: One function block is written for each function in the module.
    19. -
    20. Symbol Table. The module level symbol table that +
    21. Symbol Table: The module level symbol table that provides names for the various other entries in the file is the final block written.
    @@ -261,7 +273,7 @@ bit and byte level specifics.

    Signature Block
    -

    The signature block occurs in every LLVM bytecode file and is always first. +

    The signature occurs in every LLVM bytecode file and is always first. It simply provides a few bytes of data to identify the file as being an LLVM bytecode file. This block is always four bytes in length and differs from the other blocks because there is no identifier and no block length at the start @@ -294,12 +306,18 @@ of the block. Essentially, this block is just the "magic number" for the file.

    The module block contains a small pre-amble and all the other blocks in the file. Of particular note, the bytecode format number is simply a 28-bit monotonically increase integer that identifiers the version of the bytecode -format. While the bytecode format version is not related to the LLVM release -(it doesn't automatically get increased with each new LLVM release), there is -a definite correspondence between the bytecode format version and the LLVM -release.

    -

    The table below shows the format of the module block header. The blocks it -contains are detailed in other sections.

    +format (which is not directly related to the LLVM release number). The +bytecode versions defined so far are (note that this document only describes +the latest version):

    + + + +

    The table below shows the format of the module block header. It is defined +by blocks described in other sections.

    @@ -337,11 +355,17 @@ contains are detailed in other sections.

    solely of other block types in sequence.
    Byte(s)
    + +

    Note that we plan to eventually expand the target description capabilities +of bytecode files to target +triples.

    +
    +
    Global Type Pool
    -

    The global type pool consists of type definitions. Their order of appearnce +

    The global type pool consists of type definitions. Their order of appearance in the file determines their slot number (0 based). Slot numbers are used to replace pointers in the intermediate representation. Each slot number uniquely identifies one entry in a type plane (a collection of values of the same type).