diff --git a/docs/BytecodeFormat.html b/docs/BytecodeFormat.html index fd02a480173..81a7eb24908 100644 --- a/docs/BytecodeFormat.html +++ b/docs/BytecodeFormat.html @@ -46,8 +46,8 @@ and Chris Lattner
This document is an (after the fact) specification of the LLVM bytecode -file format. It documents the binary encoding rules of the bytecode file format +
This document describes the LLVM bytecode +file format. It specifies the binary encoding rules of the bytecode file format so that equivalent systems can encode bytecode files correctly. The LLVM bytecode representation is used to store the intermediate representation on disk in compacted form. @@ -58,7 +58,10 @@ disk in compacted form.
This section describes the general concepts of the bytecode file format -without getting into bit and byte level specifics.
+without getting into bit and byte level specifics. Note that the LLVM bytecode +format may change in the future, but will always be backwards compatible with +older formats. This document only describes the most current version of the +bytecode format. All blocks are variable length. They consume just enough bytes to express -their contents. Each block begins with an integer identifier and the length -of the block.
+All blocks are variable length, and the block header specifies the size of +the block. All blocks are aligned to 32-bit boundaries, so they +always start and end on such a boundary. Each block begins with an integer +identifier and the length of the block, which does not include the padding +bytes needed for alignment.
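The alignment rule above can be illustrated with a small sketch. It only shows the arithmetic for rounding a byte offset up to the next 32-bit boundary. The names `ALIGNMENT`, `padding_needed`, and `aligned_end` are hypothetical and are not taken from the LLVM sources, and the sketch makes no claim about the exact layout of the block header.

```python
ALIGNMENT = 4  # bytes in a 32-bit boundary


def padding_needed(offset: int) -> int:
    """Return how many padding bytes bring `offset` up to the next boundary."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return (-offset) % ALIGNMENT


def aligned_end(offset: int) -> int:
    """Return the first aligned offset at or after `offset`."""
    return offset + padding_needed(offset)
```

Under this reading, a block whose contents end at file offset 13 would be followed by three padding bytes, and the length stored in its header would not count those three bytes.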
Most blocks are constructed of lists of information. Lists can be constructed of other lists, etc. This decomposition of information follows the containment -hierarchy of the LLVM Intermediate Representation. For example, a function is -composed of a list of basic blocks. Each basic block is composed of a set of -instructions. This list of list nesting and hierarchy is maintained in the -bytecode file.
+hierarchy of the LLVM Intermediate Representation. For example, a function +contains a list of instructions (the terminator instructions implicitly define +the end of the basic blocks). A list is encoded into the file simply by encoding the number of entries as an integer followed by each of the entries. The reader knows when the list is done because it will have filled the list with the required number of entries. @@ -106,7 +110,7 @@ done because it will have filled the list with the required number of entries.
Fields are units of information that LLVM knows how to write atomically. Most fields have a uniform length or some kind of length indication built into -their encoding. For example, a constant string (array of SByte or UByte) is +their encoding. For example, a constant string (array of bytes) is written simply as the length followed by the characters. Although this is similar to a list, constant strings are treated atomically and are thus fields.
@@ -121,7 +125,8 @@ written and how the bits are to be interpreted.Each field that can be put out is encoded into the file using a small set of primitives. The rules for these primitives are described below.
To minimize the number of bytes written for small quantities, an encoding +
Most of the values written to LLVM bytecode files are small integers. To +minimize the number of bytes written for these quantities, an encoding scheme similar to UTF-8 is used to write integer data. The scheme is known as variable bit rate (vbr) encoding. In this encoding, the high bit of each byte is used to indicate if more bytes follow. If (byte & 0x80) is non-zero @@ -148,8 +153,15 @@ as follows:
Note that in practice, the tenth byte could only encode bits 63 and 64 +
Note that in practice, the tenth byte could only encode bit 63 since the maximum quantity to use this encoding is a 64-bit integer.
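As an illustration of the vbr scheme, here is a minimal sketch of an encoder and decoder for unsigned 64-bit values. It assumes that each byte carries seven data bits in its low bits and that the least significant group is written first, which is consistent with the tenth byte holding only bit 63. The function names `encode_vbr` and `decode_vbr` are hypothetical and are not the names used by the LLVM reader or writer.

```python
from typing import Tuple


def encode_vbr(value: int) -> bytes:
    """Encode an unsigned integer (at most 64 bits) as vbr bytes."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in an unsigned 64-bit integer")
    out = bytearray()
    while value >= 0x80:
        # Low seven bits of data, high bit set: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    # Final byte has the high bit clear.
    out.append(value)
    return bytes(out)


def decode_vbr(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one vbr value starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

With this sketch, values below 128 occupy a single byte, and the largest 64-bit value occupies ten bytes, the last of which carries only bit 63.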
+ +Signed VBR values are encoded with the standard vbr encoding, but +with the sign bit as the low order bit instead of the high order bit. This +allows small negative quantities to be encoded efficiently. For example, -3 +is encoded as "((3 << 1) | 1)" and 3 is encoded as "((3 << 1) | +0)", emitted with the standard vbr encoding above.
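The sign handling described above can be sketched as a pair of helpers that move the sign into the low order bit and back. The names `fold_sign` and `unfold_sign` are hypothetical. The folded result is an unsigned quantity that would then be written with the standard vbr encoding.

```python
def fold_sign(value: int) -> int:
    """Map a signed integer to an unsigned one with the sign in bit 0."""
    if value < 0:
        return ((-value) << 1) | 1
    return value << 1


def unfold_sign(folded: int) -> int:
    """Invert fold_sign: recover the signed integer from the folded form."""
    magnitude = folded >> 1
    return -magnitude if folded & 1 else magnitude
```

Here -3 folds to 7 and 3 folds to 6, so both fit in a single vbr byte, whereas a two's complement representation of -3 would need the full ten bytes.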
+The table below defines the encoding rules for type names used in the descriptions of blocks and fields in the next section. Any type name with the suffix _vbr indicates a quantity that is encoded using @@ -176,7 +188,7 @@ variable bit rate encoding as described above.
The signature block occurs in every LLVM bytecode file and is always first. +
The signature occurs in every LLVM bytecode file and is always first. It simply provides a few bytes of data to identify the file as being an LLVM bytecode file. This block is always four bytes in length and differs from the other blocks because there is no identifier and no block length at the start @@ -294,12 +306,18 @@ of the block. Essentially, this block is just the "magic number" for the file.
The module block contains a small preamble and all the other blocks in the file. Of particular note, the bytecode format number is simply a 28-bit monotonically increasing integer that identifies the version of the bytecode -format. While the bytecode format version is not related to the LLVM release -(it doesn't automatically get increased with each new LLVM release), there is -a definite correspondence between the bytecode format version and the LLVM -release.
-The table below shows the format of the module block header. The blocks it -contains are detailed in other sections.
+format (which is not directly related to the LLVM release number). The +bytecode versions defined so far are (note that this document only describes +the latest version): + +The table below shows the format of the module block header. It is defined +by blocks described in other sections.
Byte(s) | @@ -337,11 +355,17 @@ contains are detailed in other sections. solely of other block types in sequence.
---|
Note that we plan to eventually expand the target description capabilities +of bytecode files to target +triples.
+The global type pool consists of type definitions. Their order of appearnce +
The global type pool consists of type definitions. Their order of appearance in the file determines their slot number (0 based). Slot numbers are used to replace pointers in the intermediate representation. Each slot number uniquely identifies one entry in a type plane (a collection of values of the same type).