diff --git a/docs/GarbageCollection.html b/docs/GarbageCollection.html new file mode 100644 index 00000000000..fd5c900ec41 --- /dev/null +++ b/docs/GarbageCollection.html @@ -0,0 +1,418 @@ + + + + Accurate Garbage Collection with LLVM + + + + +
+ Accurate Garbage Collection with LLVM +
+ +
    +
  1. Introduction + +
  2. + +
  3. Interfaces for user programs + +
  4. + +
  5. Implementing a garbage collector + +
  6. + + +
+ +
+

Written by Chris Lattner

+
+ + +
+ Introduction +
+ + +
+ +

Garbage collection is a widely used technique that frees the programmer from +having to know the lifetimes of heap objects, making software easier to produce +and maintain. Many programming languages rely on garbage collection for +automatic memory management. There are two primary forms of garbage collection: +conservative and accurate.

+ +

Conservative garbage collection often does not require any special support +from either the language or the compiler: it can handle non-type-safe +programming languages (such as C/C++) and does not require any special +information from the compiler. The Boehm collector is an example of a +state-of-the-art conservative collector.

+ +

Accurate garbage collection requires the ability to identify all pointers in +the program at run-time (which requires that the source-language be type-safe in +most cases). Identifying pointers at run-time requires compiler support to +locate all places that hold live pointer variables at run-time, including the +processor stack and registers.

+ +

+Conservative garbage collection is attractive because it does not require any +special compiler support, but it does have problems. In particular, because the +conservative garbage collector cannot know that a particular word in the +machine is a pointer, it cannot move live objects in the heap (preventing the +use of compacting and generational GC algorithms) and it can occasionally suffer +from memory leaks due to integer values that happen to point to objects in the +program. In addition, some aggressive compiler transformations can break +conservative garbage collectors (though these seem rare in practice). +

+ +

+Accurate garbage collectors do not suffer from any of these problems, but they +can suffer from degraded scalar optimization of the program. In particular, +because the runtime must be able to identify and update all pointers active in +the program, some optimizations are less effective. In practice, however, the +locality and performance benefits of using aggressive garbage collection +techniques dominate any low-level losses. +

+ +

+This document describes the mechanisms and interfaces provided by LLVM to +support accurate garbage collection. +

+ +
+ + +
+ GC features provided and algorithms supported +
+ +
+ +

+LLVM provides support for a broad class of garbage collection algorithms, +including compacting semi-space collectors, mark-sweep collectors, generational +collectors, and even reference counting implementations. It includes support +for read and write barriers, and associating meta-data with stack objects (used for tagless garbage +collection). All LLVM code generators support garbage collection, including the +C backend. +

+ +

+We hope that the primitive support built into LLVM is sufficient to support a +broad class of garbage collected languages, including Scheme, ML, scripting +languages, Java, C#, etc. That said, the implemented garbage collectors may +need to be extended to support language-specific features such as finalization, +weak references, or other features. As these needs are identified and +implemented, they should be added to this specification. +

+ +

+LLVM does not currently support garbage collection of multi-threaded programs or +GC-safe points other than function calls, but these will be added in the future +as there is interest. +

+ +
+ + +
+ Interfaces for user programs +
+ + +
+ +

This section describes the interfaces provided by LLVM and by the garbage +collector run-time that should be used by user programs. As such, this is the +interface that front-end authors should generate code for. +

+ +
+ + +
+ Identifying GC roots on the stack: llvm.gcroot +
+ +
+ +
+ void %llvm.gcroot(<ty>** %ptrloc, <ty2>* %metadata) +
+ +

+The llvm.gcroot intrinsic is used to inform LLVM of a pointer variable +on the stack. The first argument contains the address of the variable on the +stack, and the second contains a pointer to metadata that should be associated +with the pointer. The metadata pointer must be a constant or the address of a +global value. At runtime, the llvm.gcroot intrinsic stores a null pointer into +the specified location to initialize the pointer.

+ +

+Consider the following fragment of Java code: +

+ +
+       {
+         Object X;   // A null-initialized reference to an object
+         ...
+       }
+
+ +

+This block (which may be located in the middle of a function or in a loop nest) +could be compiled to this LLVM code: +

+ +
+Entry:
+   ;; In the entry block for the function, allocate the
+   ;; stack space for X, which is an LLVM pointer.
+   %X = alloca %Object*
+   ...
+
+   ;; "CodeBlock" is the block corresponding to the start
+   ;;  of the scope above.
+CodeBlock:
+   ;; Initialize the object, telling LLVM that it is now live.
+   ;; Java has type-tags on objects, so it doesn't need any
+   ;; metadata.
+   call void %llvm.gcroot(%Object** %X, sbyte* null)
+   ...
+
+   ;; As the pointer goes out of scope, store a null value into
+   ;; it, to indicate that the value is no longer live.
+   store %Object* null, %Object** %X
+   ...
+
+ +
+ + +
+ GC descriptor format for heap objects +
+ +
+ +

+The GC descriptor for a heap object is obtained either from the root meta-data +or from the object header. The front-end can provide a call-back that returns +the descriptor for an object that has no meta-data. +

+ +
+ + +
+ Allocating memory from the GC +
+ +
+ +
+ sbyte *%llvm_gc_allocate(unsigned %Size) +
+ +

The llvm_gc_allocate function is a global function defined by the +garbage collector implementation to allocate memory. It should return a +zeroed-out block of memory of the appropriate size.
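As a rough illustration of the contract described above, here is a minimal sketch of a bump-pointer llvm_gc_allocate in C. The fixed-size Heap array, the HEAP_WORDS constant, the WordsUsed counter, and the choice to return NULL when the heap is full are all assumptions made for this example; a real collector would more likely run a collection when space runs out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size heap, stored as pointer-sized words so that
   every allocation is pointer-aligned. Not part of the LLVM interface. */
enum { HEAP_WORDS = 8192 };
static void *Heap[HEAP_WORDS];
static size_t WordsUsed = 0;

/* Returns a zeroed block of at least Size bytes, or NULL if the heap is
   full (an assumption of this sketch; a real collector would collect). */
void *llvm_gc_allocate(unsigned Size) {
  /* Round the request up to a whole number of words. */
  size_t Words = Size / sizeof(void *) + (Size % sizeof(void *) != 0);
  if (Words == 0)
    Words = 1; /* give zero-sized requests a distinct address */
  if (Words > HEAP_WORDS - WordsUsed)
    return NULL;

  void *Result = &Heap[WordsUsed];
  WordsUsed += Words;

  /* Zero explicitly: once memory is reused after a collection it can no
     longer be assumed to be zero. */
  memset(Result, 0, Words * sizeof(void *));
  return Result;
}
```

Bump allocation is shown only because it is short; the interface does not prescribe any particular allocation strategy.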

+ +
+ + +
+ Reading and writing references to the heap +
+ +
+ +
+ sbyte *%llvm.gcread(sbyte **)
+ void %llvm.gcwrite(sbyte*, sbyte**) +
+ +

Several of the more interesting garbage collectors (e.g., generational +collectors) need to be informed when the mutator (the program that needs garbage +collection) reads or writes object references into the heap. In the case of a +generational collector, it needs to keep track of which "old" generation objects +have references stored into them. The amount of code that needs to be +executed is usually quite small, so the overall performance impact of the +inserted code is tolerable.

+ +

To support garbage collectors that use read or write barriers, LLVM provides +the llvm.gcread and llvm.gcwrite intrinsics. The first +intrinsic has exactly the same semantics as a non-volatile LLVM load and the +second has the same semantics as a non-volatile LLVM store. At code generation +time, these intrinsics are replaced with calls into the garbage collector +(llvm_gc_read and llvm_gc_write respectively), which are then +inlined into the code. +

+ +

+If you are writing a front-end for a garbage collected language, every load or +store of a reference from or to the heap should use these intrinsics instead of +normal LLVM loads/stores.

+ +
+ + +
+ Garbage collector startup and initialization +
+ +
+ +
+ void %llvm_gc_initialize() +
+ +

+The llvm_gc_initialize function should be called once before any other +garbage collection functions are called. This gives the garbage collector the +chance to initialize itself and allocate the heap spaces. +

+ +
+ + +
+ Explicit invocation of the garbage collector +
+ +
+ +
+ void %llvm_gc_collect() +
+ +

+The llvm_gc_collect function is exported by the garbage collector +implementation to perform a full collection, even when the heap is not +exhausted. End-user code can use it as a hint, which the garbage collector may +ignore. +

+ +
+ + + +
+ Implementing a garbage collector +
+ + +
+ +

+Implementing a garbage collector for LLVM is fairly straight-forward. The +implementation must include the llvm_gc_allocate and llvm_gc_collect functions, and it must implement +the read/write barrier functions as well. To +do this, it will probably have to trace through the roots +from the stack and understand the GC descriptors +for heap objects. Luckily, there are some example +implementations available. +
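To show the shape of the required entry points, the following is a sketch of about the smallest runtime that satisfies the interface: it allocates zeroed memory and never reclaims it. The C parameter names and the use of calloc are assumptions made for illustration. The function names and signatures follow this document, and treating llvm_gc_collect as a no-op relies on the statement that the collector may ignore it.

```c
#include <stdlib.h>

/* Nothing to set up in this sketch; a real collector would allocate
   its heap spaces here. */
void llvm_gc_initialize(void) {}

/* calloc returns zeroed memory, which matches the allocation contract.
   The memory is never freed, so this "collector" simply leaks. */
void *llvm_gc_allocate(unsigned Size) {
  return calloc(1, Size ? Size : 1);
}

/* Explicit collection is only a hint, so ignoring it is permitted. */
void llvm_gc_collect(void) {}

/* No barriers are needed: behave exactly like a plain load and store. */
void *llvm_gc_read(void **Ptr) { return *Ptr; }
void llvm_gc_write(void *Value, void **Ptr) { *Ptr = Value; }
```

A skeleton like this can be useful for bringing up a front-end before a real collector exists, since the generated code links and runs against it unchanged.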

+
+ + + +
+ Implementing llvm_gc_read and llvm_gc_write +
+ +
+
+ void *llvm_gc_read(void **)
+ void llvm_gc_write(void*, void**) +
+ +

+These functions must be implemented in every garbage collector, even if +they do not need read/write barriers. In this case, just load or store the +pointer, then return. +

+ +

+If an actual read or write barrier is needed, it should be straight-forward to +implement it. Note that we may add a pointer to the start of the memory object +as a parameter in the future, if needed. +
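For example, a generational collector might use the write barrier to remember which old-generation slots now point at younger objects. The sketch below is one hedged way to do that in C. The RememberedSlots array, MAX_REMEMBERED, OldGenStart, OldGenEnd, and InOldGen are hypothetical names invented for this example, and dropping entries when the array is full is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical remembered set: addresses of old-generation slots that
   hold pointers to objects outside the old generation. */
enum { MAX_REMEMBERED = 1024 };
static void **RememberedSlots[MAX_REMEMBERED];
static size_t NumRemembered = 0;

/* Hypothetical bounds of the old generation, set by the collector. */
static uintptr_t OldGenStart = 0, OldGenEnd = 0;

static int InOldGen(void *P) {
  uintptr_t Addr = (uintptr_t)P;
  return Addr >= OldGenStart && Addr < OldGenEnd;
}

/* No read barrier is needed for this scheme: just load the pointer. */
void *llvm_gc_read(void **Ptr) { return *Ptr; }

/* Perform the store, then record the slot if it lives in the old
   generation and now refers to an object outside it. */
void llvm_gc_write(void *Value, void **Ptr) {
  *Ptr = Value;
  if (Value != NULL && InOldGen(Ptr) && !InOldGen(Value) &&
      NumRemembered < MAX_REMEMBERED) {
    /* A real collector would grow the set or collect when it is full. */
    RememberedSlots[NumRemembered++] = Ptr;
  }
}
```

During a young-generation collection, the recorded slots would be scanned as additional roots, so the old generation itself does not have to be traversed.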

+ +
+ + +
+ Tracing the GC roots from the program stack +
+ +
+
+ void llvm_cg_walk_gcroots(void (*FP)(void **Root, void *Meta)); +
+ +

+The llvm_cg_walk_gcroots function is a function provided by the code +generator that iterates through all of the GC roots on the stack, calling the +specified function pointer with each record. For each GC root, the address of +the pointer and the meta-data (from the llvm.gcroot intrinsic) are provided. +
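The callback style described above might be used as in the following C sketch, which counts the roots that currently hold a non-null pointer. Because llvm_cg_walk_gcroots is supplied by the code generator, the definition here is only a stand-in that walks a hypothetical DemoRoots table so that the example is self-contained. RootRecord, DemoRoots, NumDemoRoots, CountLiveRoot, and CountLiveRoots are all names invented for illustration.

```c
#include <stddef.h>

/* Hypothetical root table standing in for the real stack roots. */
struct RootRecord {
  void **Root; /* address of the pointer variable on the stack */
  void *Meta;  /* meta-data passed to llvm.gcroot */
};
enum { MAX_ROOTS = 16 };
static struct RootRecord DemoRoots[MAX_ROOTS];
static size_t NumDemoRoots = 0;

/* Stand-in for the function the code generator provides. The real one
   iterates over the actual GC roots on the program stack. */
void llvm_cg_walk_gcroots(void (*FP)(void **Root, void *Meta)) {
  for (size_t i = 0; i < NumDemoRoots; ++i)
    FP(DemoRoots[i].Root, DemoRoots[i].Meta);
}

static size_t LiveRootCount = 0;

/* Callback invoked once per root. A real collector would mark or copy
   the object that *Root points to instead of just counting it. */
static void CountLiveRoot(void **Root, void *Meta) {
  (void)Meta; /* unused in this sketch */
  if (*Root != NULL)
    ++LiveRootCount;
}

/* Walks every root and returns how many hold a non-null pointer. */
size_t CountLiveRoots(void) {
  LiveRootCount = 0;
  llvm_cg_walk_gcroots(CountLiveRoot);
  return LiveRootCount;
}
```

Because the callback receives the address of each root rather than its value, a moving collector can also update the root in place after relocating the object.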

+
+ + + +
+ GC implementations available +
+ +
+ +

+To make this more concrete, the currently implemented LLVM garbage collectors +all live in the llvm/runtime/GC directory in the LLVM source-base. +

+ +

+TODO: Brief overview of each. +

+ +
+ + + + +
+
+ Chris Lattner
+ LLVM Compiler Infrastructure
+ Last modified: $Date$ +
+ + +