diff --git a/docs/tutorial/LangImpl8.html b/docs/tutorial/LangImpl8.html index 96b776211d8..c0c7bbb09a8 100644 --- a/docs/tutorial/LangImpl8.html +++ b/docs/tutorial/LangImpl8.html @@ -3,7 +3,8 @@
-@@ -117,13 +126,198 @@ are very useful if you want to take advantage of LLVM's capabilities.
+ + + + +We have a couple common questions about code in the LLVM IR form, lets just +get these out of the way right now shall we?
+ +Kaleidoscope is an example of a "portable language": any program written in +Kaleidoscope will work the same way on any target that it runs on. Many other +languages have this property, e.g. lisp, java, haskell, javascript, python, etc +(note that while these languages are portable, not all their libraries are).
+ +One nice aspect of LLVM is that it is often capable of preserving language +independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled +program and run it on any target that LLVM supports, even emitting C code and +compiling that on targets that LLVM doesn't support natively. You can trivially +tell that the Kaleidoscope compiler generates target-independent code because it +never queries for any target-specific information when generating code.
+ +The fact that LLVM provides a compact target-independent representation for +code gets a lot of people excited. Unfortunately, these people are usually +thinking about C or a language from the C family when they are asking questions +about language portability. I say "unfortunately", because there is really no +way to make (fully general) C code portable, other than shipping the source code +around (and of course, C source code is not actually portable in general +either - ever port a really old application from 32- to 64-bits?).
+ +The problem with C (again, in its full generality) is that it is heavily +laden with target specific assumptions. As one simple example, the preprocessor +often destructively removes target-independence from the code when it processes +the input text:
+ ++#ifdef __i386__ + int X = 1; +#else + int X = 42; +#endif ++
While it is possible to engineer more and more complex solutions to problems +like this, it cannot be solved in full generality in a way better than shipping +the actual source code.
+ +That said, there are interesting subsets of C that can be made portable. If +you are willing to fix primitive types to a fixed size (say int = 32-bits, +and long = 64-bits), don't care about ABI compatibility with existing binaries, +and are willing to give up some other minor features, you can have portable +code. This can even make real sense for specialized domains such as an +in-kernel language.
+ +Many of the languages above are also "safe" languages: it is impossible for +a program written in Java to corrupt its address space and crash the process. +Safety is an interesting property that requires a combination of language +design, runtime support, and often operating system support.
+ +It is certainly possible to implement a safe language in LLVM, but LLVM IR +does not itself guarantee safety. The LLVM IR allows unsafe pointer casts, +use after free bugs, buffer over-runs, and a variety of other problems. Safety +needs to be implemented as a layer on top of LLVM and, conveniently, several +groups have investigated this. Ask on the llvmdev mailing +list if you are interested in more details.
+ +One thing about LLVM that turns off many people is that it does not solve all +the world's problems in one system (sorry 'world hunger', someone else will have +to solve you some other day). One specific complaint is that people perceive +LLVM as being incapable of performing high-level language-specific optimization: +LLVM "loses too much information".
+ +Unfortunately, this is really not the place to give you a full and unified +version of "Chris Lattner's theory of compiler design". Instead, I'll make a +few observations:
+ +First, you're right that LLVM does lose information. For example, as of this +writing, there is no way to distinguish in the LLVM IR whether an SSA-value came +from a C "int" or a C "long" on an ILP32 machine (other than debug info). Both +get compiled down to an 'i32' value and the information about what it came from +is lost. The more general issue here is that the LLVM type system uses +"structural equivalence" instead of "name equivalence". Another place this +surprises people is if you have two types in a high-level language that have the +same structure (e.g. two different structs that have a single int field): these +types will compile down into a single LLVM type and it will be impossible to +tell what it came from.
+ +Second, while LLVM does lose information, LLVM is not a fixed target: we +continue to enhance and improve it in many different ways. In addition to +adding new features (LLVM did not always support exceptions or debug info), we +also extend the IR to capture important information for optimization (e.g. +whether an argument is sign or zero extended, information about pointers +aliasing, etc. Many of the enhancements are user-driven: people want LLVM to +do some specific feature, so they go ahead and extend it to do so.
+ +Third, it is certainly possible to add language-specific +optimizations, and you have a number of choices in how to do it. As one trivial +example, it is possible to add language-specific optimization passes that +"known" things about code compiled for a language. In the case of the C family, +there is an optimziation pass that "knows" about the standard C library +functions. If you call "exit(0)" in main(), it knows that it is safe to +optimize that into "return 0;" for example, because C specifies what the 'exit' +function does.
+ +In addition to simple library knowledge, it is possible to embed a variety of +other language-specific information into the LLVM IR. If you have a specific +need and run into a wall, please bring the topic up on the llvmdev list. At the +very worst, you can always treat LLVM as if it were a "dumb code generator" and +implement the high-level optimizations you desire in your front-end on the +language-specific AST. +
+ +There is a variety of useful tips and tricks that you come to know after +working on/with LLVM that aren't obvious at first glance. Instead of letting +everyone rediscover them, this section talks about some of these issues.
+ +One interesting thing that comes up if you are trying to keep the code +generated by your compiler "target independent" is that you often need to know +the size of some LLVM type or the offset of some field in an llvm structure. +For example, you might need to pass the size of a type into a function that +allocates memory.
+ +Unfortunately, this can vary widely across targets: for example the width of +a pointer is trivially target-specific. However, there is a clever +way to use the getelementptr instruction that allows you to compute this +in a portable way.
+ +Some languages want to explicitly manage their stack frames, often so that +they are garbage collected or to allow easy implementation of closures. There +are often better ways to implement these features than explicit stack frames, +but LLVM +does support them if you want. It requires your front-end to convert the +code into Continuation +Passing Style and use of tail calls (which LLVM also supports).