mirror of
				https://github.com/c64scene-ar/llvm-6502.git
				synced 2025-11-04 05:17:07 +00:00 
			
		
		
		
	Sorry for the massive commit, but I just wanted to knock this one down and it is really straightforward. There are still a couple trivial (i.e. not related to the content) things left to fix: - Use of raw HTML links where :doc:`...` and :ref:`...` could be used instead. If you are a newbie and want to help fix this it would make for some good bite-sized patches; more experienced developers should be focusing on adding new content (to this tutorial or elsewhere, but please _do not_ waste your time on formatting when there is such dire need for documentation (see docs/SphinxQuickstartTemplate.rst to get started writing)). - Highlighting of the kaleidoscope code blocks (currently left as bare `::`). I will be working on writing a custom Pygments highlighter for this, mostly as training for maintaining the `llvm` code-block's lexer in-tree. I want to do this because I am extremely unhappy with how it just "gives up" on the slightest deviation from the expected syntax and leaves the whole code-block un-highlighted. More generally I am looking at writing some Sphinx extensions and keeping them in-tree as well, to support common use cases that currently have no good solution (like "monospace text inside a link"). git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@169343 91177308-0d34-0410-b5e6-96231b3b80d8
		
			
				
	
	
		
			270 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			270 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
======================================================
 | 
						|
Kaleidoscope: Conclusion and other useful LLVM tidbits
 | 
						|
======================================================
 | 
						|
 | 
						|
.. contents::
 | 
						|
   :local:
 | 
						|
 | 
						|
Written by `Chris Lattner <mailto:sabre@nondot.org>`_
 | 
						|
 | 
						|
Tutorial Conclusion
 | 
						|
===================
 | 
						|
 | 
						|
Welcome to the final chapter of the "`Implementing a language with
 | 
						|
LLVM <index.html>`_" tutorial. In the course of this tutorial, we have
 | 
						|
grown our little Kaleidoscope language from being a useless toy, to
 | 
						|
being a semi-interesting (but probably still useless) toy. :)
 | 
						|
 | 
						|
It is interesting to see how far we've come, and how little code it has
 | 
						|
taken. We built the entire lexer, parser, AST, code generator, and an
 | 
						|
interactive run-loop (with a JIT!) by-hand in under 700 lines of
 | 
						|
(non-comment/non-blank) code.
 | 
						|
 | 
						|
Our little language supports a couple of interesting features: it
 | 
						|
supports user defined binary and unary operators, it uses JIT
 | 
						|
compilation for immediate evaluation, and it supports a few control flow
 | 
						|
constructs with SSA construction.
 | 
						|
 | 
						|
Part of the idea of this tutorial was to show you how easy and fun it
 | 
						|
can be to define, build, and play with languages. Building a compiler
 | 
						|
need not be a scary or mystical process! Now that you've seen some of
 | 
						|
the basics, I strongly encourage you to take the code and hack on it.
 | 
						|
For example, try adding:
 | 
						|
 | 
						|
-  **global variables** - While global variables have questional value
 | 
						|
   in modern software engineering, they are often useful when putting
 | 
						|
   together quick little hacks like the Kaleidoscope compiler itself.
 | 
						|
   Fortunately, our current setup makes it very easy to add global
 | 
						|
   variables: just have value lookup check to see if an unresolved
 | 
						|
   variable is in the global variable symbol table before rejecting it.
 | 
						|
   To create a new global variable, make an instance of the LLVM
 | 
						|
   ``GlobalVariable`` class.
 | 
						|
-  **typed variables** - Kaleidoscope currently only supports variables
 | 
						|
   of type double. This gives the language a very nice elegance, because
 | 
						|
   only supporting one type means that you never have to specify types.
 | 
						|
   Different languages have different ways of handling this. The easiest
 | 
						|
   way is to require the user to specify types for every variable
 | 
						|
   definition, and record the type of the variable in the symbol table
 | 
						|
   along with its Value\*.
 | 
						|
-  **arrays, structs, vectors, etc** - Once you add types, you can start
 | 
						|
   extending the type system in all sorts of interesting ways. Simple
 | 
						|
   arrays are very easy and are quite useful for many different
 | 
						|
   applications. Adding them is mostly an exercise in learning how the
 | 
						|
   LLVM `getelementptr <../LangRef.html#i_getelementptr>`_ instruction
 | 
						|
   works: it is so nifty/unconventional, it `has its own
 | 
						|
   FAQ <../GetElementPtr.html>`_! If you add support for recursive types
 | 
						|
   (e.g. linked lists), make sure to read the `section in the LLVM
 | 
						|
   Programmer's Manual <../ProgrammersManual.html#TypeResolve>`_ that
 | 
						|
   describes how to construct them.
 | 
						|
-  **standard runtime** - Our current language allows the user to access
 | 
						|
   arbitrary external functions, and we use it for things like "printd"
 | 
						|
   and "putchard". As you extend the language to add higher-level
 | 
						|
   constructs, often these constructs make the most sense if they are
 | 
						|
   lowered to calls into a language-supplied runtime. For example, if
 | 
						|
   you add hash tables to the language, it would probably make sense to
 | 
						|
   add the routines to a runtime, instead of inlining them all the way.
 | 
						|
-  **memory management** - Currently we can only access the stack in
 | 
						|
   Kaleidoscope. It would also be useful to be able to allocate heap
 | 
						|
   memory, either with calls to the standard libc malloc/free interface
 | 
						|
   or with a garbage collector. If you would like to use garbage
 | 
						|
   collection, note that LLVM fully supports `Accurate Garbage
 | 
						|
   Collection <../GarbageCollection.html>`_ including algorithms that
 | 
						|
   move objects and need to scan/update the stack.
 | 
						|
-  **debugger support** - LLVM supports generation of `DWARF Debug
 | 
						|
   info <../SourceLevelDebugging.html>`_ which is understood by common
 | 
						|
   debuggers like GDB. Adding support for debug info is fairly
 | 
						|
   straightforward. The best way to understand it is to compile some
 | 
						|
   C/C++ code with "``llvm-gcc -g -O0``" and taking a look at what it
 | 
						|
   produces.
 | 
						|
-  **exception handling support** - LLVM supports generation of `zero
 | 
						|
   cost exceptions <../ExceptionHandling.html>`_ which interoperate with
 | 
						|
   code compiled in other languages. You could also generate code by
 | 
						|
   implicitly making every function return an error value and checking
 | 
						|
   it. You could also make explicit use of setjmp/longjmp. There are
 | 
						|
   many different ways to go here.
 | 
						|
-  **object orientation, generics, database access, complex numbers,
 | 
						|
   geometric programming, ...** - Really, there is no end of crazy
 | 
						|
   features that you can add to the language.
 | 
						|
-  **unusual domains** - We've been talking about applying LLVM to a
 | 
						|
   domain that many people are interested in: building a compiler for a
 | 
						|
   specific language. However, there are many other domains that can use
 | 
						|
   compiler technology that are not typically considered. For example,
 | 
						|
   LLVM has been used to implement OpenGL graphics acceleration,
 | 
						|
   translate C++ code to ActionScript, and many other cute and clever
 | 
						|
   things. Maybe you will be the first to JIT compile a regular
 | 
						|
   expression interpreter into native code with LLVM?
 | 
						|
 | 
						|
Have fun - try doing something crazy and unusual. Building a language
 | 
						|
like everyone else always has, is much less fun than trying something a
 | 
						|
little crazy or off the wall and seeing how it turns out. If you get
 | 
						|
stuck or want to talk about it, feel free to email the `llvmdev mailing
 | 
						|
list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots
 | 
						|
of people who are interested in languages and are often willing to help
 | 
						|
out.
 | 
						|
 | 
						|
Before we end this tutorial, I want to talk about some "tips and tricks"
 | 
						|
for generating LLVM IR. These are some of the more subtle things that
 | 
						|
may not be obvious, but are very useful if you want to take advantage of
 | 
						|
LLVM's capabilities.
 | 
						|
 | 
						|
Properties of the LLVM IR
 | 
						|
=========================
 | 
						|
 | 
						|
We have a couple common questions about code in the LLVM IR form - lets
 | 
						|
just get these out of the way right now, shall we?
 | 
						|
 | 
						|
Target Independence
 | 
						|
-------------------
 | 
						|
 | 
						|
Kaleidoscope is an example of a "portable language": any program written
 | 
						|
in Kaleidoscope will work the same way on any target that it runs on.
 | 
						|
Many other languages have this property, e.g. lisp, java, haskell,
 | 
						|
javascript, python, etc (note that while these languages are portable,
 | 
						|
not all their libraries are).
 | 
						|
 | 
						|
One nice aspect of LLVM is that it is often capable of preserving target
 | 
						|
independence in the IR: you can take the LLVM IR for a
 | 
						|
Kaleidoscope-compiled program and run it on any target that LLVM
 | 
						|
supports, even emitting C code and compiling that on targets that LLVM
 | 
						|
doesn't support natively. You can trivially tell that the Kaleidoscope
 | 
						|
compiler generates target-independent code because it never queries for
 | 
						|
any target-specific information when generating code.
 | 
						|
 | 
						|
The fact that LLVM provides a compact, target-independent,
 | 
						|
representation for code gets a lot of people excited. Unfortunately,
 | 
						|
these people are usually thinking about C or a language from the C
 | 
						|
family when they are asking questions about language portability. I say
 | 
						|
"unfortunately", because there is really no way to make (fully general)
 | 
						|
C code portable, other than shipping the source code around (and of
 | 
						|
course, C source code is not actually portable in general either - ever
 | 
						|
port a really old application from 32- to 64-bits?).
 | 
						|
 | 
						|
The problem with C (again, in its full generality) is that it is heavily
 | 
						|
laden with target specific assumptions. As one simple example, the
 | 
						|
preprocessor often destructively removes target-independence from the
 | 
						|
code when it processes the input text:
 | 
						|
 | 
						|
.. code-block:: c
 | 
						|
 | 
						|
    #ifdef __i386__
 | 
						|
      int X = 1;
 | 
						|
    #else
 | 
						|
      int X = 42;
 | 
						|
    #endif
 | 
						|
 | 
						|
While it is possible to engineer more and more complex solutions to
 | 
						|
problems like this, it cannot be solved in full generality in a way that
 | 
						|
is better than shipping the actual source code.
 | 
						|
 | 
						|
That said, there are interesting subsets of C that can be made portable.
 | 
						|
If you are willing to fix primitive types to a fixed size (say int =
 | 
						|
32-bits, and long = 64-bits), don't care about ABI compatibility with
 | 
						|
existing binaries, and are willing to give up some other minor features,
 | 
						|
you can have portable code. This can make sense for specialized domains
 | 
						|
such as an in-kernel language.
 | 
						|
 | 
						|
Safety Guarantees
 | 
						|
-----------------
 | 
						|
 | 
						|
Many of the languages above are also "safe" languages: it is impossible
 | 
						|
for a program written in Java to corrupt its address space and crash the
 | 
						|
process (assuming the JVM has no bugs). Safety is an interesting
 | 
						|
property that requires a combination of language design, runtime
 | 
						|
support, and often operating system support.
 | 
						|
 | 
						|
It is certainly possible to implement a safe language in LLVM, but LLVM
 | 
						|
IR does not itself guarantee safety. The LLVM IR allows unsafe pointer
 | 
						|
casts, use after free bugs, buffer over-runs, and a variety of other
 | 
						|
problems. Safety needs to be implemented as a layer on top of LLVM and,
 | 
						|
conveniently, several groups have investigated this. Ask on the `llvmdev
 | 
						|
mailing list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if
 | 
						|
you are interested in more details.
 | 
						|
 | 
						|
Language-Specific Optimizations
 | 
						|
-------------------------------
 | 
						|
 | 
						|
One thing about LLVM that turns off many people is that it does not
 | 
						|
solve all the world's problems in one system (sorry 'world hunger',
 | 
						|
someone else will have to solve you some other day). One specific
 | 
						|
complaint is that people perceive LLVM as being incapable of performing
 | 
						|
high-level language-specific optimization: LLVM "loses too much
 | 
						|
information".
 | 
						|
 | 
						|
Unfortunately, this is really not the place to give you a full and
 | 
						|
unified version of "Chris Lattner's theory of compiler design". Instead,
 | 
						|
I'll make a few observations:
 | 
						|
 | 
						|
First, you're right that LLVM does lose information. For example, as of
 | 
						|
this writing, there is no way to distinguish in the LLVM IR whether an
 | 
						|
SSA-value came from a C "int" or a C "long" on an ILP32 machine (other
 | 
						|
than debug info). Both get compiled down to an 'i32' value and the
 | 
						|
information about what it came from is lost. The more general issue
 | 
						|
here, is that the LLVM type system uses "structural equivalence" instead
 | 
						|
of "name equivalence". Another place this surprises people is if you
 | 
						|
have two types in a high-level language that have the same structure
 | 
						|
(e.g. two different structs that have a single int field): these types
 | 
						|
will compile down into a single LLVM type and it will be impossible to
 | 
						|
tell what it came from.
 | 
						|
 | 
						|
Second, while LLVM does lose information, LLVM is not a fixed target: we
 | 
						|
continue to enhance and improve it in many different ways. In addition
 | 
						|
to adding new features (LLVM did not always support exceptions or debug
 | 
						|
info), we also extend the IR to capture important information for
 | 
						|
optimization (e.g. whether an argument is sign or zero extended,
 | 
						|
information about pointers aliasing, etc). Many of the enhancements are
 | 
						|
user-driven: people want LLVM to include some specific feature, so they
 | 
						|
go ahead and extend it.
 | 
						|
 | 
						|
Third, it is *possible and easy* to add language-specific optimizations,
 | 
						|
and you have a number of choices in how to do it. As one trivial
 | 
						|
example, it is easy to add language-specific optimization passes that
 | 
						|
"know" things about code compiled for a language. In the case of the C
 | 
						|
family, there is an optimization pass that "knows" about the standard C
 | 
						|
library functions. If you call "exit(0)" in main(), it knows that it is
 | 
						|
safe to optimize that into "return 0;" because C specifies what the
 | 
						|
'exit' function does.
 | 
						|
 | 
						|
In addition to simple library knowledge, it is possible to embed a
 | 
						|
variety of other language-specific information into the LLVM IR. If you
 | 
						|
have a specific need and run into a wall, please bring the topic up on
 | 
						|
the llvmdev list. At the very worst, you can always treat LLVM as if it
 | 
						|
were a "dumb code generator" and implement the high-level optimizations
 | 
						|
you desire in your front-end, on the language-specific AST.
 | 
						|
 | 
						|
Tips and Tricks
 | 
						|
===============
 | 
						|
 | 
						|
There is a variety of useful tips and tricks that you come to know after
 | 
						|
working on/with LLVM that aren't obvious at first glance. Instead of
 | 
						|
letting everyone rediscover them, this section talks about some of these
 | 
						|
issues.
 | 
						|
 | 
						|
Implementing portable offsetof/sizeof
 | 
						|
-------------------------------------
 | 
						|
 | 
						|
One interesting thing that comes up, if you are trying to keep the code
 | 
						|
generated by your compiler "target independent", is that you often need
 | 
						|
to know the size of some LLVM type or the offset of some field in an
 | 
						|
llvm structure. For example, you might need to pass the size of a type
 | 
						|
into a function that allocates memory.
 | 
						|
 | 
						|
Unfortunately, this can vary widely across targets: for example the
 | 
						|
width of a pointer is trivially target-specific. However, there is a
 | 
						|
`clever way to use the getelementptr
 | 
						|
instruction <http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_
 | 
						|
that allows you to compute this in a portable way.
 | 
						|
 | 
						|
Garbage Collected Stack Frames
 | 
						|
------------------------------
 | 
						|
 | 
						|
Some languages want to explicitly manage their stack frames, often so
 | 
						|
that they are garbage collected or to allow easy implementation of
 | 
						|
closures. There are often better ways to implement these features than
 | 
						|
explicit stack frames, but `LLVM does support
 | 
						|
them, <http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_
 | 
						|
if you want. It requires your front-end to convert the code into
 | 
						|
`Continuation Passing
 | 
						|
Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ and
 | 
						|
the use of tail calls (which LLVM also supports).
 | 
						|
 |