mirror of
https://github.com/c64scene-ar/llvm-6502.git
synced 2025-09-24 23:28:41 +00:00
Revert r103213. It broke several sections of live website.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@103219 91177308-0d34-0410-b5e6-96231b3b80d8
348
docs/tutorial/LangImpl1.html
Normal file
@@ -0,0 +1,348 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
||||
"http://www.w3.org/TR/html4/strict.dtd">
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>Kaleidoscope: Tutorial Introduction and the Lexer</title>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
||||
<meta name="author" content="Chris Lattner">
|
||||
<link rel="stylesheet" href="../llvm.css" type="text/css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="doc_title">Kaleidoscope: Tutorial Introduction and the Lexer</div>
|
||||
|
||||
<ul>
|
||||
<li><a href="index.html">Up to Tutorial Index</a></li>
|
||||
<li>Chapter 1
|
||||
<ol>
|
||||
<li><a href="#intro">Tutorial Introduction</a></li>
|
||||
<li><a href="#language">The Basic Language</a></li>
|
||||
<li><a href="#lexer">The Lexer</a></li>
|
||||
</ol>
|
||||
</li>
|
||||
<li><a href="LangImpl2.html">Chapter 2</a>: Implementing a Parser and AST</li>
|
||||
</ul>
|
||||
|
||||
<div class="doc_author">
|
||||
<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
|
||||
runs through the implementation of a simple language, showing how fun and
|
||||
easy it can be. This tutorial will get you up and started as well as help to
|
||||
build a framework you can extend to other languages. The code in this tutorial
|
||||
can also be used as a playground to hack on other LLVM-specific things.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The goal of this tutorial is to progressively unveil our language, describing
|
||||
how it is built up over time. This will let us cover a fairly broad range of
|
||||
language design and LLVM-specific usage issues, showing and explaining the code
|
||||
for it all along the way, without overwhelming you with tons of details up
|
||||
front.</p>
|
||||
|
||||
<p>It is useful to point out ahead of time that this tutorial is really about
|
||||
teaching compiler techniques and LLVM specifically, <em>not</em> about teaching
|
||||
modern and sane software engineering principles. In practice, this means that
|
||||
we'll take a number of shortcuts to simplify the exposition. For example, the
|
||||
code leaks memory, uses global variables all over the place, doesn't use nice
|
||||
design patterns like <a
|
||||
href="http://en.wikipedia.org/wiki/Visitor_pattern">visitors</a>, etc... but it
|
||||
is very simple. If you dig in and use the code as a basis for future projects,
|
||||
fixing these deficiencies shouldn't be hard.</p>
|
||||
|
||||
<p>I've tried to put this tutorial together in a way that makes chapters easy to
|
||||
skip over if you are already familiar with or are uninterested in the various
|
||||
pieces. The structure of the tutorial is:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><b><a href="#language">Chapter #1</a>: Introduction to the Kaleidoscope
|
||||
language, and the definition of its Lexer</b> - This shows where we are going
|
||||
and the basic functionality that we want it to do. In order to make this
|
||||
tutorial maximally understandable and hackable, we choose to implement
|
||||
everything in C++ instead of using lexer and parser generators. LLVM obviously
|
||||
works just fine with such tools; feel free to use one if you prefer.</li>
|
||||
<li><b><a href="LangImpl2.html">Chapter #2</a>: Implementing a Parser and
|
||||
AST</b> - With the lexer in place, we can talk about parsing techniques and
|
||||
basic AST construction. This tutorial describes recursive descent parsing and
|
||||
operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific;
|
||||
the code doesn't even link in LLVM at this point. :)</li>
|
||||
<li><b><a href="LangImpl3.html">Chapter #3</a>: Code generation to LLVM IR</b> -
|
||||
With the AST ready, we can show off how easy generation of LLVM IR really
|
||||
is.</li>
|
||||
<li><b><a href="LangImpl4.html">Chapter #4</a>: Adding JIT and Optimizer
|
||||
Support</b> - Because a lot of people are interested in using LLVM as a JIT,
|
||||
we'll dive right into it and show you the 3 lines it takes to add JIT support.
|
||||
LLVM is also useful in many other ways, but this is one simple and "sexy" way
|
||||
to show off its power. :)</li>
|
||||
<li><b><a href="LangImpl5.html">Chapter #5</a>: Extending the Language: Control
|
||||
Flow</b> - With the language up and running, we show how to extend it with
|
||||
control flow operations (if/then/else and a 'for' loop). This gives us a chance
|
||||
to talk about simple SSA construction and control flow.</li>
|
||||
<li><b><a href="LangImpl6.html">Chapter #6</a>: Extending the Language:
|
||||
User-defined Operators</b> - This is a silly but fun chapter that talks about
|
||||
extending the language to let user programs define their own arbitrary
|
||||
unary and binary operators (with assignable precedence!). This lets us build a
|
||||
significant piece of the "language" as library routines.</li>
|
||||
<li><b><a href="LangImpl7.html">Chapter #7</a>: Extending the Language: Mutable
|
||||
Variables</b> - This chapter talks about adding user-defined local variables
|
||||
along with an assignment operator. The interesting part about this is how
|
||||
easy and trivial it is to construct SSA form in LLVM: no, LLVM does <em>not</em>
|
||||
require your front-end to construct SSA form!</li>
|
||||
<li><b><a href="LangImpl8.html">Chapter #8</a>: Conclusion and other useful LLVM
|
||||
tidbits</b> - This chapter wraps up the series by talking about potential
|
||||
ways to extend the language, but also includes a bunch of pointers to info about
|
||||
"special topics" like adding garbage collection support, exceptions, debugging,
|
||||
support for "spaghetti stacks", and a bunch of other tips and tricks.</li>
|
||||
|
||||
</ul>
|
||||
|
||||
<p>By the end of the tutorial, we'll have written a bit less than 700
non-comment, non-blank lines of code. With this small amount of code, we'll
|
||||
have built up a very reasonable compiler for a non-trivial language including
|
||||
a hand-written lexer, parser, AST, as well as code generation support with a JIT
|
||||
compiler. While other systems may have interesting "hello world" tutorials,
|
||||
I think the breadth of this tutorial is a great testament to the strengths of
|
||||
LLVM and why you should consider it if you're interested in language or compiler
|
||||
design.</p>
|
||||
|
||||
<p>A note about this tutorial: we expect you to extend the language and play
|
||||
with it on your own. Take the code and go crazy hacking away at it; compilers
|
||||
don't need to be scary creatures - it can be a lot of fun to play with
|
||||
languages!</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="language">The Basic Language</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>This tutorial will be illustrated with a toy language that we'll call
|
||||
"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>" (derived
|
||||
from Greek roots meaning "beautiful", "form", and "view").
|
||||
Kaleidoscope is a procedural language that allows you to define functions, use
|
||||
conditionals, math, etc. Over the course of the tutorial, we'll extend
|
||||
Kaleidoscope to support the if/then/else construct, a for loop, user-defined
|
||||
operators, JIT compilation with a simple command line interface, etc.</p>
|
||||
|
||||
<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a
|
||||
64-bit floating point type (aka 'double' in C parlance). As such, all values
|
||||
are implicitly double precision and the language doesn't require type
|
||||
declarations. This gives the language a very nice and simple syntax. For
|
||||
example, the following code computes <a
href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
# Compute the x'th fibonacci number.
def fib(x)
  if x &lt; 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
</pre>
|
||||
</div>
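<p>For readers more at home in C++, the same function can be sketched roughly
as follows. This is only a comparison aid and not code from the tutorial; it
assumes that a Kaleidoscope comparison used as a condition behaves like the C++
comparison shown here.</p>

```cpp
// C++ rendering of the Kaleidoscope fib example, for comparison only.
// Every value is a double, mirroring Kaleidoscope's single datatype.
double fib(double x) {
  if (x < 3)
    return 1;
  return fib(x - 1) + fib(x - 2);
}
```

<p>Because every parameter and result is a double, none of the type
annotations in the C++ version carry information, which is why Kaleidoscope
can omit them.</p>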
|
||||
|
||||
<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
|
||||
JIT makes this completely trivial). This means that you can use the 'extern'
|
||||
keyword to declare a function before you use it (this is also useful for mutually
|
||||
recursive functions). For example:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
extern sin(arg);
extern cos(arg);
extern atan2(arg1 arg2);

atan2(sin(.4), cos(42))
</pre>
|
||||
</div>
|
||||
|
||||
<p>A more interesting example is included in Chapter 6 where we write a little
|
||||
Kaleidoscope application that <a href="LangImpl6.html#example">displays
|
||||
a Mandelbrot Set</a> at various levels of magnification.</p>
|
||||
|
||||
<p>Let's dive into the implementation of this language!</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="lexer">The Lexer</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>When it comes to implementing a language, the first thing needed is
|
||||
the ability to process a text file and recognize what it says. The traditional
|
||||
way to do this is to use a "<a
|
||||
href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
|
||||
to break the input up into "tokens". Each token returned by the lexer includes
|
||||
a token code and potentially some metadata (e.g. the numeric value of a number).
|
||||
First, we define the possibilities:
|
||||
</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
// of these for known things.
enum Token {
  tok_eof = -1,

  // commands
  tok_def = -2, tok_extern = -3,

  // primary
  tok_identifier = -4, tok_number = -5,
};

static std::string IdentifierStr;  // Filled in if tok_identifier
static double NumVal;              // Filled in if tok_number
</pre>
|
||||
</div>
|
||||
|
||||
<p>Each token returned by our lexer will either be one of the Token enum values
|
||||
or it will be an 'unknown' character like '+', which is returned as its ASCII
|
||||
value. If the current token is an identifier, the <tt>IdentifierStr</tt>
|
||||
global variable holds the name of the identifier. If the current token is a
|
||||
numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
|
||||
global variables for simplicity; this is not the best choice for a real language
|
||||
implementation :).
|
||||
</p>
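<p>As a rough illustration of the token encoding described above, the following
sketch maps a token code back to a readable name. <tt>describeToken</tt> is a
hypothetical debugging helper that does not appear in the tutorial, and the enum
is repeated only so that the block compiles on its own.</p>

```cpp
#include <string>

// Same token codes as the enum above; reproduced so the sketch stands alone.
enum Token {
  tok_eof = -1,
  tok_def = -2,
  tok_extern = -3,
  tok_identifier = -4,
  tok_number = -5
};

// describeToken is a hypothetical helper, not part of the tutorial.
// Named tokens are negative; anything in [0-255] is a raw character.
static std::string describeToken(int Tok) {
  switch (Tok) {
  case tok_eof:        return "eof";
  case tok_def:        return "def";
  case tok_extern:     return "extern";
  case tok_identifier: return "identifier";
  case tok_number:     return "number";
  default:
    // Unknown characters such as '+' come back as their own character code.
    return std::string("char '") + static_cast<char>(Tok) + "'";
  }
}
```

<p>Keeping the named tokens negative means they can never collide with a
character code, so a single <tt>int</tt> is enough to carry either kind.</p>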
|
||||
|
||||
<p>The actual implementation of the lexer is a single function named
|
||||
<tt>gettok</tt>. The <tt>gettok</tt> function is called to return the next token
|
||||
from standard input. Its definition starts as:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
/// gettok - Return the next token from standard input.
static int gettok() {
  static int LastChar = ' ';

  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = getchar();
</pre>
|
||||
</div>
|
||||
|
||||
<p>
|
||||
<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
|
||||
characters one at a time from standard input. It eats them as it recognizes
|
||||
them and stores the last character read, but not processed, in <tt>LastChar</tt>. The
|
||||
first thing that it has to do is ignore whitespace between tokens. This is
|
||||
accomplished with the loop above.</p>
|
||||
|
||||
<p>The next thing <tt>gettok</tt> needs to do is recognize identifiers and
|
||||
specific keywords like "def". Kaleidoscope does this with this simple loop:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }
</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that this code sets the '<tt>IdentifierStr</tt>' global whenever it
|
||||
lexes an identifier. Also, since language keywords are matched by the same
|
||||
loop, we handle them here inline. Numeric values are similar:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
  if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }
</pre>
|
||||
</div>
|
||||
|
||||
<p>This is all pretty straightforward code for processing input. When reading
|
||||
a numeric value from input, we use the C <tt>strtod</tt> function to convert it
|
||||
to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
|
||||
sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
|
||||
if you typed in "1.23". Feel free to extend it :). Next we handle comments:
|
||||
</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
  if (LastChar == '#') {
    // Comment until end of line.
    do LastChar = getchar();
    while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');

    if (LastChar != EOF)
      return gettok();
  }
</pre>
|
||||
</div>
|
||||
|
||||
<p>We handle comments by skipping to the end of the line and then returning the
|
||||
next token. Finally, if the input doesn't match one of the above cases, it is
|
||||
either an operator character like '+' or the end of the file. These are handled
|
||||
with this code:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
  // Check for end of file.  Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;

  // Otherwise, just return the character as its ASCII value.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}
</pre>
|
||||
</div>
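<p>The fragments above can be assembled into one function. The sketch below
does so under two assumptions that are not in the tutorial: it reads from a
string through a hypothetical <tt>nextChar()</tt> helper instead of calling
<tt>getchar()</tt>, and it hoists <tt>LastChar</tt> to a global so that the
hypothetical <tt>resetLexer()</tt> helper can restart lexing on new input.</p>

```cpp
#include <cctype>
#include <cstdio>   // EOF
#include <cstdlib>  // strtod
#include <string>

enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5
};

static std::string IdentifierStr; // Filled in if tok_identifier
static double NumVal;             // Filled in if tok_number

// Hypothetical stand-ins for standard input so the sketch can be driven
// from a string; the tutorial itself calls getchar() directly.
static std::string Input;
static std::string::size_type InputPos = 0;
static int LastChar = ' ';

static int nextChar() {
  if (InputPos >= Input.size())
    return EOF;
  return static_cast<unsigned char>(Input[InputPos++]);
}

// Hypothetical helper: point the lexer at a new input string.
static void resetLexer(const std::string &Text) {
  Input = Text;
  InputPos = 0;
  LastChar = ' ';
}

/// gettok - Return the next token from the input string.
static int gettok() {
  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = nextChar();

  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = static_cast<char>(LastChar);
    while (isalnum((LastChar = nextChar())))
      IdentifierStr += static_cast<char>(LastChar);

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }

  if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += static_cast<char>(LastChar);
      LastChar = nextChar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }

  if (LastChar == '#') {
    // Comment until end of line.
    do LastChar = nextChar();
    while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

    if (LastChar != EOF)
      return gettok();
  }

  // Check for end of file.  Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;

  // Otherwise, just return the character as its ASCII value.
  int ThisChar = LastChar;
  LastChar = nextChar();
  return ThisChar;
}
```

<p>Calling <tt>resetLexer("def foo(x) x+4.5")</tt> and then <tt>gettok()</tt>
repeatedly should yield <tt>tok_def</tt>, an identifier, the characters
<tt>'('</tt> and <tt>')'</tt> around another identifier, and so on until
<tt>tok_eof</tt>.</p>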
|
||||
|
||||
<p>With this, we have the complete lexer for the basic Kaleidoscope language
|
||||
(the <a href="LangImpl2.html#code">full code listing</a> for the Lexer is
|
||||
available in the <a href="LangImpl2.html">next chapter</a> of the tutorial).
|
||||
Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
|
||||
build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
|
||||
so that you can use the lexer and parser together.
|
||||
</p>
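<p>As an optional extension, the lenient handling of input such as
"1.23.45.67" noted earlier can be tightened by checking how much of the string
<tt>strtod</tt> actually consumed. The sketch below is one possible approach;
<tt>parseNumber</tt> is a hypothetical helper and not part of the tutorial's
lexer.</p>

```cpp
#include <cstdlib>  // strtod
#include <string>

// parseNumber is a hypothetical helper, not part of the tutorial's lexer.
// It returns true only if strtod consumed every character of NumStr.
static bool parseNumber(const std::string &NumStr, double &Result) {
  const char *Begin = NumStr.c_str();
  char *End = 0;
  double Val = strtod(Begin, &End);

  // End == Begin means nothing was converted (e.g. a lone ".").
  // *End != '\0' means trailing characters were left over (e.g. ".45.67").
  if (End == Begin || *End != '\0')
    return false;

  Result = Val;
  return true;
}
```

<p>How the lexer should react to a <tt>false</tt> result is left open here,
since the tutorial's <tt>Token</tt> enum defines no error token; adding one
would be a further assumption.</p>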
|
||||
|
||||
<a href="LangImpl2.html">Next: Implementing a Parser and AST</a>
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<hr>
|
||||
<address>
|
||||
<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
|
||||
src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
|
||||
<a href="http://validator.w3.org/check/referer"><img
|
||||
src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
|
||||
|
||||
<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
|
||||
<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
|
||||
Last modified: $Date$
|
||||
</address>
|
||||
</body>
|
||||
</html>
|
1233
docs/tutorial/LangImpl2.html
Normal file
File diff suppressed because it is too large
1269
docs/tutorial/LangImpl3.html
Normal file
File diff suppressed because it is too large
1132
docs/tutorial/LangImpl4.html
Normal file
File diff suppressed because it is too large
BIN
docs/tutorial/LangImpl5-cfg.png
Normal file
Binary file not shown.
Size: 38 KiB
1777
docs/tutorial/LangImpl5.html
Normal file
File diff suppressed because it is too large
1814
docs/tutorial/LangImpl6.html
Normal file
File diff suppressed because it is too large
2164
docs/tutorial/LangImpl7.html
Normal file
File diff suppressed because it is too large
365
docs/tutorial/LangImpl8.html
Normal file
@@ -0,0 +1,365 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
||||
"http://www.w3.org/TR/html4/strict.dtd">
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>Kaleidoscope: Conclusion and other useful LLVM tidbits</title>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
||||
<meta name="author" content="Chris Lattner">
|
||||
<link rel="stylesheet" href="../llvm.css" type="text/css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="doc_title">Kaleidoscope: Conclusion and other useful LLVM
|
||||
tidbits</div>
|
||||
|
||||
<ul>
|
||||
<li><a href="index.html">Up to Tutorial Index</a></li>
|
||||
<li>Chapter 8
|
||||
<ol>
|
||||
<li><a href="#conclusion">Tutorial Conclusion</a></li>
|
||||
<li><a href="#llvmirproperties">Properties of the LLVM IR</a>
|
||||
<ul>
|
||||
<li><a href="#targetindep">Target Independence</a></li>
|
||||
<li><a href="#safety">Safety Guarantees</a></li>
|
||||
<li><a href="#langspecific">Language-Specific Optimizations</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li><a href="#tipsandtricks">Tips and Tricks</a>
|
||||
<ul>
|
||||
<li><a href="#offsetofsizeof">Implementing portable
|
||||
offsetof/sizeof</a></li>
|
||||
<li><a href="#gcstack">Garbage Collected Stack Frames</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ol>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
|
||||
<div class="doc_author">
|
||||
<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="conclusion">Tutorial Conclusion</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Welcome to the final chapter of the "<a href="index.html">Implementing a
|
||||
language with LLVM</a>" tutorial. In the course of this tutorial, we have grown
|
||||
our little Kaleidoscope language from being a useless toy, to being a
|
||||
semi-interesting (but probably still useless) toy. :)</p>
|
||||
|
||||
<p>It is interesting to see how far we've come, and how little code it has
|
||||
taken. We built the entire lexer, parser, AST, code generator, and an
|
||||
interactive run-loop (with a JIT!) by hand in under 700 lines of
|
||||
(non-comment/non-blank) code.</p>
|
||||
|
||||
<p>Our little language supports a couple of interesting features: it supports
|
||||
user-defined binary and unary operators, it uses JIT compilation for immediate
|
||||
evaluation, and it supports a few control flow constructs with SSA construction.
|
||||
</p>
|
||||
|
||||
<p>Part of the idea of this tutorial was to show you how easy and fun it can be
|
||||
to define, build, and play with languages. Building a compiler need not be a
|
||||
scary or mystical process! Now that you've seen some of the basics, I strongly
|
||||
encourage you to take the code and hack on it. For example, try adding:</p>
|
||||
|
||||
<ul>
|
||||
<li><b>global variables</b> - While global variables have questionable value in
|
||||
modern software engineering, they are often useful when putting together quick
|
||||
little hacks like the Kaleidoscope compiler itself. Fortunately, our current
|
||||
setup makes it very easy to add global variables: just have value lookup check
|
||||
to see if an unresolved variable is in the global variable symbol table before
|
||||
rejecting it. To create a new global variable, make an instance of the LLVM
|
||||
<tt>GlobalVariable</tt> class.</li>
|
||||
|
||||
<li><b>typed variables</b> - Kaleidoscope currently only supports variables of
|
||||
type double. This gives the language a very nice elegance, because only
|
||||
supporting one type means that you never have to specify types. Different
|
||||
languages have different ways of handling this. The easiest way is to require
|
||||
the user to specify types for every variable definition, and record the type
|
||||
of the variable in the symbol table along with its Value*.</li>
|
||||
|
||||
<li><b>arrays, structs, vectors, etc</b> - Once you add types, you can start
|
||||
extending the type system in all sorts of interesting ways. Simple arrays are
|
||||
very easy and are quite useful for many different applications. Adding them is
|
||||
mostly an exercise in learning how the LLVM <a
|
||||
href="../LangRef.html#i_getelementptr">getelementptr</a> instruction works: it
|
||||
is so nifty/unconventional, it <a
|
||||
href="../GetElementPtr.html">has its own FAQ</a>! If you add support
|
||||
for recursive types (e.g. linked lists), make sure to read the <a
|
||||
href="../ProgrammersManual.html#TypeResolve">section in the LLVM
|
||||
Programmer's Manual</a> that describes how to construct them.</li>
|
||||
|
||||
<li><b>standard runtime</b> - Our current language allows the user to access
|
||||
arbitrary external functions, and we use it for things like "printd" and
|
||||
"putchard". As you extend the language to add higher-level constructs, often
|
||||
these constructs make the most sense if they are lowered to calls into a
|
||||
language-supplied runtime. For example, if you add hash tables to the language,
|
||||
it would probably make sense to add the routines to a runtime, instead of
|
||||
inlining them all the way.</li>
|
||||
|
||||
<li><b>memory management</b> - Currently we can only access the stack in
|
||||
Kaleidoscope. It would also be useful to be able to allocate heap memory,
|
||||
either with calls to the standard libc malloc/free interface or with a garbage
|
||||
collector. If you would like to use garbage collection, note that LLVM fully
|
||||
supports <a href="../GarbageCollection.html">Accurate Garbage Collection</a>
|
||||
including algorithms that move objects and need to scan/update the stack.</li>
|
||||
|
||||
<li><b>debugger support</b> - LLVM supports generation of <a
|
||||
href="../SourceLevelDebugging.html">DWARF Debug info</a> which is understood by
|
||||
common debuggers like GDB. Adding support for debug info is fairly
|
||||
straightforward. The best way to understand it is to compile some C/C++ code
|
||||
with "<tt>llvm-gcc -g -O0</tt>" and take a look at what it produces.</li>
|
||||
|
||||
<li><b>exception handling support</b> - LLVM supports generation of <a
|
||||
href="../ExceptionHandling.html">zero cost exceptions</a> which interoperate
|
||||
with code compiled in other languages. You could also generate code by
|
||||
implicitly making every function return an error value and checking it. You
|
||||
could also make explicit use of setjmp/longjmp. There are many different ways
|
||||
to go here.</li>
|
||||
|
||||
<li><b>object orientation, generics, database access, complex numbers,
|
||||
geometric programming, ...</b> - Really, there is
|
||||
no end of crazy features that you can add to the language.</li>
|
||||
|
||||
<li><b>unusual domains</b> - We've been talking about applying LLVM to a domain
|
||||
that many people are interested in: building a compiler for a specific language.
|
||||
However, there are many other domains that can use compiler technology that are
|
||||
not typically considered. For example, LLVM has been used to implement OpenGL
|
||||
graphics acceleration, translate C++ code to ActionScript, and many other
|
||||
cute and clever things. Maybe you will be the first to JIT compile a regular
|
||||
expression interpreter into native code with LLVM?</li>
|
||||
|
||||
</ul>
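<p>The first two suggestions in the list above (a global fallback during
variable lookup, and a type recorded next to each <tt>Value*</tt>) can be
sketched together. Everything named here is hypothetical: <tt>Value</tt> is a
stand-in so that the block compiles without LLVM headers, and
<tt>VarType</tt>, <tt>Symbol</tt>, <tt>LocalValues</tt>,
<tt>GlobalValues</tt> and <tt>lookupVariable</tt> are illustrative names, not
the tutorial's own.</p>

```cpp
#include <map>
#include <string>

// Stand-in for llvm::Value so the sketch compiles without LLVM headers.
struct Value { std::string Name; };

// Hypothetical type tag; Kaleidoscope itself only has double.
enum VarType { ty_double, ty_int };

// One symbol-table entry: the variable's type alongside its Value*.
struct Symbol {
  VarType Type;
  Value *Val;
};

static std::map<std::string, Symbol> LocalValues;  // current function scope
static std::map<std::string, Symbol> GlobalValues; // module-level variables

// Look a name up locally first, then fall back to the globals before
// rejecting it. Returns null if the name is unknown in both tables.
static const Symbol *lookupVariable(const std::string &Name) {
  std::map<std::string, Symbol>::const_iterator I = LocalValues.find(Name);
  if (I != LocalValues.end())
    return &I->second;

  I = GlobalValues.find(Name);
  if (I != GlobalValues.end())
    return &I->second;

  return 0; // unresolved: the caller reports the unknown name
}
```

<p>Checking the local table first lets a function-level variable shadow a
global of the same name, which is one reasonable policy among several.</p>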
|
||||
|
||||
<p>
|
||||
Have fun - try doing something crazy and unusual. Building a language like
|
||||
everyone else always has is much less fun than trying something a little crazy
|
||||
or off the wall and seeing how it turns out. If you get stuck or want to talk
|
||||
about it, feel free to email the <a
|
||||
href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing
|
||||
list</a>: it has lots of people who are interested in languages and are often
|
||||
willing to help out.
|
||||
</p>
|
||||
|
||||
<p>Before we end this tutorial, I want to talk about some "tips and tricks" for generating
|
||||
LLVM IR. These are some of the more subtle things that may not be obvious, but
|
||||
are very useful if you want to take advantage of LLVM's capabilities.</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="llvmirproperties">Properties of the LLVM
|
||||
IR</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>We have a couple of common questions about code in the LLVM IR form - let's just
|
||||
get these out of the way right now, shall we?</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ======================================================================= -->
|
||||
<div class="doc_subsubsection"><a name="targetindep">Target
|
||||
Independence</a></div>
|
||||
<!-- ======================================================================= -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Kaleidoscope is an example of a "portable language": any program written in
|
||||
Kaleidoscope will work the same way on any target that it runs on. Many other
|
||||
languages have this property, e.g. lisp, java, haskell, javascript, python, etc
|
||||
(note that while these languages are portable, not all their libraries are).</p>
|
||||
|
||||
<p>One nice aspect of LLVM is that it is often capable of preserving target
|
||||
independence in the IR: you can take the LLVM IR for a Kaleidoscope-compiled
|
||||
program and run it on any target that LLVM supports, even emitting C code and
|
||||
compiling that on targets that LLVM doesn't support natively. You can trivially
|
||||
tell that the Kaleidoscope compiler generates target-independent code because it
|
||||
never queries for any target-specific information when generating code.</p>
|
||||
|
||||
<p>The fact that LLVM provides a compact, target-independent, representation for
|
||||
code gets a lot of people excited. Unfortunately, these people are usually
|
||||
thinking about C or a language from the C family when they are asking questions
|
||||
about language portability. I say "unfortunately", because there is really no
|
||||
way to make (fully general) C code portable, other than shipping the source code
|
||||
around (and of course, C source code is not actually portable in general
|
||||
either - ever port a really old application from 32- to 64-bits?).</p>
|
||||
|
||||
<p>The problem with C (again, in its full generality) is that it is heavily
|
||||
laden with target-specific assumptions. As one simple example, the preprocessor
|
||||
often destructively removes target-independence from the code when it processes
|
||||
the input text:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
#ifdef __i386__
  int X = 1;
#else
  int X = 42;
#endif
</pre>
|
||||
</div>
|
||||
|
||||
<p>While it is possible to engineer more and more complex solutions to problems
|
||||
like this, it cannot be solved in full generality in a way that is better than shipping
|
||||
the actual source code.</p>
|
||||
|
||||
<p>That said, there are interesting subsets of C that can be made portable. If
|
||||
you are willing to fix primitive types to a fixed size (say int = 32-bits,
|
||||
and long = 64-bits), don't care about ABI compatibility with existing binaries,
|
||||
and are willing to give up some other minor features, you can have portable
|
||||
code. This can make sense for specialized domains such as an
|
||||
in-kernel language.</p>
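<p>One way to picture "fixing primitive types to a fixed size" is with
fixed-width typedefs that the compiler checks on every target. The sketch below
assumes a C++11 compiler that provides <tt>std::int32_t</tt> and
<tt>std::int64_t</tt>; the names <tt>portable_int</tt>,
<tt>portable_long</tt> and <tt>bitWidth</tt> are hypothetical.</p>

```cpp
#include <climits>  // CHAR_BIT
#include <cstdint>  // std::int32_t, std::int64_t

// Hypothetical type names for a portable C-like subset: the widths are
// fixed by definition instead of varying with the target.
typedef std::int32_t portable_int;   // always 32 bits
typedef std::int64_t portable_long;  // always 64 bits

static_assert(sizeof(portable_int) * CHAR_BIT == 32, "int must be 32 bits");
static_assert(sizeof(portable_long) * CHAR_BIT == 64, "long must be 64 bits");

// Width in bits of a type, as seen by the compiler for this target.
template <typename T> int bitWidth() {
  return static_cast<int>(sizeof(T) * CHAR_BIT);
}
```

<p>Code written only against such types no longer depends on what the
target's native <tt>int</tt> or <tt>long</tt> happens to be.</p>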
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ======================================================================= -->
|
||||
<div class="doc_subsubsection"><a name="safety">Safety Guarantees</a></div>
|
||||
<!-- ======================================================================= -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Many of the languages above are also "safe" languages: it is impossible for
|
||||
a program written in Java to corrupt its address space and crash the process
|
||||
(assuming the JVM has no bugs).
|
||||
Safety is an interesting property that requires a combination of language
|
||||
design, runtime support, and often operating system support.</p>
|
||||
|
||||
<p>It is certainly possible to implement a safe language in LLVM, but LLVM IR
|
||||
does not itself guarantee safety. The LLVM IR allows unsafe pointer casts,
|
||||
use after free bugs, buffer over-runs, and a variety of other problems. Safety
|
||||
needs to be implemented as a layer on top of LLVM and, conveniently, several
|
||||
groups have investigated this. Ask on the <a
|
||||
href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">llvmdev mailing
|
||||
list</a> if you are interested in more details.</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ======================================================================= -->
|
||||
<div class="doc_subsubsection"><a name="langspecific">Language-Specific
|
||||
Optimizations</a></div>
|
||||
<!-- ======================================================================= -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>One thing about LLVM that turns off many people is that it does not solve all
|
||||
the world's problems in one system (sorry 'world hunger', someone else will have
|
||||
to solve you some other day). One specific complaint is that people perceive
|
||||
LLVM as being incapable of performing high-level language-specific optimization:
|
||||
LLVM "loses too much information".</p>
|
||||
|
||||
<p>Unfortunately, this is really not the place to give you a full and unified
|
||||
version of "Chris Lattner's theory of compiler design". Instead, I'll make a
|
||||
few observations:</p>
|
||||
|
||||
<p>First, you're right that LLVM does lose information. For example, as of this
|
||||
writing, there is no way to distinguish in the LLVM IR whether an SSA-value came
|
||||
from a C "int" or a C "long" on an ILP32 machine (other than debug info). Both
|
||||
get compiled down to an 'i32' value and the information about what it came from
|
||||
is lost. The more general issue here is that the LLVM type system uses
|
||||
"structural equivalence" instead of "name equivalence". Another place this
|
||||
surprises people is if you have two types in a high-level language that have the
|
||||
same structure (e.g. two different structs that have a single int field): these
|
||||
types will compile down into a single LLVM type and it will be impossible to
|
||||
tell what it came from.</p>
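<p>The effect of structural equivalence can be sketched with a toy model. The
OCaml below is only an illustration: <tt>ir_type</tt>, <tt>src_type</tt>, and
<tt>lower</tt> are hypothetical names invented for this example and are not
part of any LLVM API. It assumes an ILP32 target, so both "int" and "long"
lower to a 32-bit integer.</p>

```ocaml
(* A toy model of a structurally typed IR. Hypothetical, not the LLVM API. *)
type ir_type =
  | I32
  | Struct of ir_type list

(* Hypothetical source-level types: structs carry a name. *)
type src_type =
  | Int
  | Long (* assumed to be 32 bits wide, as on an ILP32 target *)
  | Named_struct of string * src_type list

(* Lowering drops the names, so distinct source types can collapse. *)
let rec lower = function
  | Int | Long -> I32
  | Named_struct (_, fields) -> Struct (List.map lower fields)

let meters = Named_struct ("Meters", [ Int ])
let seconds = Named_struct ("Seconds", [ Int ])

let () =
  (* Different at the source level... *)
  assert (meters <> seconds);
  (* ...but indistinguishable after lowering. *)
  assert (lower meters = lower seconds);
  assert (lower Int = lower Long);
  print_endline "Meters and Seconds lower to the same IR type"
```

<p>In this model, nothing in the lowered type records which source type it
came from; a front-end that needs that information has to keep it elsewhere.</p>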
|
||||
|
||||
<p>Second, while LLVM does lose information, LLVM is not a fixed target: we
|
||||
continue to enhance and improve it in many different ways. In addition to
|
||||
adding new features (LLVM did not always support exceptions or debug info), we
|
||||
also extend the IR to capture important information for optimization (e.g.
|
||||
whether an argument is sign or zero extended, information about pointers
|
||||
aliasing, etc). Many of the enhancements are user-driven: people want LLVM to
|
||||
include some specific feature, so they go ahead and extend it.</p>
|
||||
|
||||
<p>Third, it is <em>possible and easy</em> to add language-specific
|
||||
optimizations, and you have a number of choices in how to do it. As one trivial
|
||||
example, it is easy to add language-specific optimization passes that
|
||||
"know" things about code compiled for a language. In the case of the C family,
|
||||
there is an optimization pass that "knows" about the standard C library
|
||||
functions. If you call "exit(0)" in main(), it knows that it is safe to
|
||||
optimize that into "return 0;" because C specifies what the 'exit'
|
||||
function does.</p>
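<p>To make the idea concrete, here is a minimal sketch of such a
library-aware rewrite, written in OCaml over a made-up miniature AST. The types
<tt>expr</tt>, <tt>stmt</tt>, and <tt>func</tt> and the function
<tt>simplify_exit</tt> are assumptions for illustration only; this is not how
the LLVM pass is implemented.</p>

```ocaml
(* A hypothetical mini-AST; none of these names come from LLVM. *)
type expr =
  | Const of int
  | Call of string * expr list

type stmt =
  | Expr of expr
  | Return of expr

type func = { name : string; body : stmt list }

(* In main, a call to exit with one argument ends the program with that
   status, so it can be rewritten as a return. Statements after the call
   are unreachable and are dropped. Other functions are left untouched. *)
let simplify_exit f =
  if f.name <> "main" then f
  else
    let rec go = function
      | [] -> []
      | Expr (Call ("exit", [ status ])) :: _ -> [ Return status ]
      | s :: rest -> s :: go rest
    in
    { f with body = go f.body }

let () =
  let main =
    { name = "main";
      body = [ Expr (Call ("puts", [ Const 1 ])); Expr (Call ("exit", [ Const 0 ])) ] }
  in
  assert ((simplify_exit main).body
          = [ Expr (Call ("puts", [ Const 1 ])); Return (Const 0) ])
```

<p>The rewrite is only valid because the language defines what <tt>exit</tt>
means; a pass with no knowledge of the C library could not make it.</p>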
|
||||
|
||||
<p>In addition to simple library knowledge, it is possible to embed a variety of
|
||||
other language-specific information into the LLVM IR. If you have a specific
|
||||
need and run into a wall, please bring the topic up on the llvmdev list. At the
|
||||
very worst, you can always treat LLVM as if it were a "dumb code generator" and
|
||||
implement the high-level optimizations you desire in your front-end, on the
|
||||
language-specific AST.
|
||||
</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="tipsandtricks">Tips and Tricks</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>There is a variety of useful tips and tricks that you come to know after
|
||||
working on/with LLVM that aren't obvious at first glance. Instead of letting
|
||||
everyone rediscover them, this section talks about some of these issues.</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ======================================================================= -->
|
||||
<div class="doc_subsubsection"><a name="offsetofsizeof">Implementing portable
|
||||
offsetof/sizeof</a></div>
|
||||
<!-- ======================================================================= -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>One interesting thing that comes up, if you are trying to keep the code
|
||||
generated by your compiler "target independent", is that you often need to know
|
||||
the size of some LLVM type or the offset of some field in an LLVM structure.
|
||||
For example, you might need to pass the size of a type into a function that
|
||||
allocates memory.</p>
|
||||
|
||||
<p>Unfortunately, this can vary widely across targets: for example the width of
|
||||
a pointer is trivially target-specific. However, there is a <a
|
||||
href="http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt">clever
|
||||
way to use the getelementptr instruction</a> that allows you to compute this
|
||||
in a portable way.</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ======================================================================= -->
|
||||
<div class="doc_subsubsection"><a name="gcstack">Garbage Collected
|
||||
Stack Frames</a></div>
|
||||
<!-- ======================================================================= -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Some languages want to explicitly manage their stack frames, often so that
|
||||
they are garbage collected or to allow easy implementation of closures. There
|
||||
are often better ways to implement these features than explicit stack frames,
|
||||
but <a
|
||||
href="http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt">LLVM
|
||||
does support them,</a> if you want. It requires your front-end to convert the
|
||||
code into <a
|
||||
href="http://en.wikipedia.org/wiki/Continuation-passing_style">Continuation
|
||||
Passing Style</a> and the use of tail calls (which LLVM also supports).</p>
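<p>For readers unfamiliar with the transformation, the following OCaml sketch
shows the same function in direct style and in continuation-passing style. It
illustrates the idea only and says nothing about how a particular front-end
would emit LLVM IR; the function names are made up for this example.</p>

```ocaml
(* Direct style: the pending addition lives on the native call stack. *)
let rec sum_to n = if n = 0 then 0 else n + sum_to (n - 1)

(* Continuation-passing style: "what to do next" is an explicit closure k,
   and every call is a tail call, so no native stack frame has to survive
   a call. The chain of closures plays the role of the stack frames and
   lives on the garbage-collected heap. *)
let rec sum_to_cps n k =
  if n = 0 then k 0
  else sum_to_cps (n - 1) (fun partial -> k (n + partial))

let () =
  assert (sum_to 10 = 55);
  assert (sum_to_cps 10 (fun x -> x) = 55);
  (* A deep recursion works in CPS because OCaml eliminates tail calls.
     The expected sum assumes OCaml's 63-bit native ints. *)
  assert (sum_to_cps 1_000_000 (fun x -> x) = 500_000_500_000)
```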
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<hr>
|
||||
<address>
|
||||
<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
|
||||
src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
|
||||
<a href="http://validator.w3.org/check/referer"><img
|
||||
src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
|
||||
|
||||
<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
|
||||
<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
|
||||
Last modified: $Date$
|
||||
</address>
|
||||
</body>
|
||||
</html>
|
28
docs/tutorial/Makefile
Normal file
@@ -0,0 +1,28 @@
|
||||
##===- docs/tutorial/Makefile ------------------------------*- Makefile -*-===##
|
||||
#
|
||||
# The LLVM Compiler Infrastructure
|
||||
#
|
||||
# This file is distributed under the University of Illinois Open Source
|
||||
# License. See LICENSE.TXT for details.
|
||||
#
|
||||
##===----------------------------------------------------------------------===##
|
||||
|
||||
LEVEL := ../..
|
||||
include $(LEVEL)/Makefile.common
|
||||
|
||||
HTML := $(wildcard $(PROJ_SRC_DIR)/*.html)
|
||||
EXTRA_DIST := $(HTML) index.html
|
||||
HTML_DIR := $(DESTDIR)$(PROJ_docsdir)/html/tutorial
|
||||
|
||||
install-local:: $(HTML)
|
||||
$(Echo) Installing HTML Tutorial Documentation
|
||||
$(Verb) $(MKDIR) $(HTML_DIR)
|
||||
$(Verb) $(DataInstall) $(HTML) $(HTML_DIR)
|
||||
$(Verb) $(DataInstall) $(PROJ_SRC_DIR)/index.html $(HTML_DIR)
|
||||
|
||||
uninstall-local::
|
||||
$(Echo) Uninstalling Tutorial Documentation
|
||||
$(Verb) $(RM) -rf $(HTML_DIR)
|
||||
|
||||
printvars::
|
||||
$(Echo) "HTML : " '$(HTML)'
|
365
docs/tutorial/OCamlLangImpl1.html
Normal file
@@ -0,0 +1,365 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
||||
"http://www.w3.org/TR/html4/strict.dtd">
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>Kaleidoscope: Tutorial Introduction and the Lexer</title>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
||||
<meta name="author" content="Chris Lattner">
|
||||
<meta name="author" content="Erick Tryzelaar">
|
||||
<link rel="stylesheet" href="../llvm.css" type="text/css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="doc_title">Kaleidoscope: Tutorial Introduction and the Lexer</div>
|
||||
|
||||
<ul>
|
||||
<li><a href="index.html">Up to Tutorial Index</a></li>
|
||||
<li>Chapter 1
|
||||
<ol>
|
||||
<li><a href="#intro">Tutorial Introduction</a></li>
|
||||
<li><a href="#language">The Basic Language</a></li>
|
||||
<li><a href="#lexer">The Lexer</a></li>
|
||||
</ol>
|
||||
</li>
|
||||
<li><a href="OCamlLangImpl2.html">Chapter 2</a>: Implementing a Parser and
|
||||
AST</li>
|
||||
</ul>
|
||||
|
||||
<div class="doc_author">
|
||||
<p>
|
||||
Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
|
||||
and <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a>
|
||||
</p>
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
|
||||
runs through the implementation of a simple language, showing how fun and
|
||||
easy it can be. This tutorial will get you up and started as well as help to
|
||||
build a framework you can extend to other languages. The code in this tutorial
|
||||
can also be used as a playground to hack on other LLVM specific things.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The goal of this tutorial is to progressively unveil our language, describing
|
||||
how it is built up over time. This will let us cover a fairly broad range of
|
||||
language design and LLVM-specific usage issues, showing and explaining the code
|
||||
for it all along the way, without overwhelming you with tons of details up
|
||||
front.</p>
|
||||
|
||||
<p>It is useful to point out ahead of time that this tutorial is really about
|
||||
teaching compiler techniques and LLVM specifically, <em>not</em> about teaching
|
||||
modern and sane software engineering principles. In practice, this means that
|
||||
we'll take a number of shortcuts to simplify the exposition. For example, the
|
||||
code leaks memory, uses global variables all over the place, doesn't use nice
|
||||
design patterns like <a
|
||||
href="http://en.wikipedia.org/wiki/Visitor_pattern">visitors</a>, etc... but it
|
||||
is very simple. If you dig in and use the code as a basis for future projects,
|
||||
fixing these deficiencies shouldn't be hard.</p>
|
||||
|
||||
<p>I've tried to put this tutorial together in a way that makes chapters easy to
|
||||
skip over if you are already familiar with or are uninterested in the various
|
||||
pieces. The structure of the tutorial is:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><b><a href="#language">Chapter #1</a>: Introduction to the Kaleidoscope
|
||||
language, and the definition of its Lexer</b> - This shows where we are going
|
||||
and the basic functionality that we want it to do. In order to make this
|
||||
tutorial maximally understandable and hackable, we choose to implement
|
||||
everything in Objective Caml instead of using lexer and parser generators.
|
||||
LLVM obviously works just fine with such tools; feel free to use one if you
|
||||
prefer.</li>
|
||||
<li><b><a href="OCamlLangImpl2.html">Chapter #2</a>: Implementing a Parser and
|
||||
AST</b> - With the lexer in place, we can talk about parsing techniques and
|
||||
basic AST construction. This tutorial describes recursive descent parsing and
|
||||
operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific;
|
||||
the code doesn't even link in LLVM at this point. :)</li>
|
||||
<li><b><a href="OCamlLangImpl3.html">Chapter #3</a>: Code generation to LLVM
|
||||
IR</b> - With the AST ready, we can show off how easy generation of LLVM IR
|
||||
really is.</li>
|
||||
<li><b><a href="OCamlLangImpl4.html">Chapter #4</a>: Adding JIT and Optimizer
|
||||
Support</b> - Because a lot of people are interested in using LLVM as a JIT,
|
||||
we'll dive right into it and show you the 3 lines it takes to add JIT support.
|
||||
LLVM is also useful in many other ways, but this is one simple and "sexy" way
|
||||
to show off its power. :)</li>
|
||||
<li><b><a href="OCamlLangImpl5.html">Chapter #5</a>: Extending the Language:
|
||||
Control Flow</b> - With the language up and running, we show how to extend it
|
||||
with control flow operations (if/then/else and a 'for' loop). This gives us a
|
||||
chance to talk about simple SSA construction and control flow.</li>
|
||||
<li><b><a href="OCamlLangImpl6.html">Chapter #6</a>: Extending the Language:
|
||||
User-defined Operators</b> - This is a silly but fun chapter that talks about
|
||||
extending the language to let the user program define their own arbitrary
|
||||
unary and binary operators (with assignable precedence!). This lets us build a
|
||||
significant piece of the "language" as library routines.</li>
|
||||
<li><b><a href="OCamlLangImpl7.html">Chapter #7</a>: Extending the Language:
|
||||
Mutable Variables</b> - This chapter talks about adding user-defined local
|
||||
variables along with an assignment operator. The interesting part about this
|
||||
is how easy and trivial it is to construct SSA form in LLVM: no, LLVM does
|
||||
<em>not</em> require your front-end to construct SSA form!</li>
|
||||
<li><b><a href="OCamlLangImpl8.html">Chapter #8</a>: Conclusion and other
|
||||
useful LLVM tidbits</b> - This chapter wraps up the series by talking about
|
||||
potential ways to extend the language, but also includes a bunch of pointers to
|
||||
info about "special topics" like adding garbage collection support, exceptions,
|
||||
debugging, support for "spaghetti stacks", and a bunch of other tips and
|
||||
tricks.</li>
|
||||
|
||||
</ul>
|
||||
|
||||
<p>By the end of the tutorial, we'll have written a bit less than 700
non-comment, non-blank lines of code. With this small amount of code, we'll
|
||||
have built up a very reasonable compiler for a non-trivial language including
|
||||
a hand-written lexer, parser, AST, as well as code generation support with a JIT
|
||||
compiler. While other systems may have interesting "hello world" tutorials,
|
||||
I think the breadth of this tutorial is a great testament to the strengths of
|
||||
LLVM and why you should consider it if you're interested in language or compiler
|
||||
design.</p>
|
||||
|
||||
<p>A note about this tutorial: we expect you to extend the language and play
|
||||
with it on your own. Take the code and go crazy hacking away at it; compilers
|
||||
don't need to be scary creatures - it can be a lot of fun to play with
|
||||
languages!</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="language">The Basic Language</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>This tutorial will be illustrated with a toy language that we'll call
|
||||
"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>" (derived
|
||||
from the Greek roots meaning "beautiful", "form", and "view").
|
||||
Kaleidoscope is a procedural language that allows you to define functions, use
|
||||
conditionals, math, etc. Over the course of the tutorial, we'll extend
|
||||
Kaleidoscope to support the if/then/else construct, a for loop, user defined
|
||||
operators, JIT compilation with a simple command line interface, etc.</p>
|
||||
|
||||
<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a
|
||||
64-bit floating point type (aka 'float' in O'Caml parlance). As such, all
|
||||
values are implicitly double precision and the language doesn't require type
|
||||
declarations. This gives the language a very nice and simple syntax. For
|
||||
example, the following simple example computes <a
|
||||
href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers:</a></p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
# Compute the x'th fibonacci number.
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
|
||||
</pre>
|
||||
</div>
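<p>For comparison, the same function can be sketched in OCaml. This is an
illustrative equivalent rather than part of the tutorial's code; it uses float
literals and the float operators throughout, mirroring the fact that every
Kaleidoscope value is a 64-bit float.</p>

```ocaml
(* OCaml keeps int and float arithmetic separate, so the float version
   needs the literals 3.0, 1.0, 2.0 and the operators -. and +. *)
let rec fib x =
  if x < 3.0 then 1.0
  else fib (x -. 1.0) +. fib (x -. 2.0)

let () = Printf.printf "%g\n" (fib 10.0) (* prints 55 *)
```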
|
||||
|
||||
<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
|
||||
JIT makes this completely trivial). This means that you can use the 'extern'
|
||||
keyword to define a function before you use it (this is also useful for mutually
|
||||
recursive functions). For example:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
extern sin(arg);
|
||||
extern cos(arg);
|
||||
extern atan2(arg1 arg2);
|
||||
|
||||
atan2(sin(.4), cos(42))
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>A more interesting example is included in Chapter 6 where we write a little
|
||||
Kaleidoscope application that <a href="OCamlLangImpl6.html#example">displays
|
||||
a Mandelbrot Set</a> at various levels of magnification.</p>
|
||||
|
||||
<p>Let's dive into the implementation of this language!</p>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<div class="doc_section"><a name="lexer">The Lexer</a></div>
|
||||
<!-- *********************************************************************** -->
|
||||
|
||||
<div class="doc_text">
|
||||
|
||||
<p>When it comes to implementing a language, the first thing needed is
|
||||
the ability to process a text file and recognize what it says. The traditional
|
||||
way to do this is to use a "<a
|
||||
href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
|
||||
to break the input up into "tokens". Each token returned by the lexer includes
|
||||
a token code and potentially some metadata (e.g. the numeric value of a number).
|
||||
First, we define the possibilities:
|
||||
</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
(* The lexer returns 'Kwd' for an unknown character, otherwise one of
 * the other variants for known things. *)
type token =
  (* commands *)
  | Def | Extern

  (* primary *)
  | Ident of string | Number of float

  (* unknown *)
  | Kwd of char
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>Each token returned by our lexer will be one of the token variant values.
|
||||
An unknown character like '+' will be returned as <tt>Token.Kwd '+'</tt>. If
|
||||
the current token is an identifier, the value will be <tt>Token.Ident s</tt>. If
|
||||
the current token is a numeric literal (like 1.0), the value will be
|
||||
<tt>Token.Number 1.0</tt>.
|
||||
</p>
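<p>As a small illustration of how these variant values are built and taken
apart, the sketch below wraps the type in a <tt>Token</tt> module so that the
qualified names used in the text resolve, and adds a printer. The
<tt>to_string</tt> helper is hypothetical and is not part of the tutorial's
code.</p>

```ocaml
(* The variant from the tutorial, wrapped in a module so that the
   qualified names (Token.Kwd, Token.Ident, ...) resolve. *)
module Token = struct
  type token =
    | Def | Extern
    | Ident of string | Number of float
    | Kwd of char
end

(* to_string is a hypothetical debugging helper. Pattern matching makes
   the compiler check that every variant is handled. *)
let to_string = function
  | Token.Def -> "Def"
  | Token.Extern -> "Extern"
  | Token.Ident s -> Printf.sprintf "Ident %S" s
  | Token.Number n -> Printf.sprintf "Number %g" n
  | Token.Kwd c -> Printf.sprintf "Kwd %C" c

let () =
  (* Tokens one would expect for the input "def f(x) x+1.0". *)
  [ Token.Def; Token.Ident "f"; Token.Kwd '('; Token.Ident "x";
    Token.Kwd ')'; Token.Ident "x"; Token.Kwd '+'; Token.Number 1.0 ]
  |> List.map to_string
  |> String.concat "; "
  |> print_endline
```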
|
||||
|
||||
<p>The actual implementation of the lexer is a collection of functions driven
|
||||
by a function named <tt>Lexer.lex</tt>. The <tt>Lexer.lex</tt> function is
|
||||
called to return the next token from standard input. We will use
|
||||
<a href="http://caml.inria.fr/pub/docs/manual-camlp4/index.html">Camlp4</a>
|
||||
to simplify the tokenization of the standard input. Its definition starts
|
||||
as:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
(*===----------------------------------------------------------------------===
 * Lexer
 *===----------------------------------------------------------------------===*)

let rec lex = parser
  (* Skip any whitespace. *)
  | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>
|
||||
<tt>Lexer.lex</tt> works by recursing over a <tt>char Stream.t</tt> to read
|
||||
characters one at a time from the standard input. It eats them as it recognizes
|
||||
them and stores them in a <tt>Token.token</tt> variant. The first thing that
|
||||
it has to do is ignore whitespace between tokens. This is accomplished with the
|
||||
recursive call above.</p>
|
||||
|
||||
<p>The next thing <tt>Lexer.lex</tt> needs to do is recognize identifiers and
|
||||
specific keywords like "def". Kaleidoscope does this with a pattern match
|
||||
and a helper function.</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
  (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
  | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
      let buffer = Buffer.create 1 in
      Buffer.add_char buffer c;
      lex_ident buffer stream

...

and lex_ident buffer = parser
  | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
      Buffer.add_char buffer c;
      lex_ident buffer stream
  | [< stream=lex >] ->
      match Buffer.contents buffer with
      | "def" -> [< 'Token.Def; stream >]
      | "extern" -> [< 'Token.Extern; stream >]
      | id -> [< 'Token.Ident id; stream >]
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>Numeric values are similar:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
  (* number: [0-9.]+ *)
  | [< ' ('0' .. '9' as c); stream >] ->
      let buffer = Buffer.create 1 in
      Buffer.add_char buffer c;
      lex_number buffer stream

...

and lex_number buffer = parser
  | [< ' ('0' .. '9' | '.' as c); stream >] ->
      Buffer.add_char buffer c;
      lex_number buffer stream
  | [< stream=lex >] ->
      [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
|
||||
</pre>
|
||||
</div>
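<p>One caveat with the numeric rule: it accepts any run of digits and dots, so
the text handed to <tt>float_of_string</tt> may not be a valid number. A
possible guard is sketched below. <tt>parse_number</tt> is a hypothetical
helper, not part of the tutorial's lexer, and it assumes OCaml 4.05 or later
for <tt>float_of_string_opt</tt>.</p>

```ocaml
(* parse_number is a hypothetical helper. float_of_string_opt returns None
   on malformed input instead of raising Failure, which lets the caller
   decide how to report the error. *)
let parse_number (text : string) : (float, string) result =
  match float_of_string_opt text with
  | Some value -> Ok value
  | None -> Error (Printf.sprintf "malformed number %S" text)

let () =
  assert (parse_number "1.0" = Ok 1.0);
  assert (parse_number "42" = Ok 42.0);
  assert (parse_number "1.23.45.67" = Error "malformed number \"1.23.45.67\"")
```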
|
||||
|
||||
<p>This is all pretty straightforward code for processing input. When reading
a numeric value from input, we use the OCaml <tt>float_of_string</tt> function
to convert it to a numeric value that we store in <tt>Token.Number</tt>. Note
that this isn't doing sufficient error checking: it will raise <tt>Failure</tt>
if given the string "1.23.45.67". Feel free to extend it :). Next we handle
|
||||
comments:
|
||||
</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
  (* Comment until end of line. *)
  | [< ' ('#'); stream >] ->
      lex_comment stream

...

and lex_comment = parser
  | [< ' ('\n'); stream=lex >] -> stream
  | [< 'c; e=lex_comment >] -> e
  | [< >] -> [< >]
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>We handle comments by skipping to the end of the line and then returning the
|
||||
next token. Finally, if the input doesn't match one of the above cases, it is
|
||||
either an operator character like '+' or the end of the file. These are handled
|
||||
with this code:</p>
|
||||
|
||||
<div class="doc_code">
|
||||
<pre>
|
||||
  (* Otherwise, just return the character wrapped in a Kwd token. *)
  | [< 'c; stream >] ->
      [< 'Token.Kwd c; lex stream >]

  (* end of stream. *)
  | [< >] -> [< >]
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<p>With this, we have the complete lexer for the basic Kaleidoscope language
|
||||
(the <a href="OCamlLangImpl2.html#code">full code listing</a> for the Lexer is
|
||||
available in the <a href="OCamlLangImpl2.html">next chapter</a> of the
|
||||
tutorial). Next we'll <a href="OCamlLangImpl2.html">build a simple parser that
|
||||
uses this to build an Abstract Syntax Tree</a>. When we have that, we'll
|
||||
include a driver so that you can use the lexer and parser together.
|
||||
</p>
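<p>For readers who want to experiment without Camlp4, the rules above can also
be sketched in plain OCaml. The version below is an assumption-laden
stand-in, not the tutorial's <tt>Lexer.lex</tt>: <tt>tokenize</tt>,
<tt>is_alpha</tt>, <tt>is_digit</tt>, and <tt>span</tt> are hypothetical
names, and it works on a whole string and returns a list instead of consuming
a <tt>char Stream.t</tt> lazily.</p>

```ocaml
module Token = struct
  type token =
    | Def | Extern
    | Ident of string | Number of float
    | Kwd of char
end

let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* tokenize is a hypothetical, string-based stand-in for Lexer.lex. *)
let tokenize (src : string) : Token.token list =
  let len = String.length src in
  (* First index at or after i where pred stops holding (or len). *)
  let rec span pred i =
    if i < len && pred src.[i] then span pred (i + 1) else i
  in
  let rec go i acc =
    if i >= len then List.rev acc
    else
      match src.[i] with
      (* Skip any whitespace. *)
      | ' ' | '\n' | '\r' | '\t' -> go (i + 1) acc
      (* Comment until end of line. *)
      | '#' -> go (span (fun c -> c <> '\n') i) acc
      (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
      | c when is_alpha c ->
          let j = span (fun c -> is_alpha c || is_digit c) i in
          let tok =
            match String.sub src i (j - i) with
            | "def" -> Token.Def
            | "extern" -> Token.Extern
            | id -> Token.Ident id
          in
          go j (tok :: acc)
      (* number: [0-9.]+ ; like the tutorial, this raises Failure on
         malformed text such as 1.23.45.67. *)
      | c when is_digit c ->
          let j = span (fun c -> is_digit c || c = '.') i in
          let text = String.sub src i (j - i) in
          go j (Token.Number (float_of_string text) :: acc)
      (* Otherwise, return the character as a Kwd token. *)
      | c -> go (i + 1) (Token.Kwd c :: acc)
  in
  go 0 []

let () =
  assert (tokenize "def fib(x) # comment\n  x+1.5"
          = Token.[ Def; Ident "fib"; Kwd '('; Ident "x"; Kwd ')';
                    Ident "x"; Kwd '+'; Number 1.5 ])
```

<p>Returning a list keeps the sketch short; a lazy stream, as in the tutorial,
is the better fit for an interactive prompt where input arrives a line at a
time.</p>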
|
||||
|
||||
<a href="OCamlLangImpl2.html">Next: Implementing a Parser and AST</a>
|
||||
</div>
|
||||
|
||||
<!-- *********************************************************************** -->
|
||||
<hr>
|
||||
<address>
|
||||
<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
|
||||
src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
|
||||
<a href="http://validator.w3.org/check/referer"><img
|
||||
src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
|
||||
|
||||
<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
|
||||
<a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a><br>
|
||||
<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
|
||||
Last modified: $Date$
|
||||
</address>
|
||||
</body>
|
||||
</html>
|
1045
docs/tutorial/OCamlLangImpl2.html
Normal file
File diff suppressed because it is too large
1093
docs/tutorial/OCamlLangImpl3.html
Normal file
File diff suppressed because it is too large
1029
docs/tutorial/OCamlLangImpl4.html
Normal file
File diff suppressed because it is too large
1569
docs/tutorial/OCamlLangImpl5.html
Normal file
File diff suppressed because it is too large
1574
docs/tutorial/OCamlLangImpl6.html
Normal file
File diff suppressed because it is too large
1907
docs/tutorial/OCamlLangImpl7.html
Normal file
File diff suppressed because it is too large
48
docs/tutorial/index.html
Normal file
@@ -0,0 +1,48 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
||||
"http://www.w3.org/TR/html4/strict.dtd">
|
||||
<html>
|
||||
<head>
|
||||
<title>LLVM Tutorial: Table of Contents</title>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
||||
<meta name="author" content="Owen Anderson">
|
||||
<meta name="description"
|
||||
content="LLVM Tutorial: Table of Contents.">
|
||||
<link rel="stylesheet" href="../llvm.css" type="text/css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="doc_title"> LLVM Tutorial: Table of Contents </div>
|
||||
|
||||
<ol>
|
||||
<li>Kaleidoscope: Implementing a Language with LLVM
|
||||
<ol>
|
||||
<li><a href="LangImpl1.html">Tutorial Introduction and the Lexer</a></li>
|
||||
<li><a href="LangImpl2.html">Implementing a Parser and AST</a></li>
|
||||
<li><a href="LangImpl3.html">Implementing Code Generation to LLVM IR</a></li>
|
||||
<li><a href="LangImpl4.html">Adding JIT and Optimizer Support</a></li>
|
||||
<li><a href="LangImpl5.html">Extending the language: control flow</a></li>
|
||||
<li><a href="LangImpl6.html">Extending the language: user-defined operators</a></li>
|
||||
<li><a href="LangImpl7.html">Extending the language: mutable variables / SSA construction</a></li>
|
||||
<li><a href="LangImpl8.html">Conclusion and other useful LLVM tidbits</a></li>
|
||||
</ol></li>
|
||||
<li>Kaleidoscope: Implementing a Language with LLVM in Objective Caml
|
||||
<ol>
|
||||
<li><a href="OCamlLangImpl1.html">Tutorial Introduction and the Lexer</a></li>
|
||||
<li><a href="OCamlLangImpl2.html">Implementing a Parser and AST</a></li>
|
||||
<li><a href="OCamlLangImpl3.html">Implementing Code Generation to LLVM IR</a></li>
|
||||
<li><a href="OCamlLangImpl4.html">Adding JIT and Optimizer Support</a></li>
|
||||
<li><a href="OCamlLangImpl5.html">Extending the language: control flow</a></li>
|
||||
<li><a href="OCamlLangImpl6.html">Extending the language: user-defined operators</a></li>
|
||||
<li><a href="OCamlLangImpl7.html">Extending the language: mutable variables / SSA construction</a></li>
|
||||
<li><a href="LangImpl8.html">Conclusion and other useful LLVM tidbits</a></li>
|
||||
</ol></li>
|
||||
<li>Advanced Topics
|
||||
<ol>
|
||||
<li><a href="http://llvm.org/pubs/2004-09-22-LCPCLLVMTutorial.html">Writing
|
||||
an Optimization for LLVM</a></li>
|
||||
</ol></li>
|
||||
</ol>
|
||||
|
||||
</body>
|
||||
</html>
|