mirror of
https://github.com/sehugg/8bitworkshop.git
synced 2024-09-29 06:55:37 +00:00
# BASIC Compiler Internals

If you want to know more about the internals of a BASIC compiler written in TypeScript, then read on.

## Tokenizer

The tokenizer is powered by one huge gnarly regular expression.
Each token type is a separate capture group, and we just look for the first one that matched.
Here's a sample of the regex:

~~~
    comment     identifier      string
... (['].*) | ([A-Z_]\w*[$]?) | (".*?") ...
~~~
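
To make the capture-group idea concrete, here's a minimal sketch of a line tokenizer built the same way. The token kinds, the `number` and `operator` groups, and names like `tokenizeLine` are made up for illustration; the real regex is larger and handles more cases.

```typescript
// Hypothetical token kinds, listed in the same order as the capture groups.
type TokenKind = "comment" | "identifier" | "string" | "number" | "operator";
const KINDS: TokenKind[] = ["comment", "identifier", "string", "number", "operator"];

// One capture group per token kind. The first three groups come from the
// sample above; the number and operator groups are assumptions.
const TOKEN_RE = /(['].*)|([A-Z_]\w*[$]?)|(".*?")|(\d+(?:\.\d+)?)|([-+*\/=<>(),;:])/gi;

interface LineToken {
  kind: TokenKind;
  text: string;
  column: number; // 0-based offset within the line
}

function tokenizeLine(line: string): LineToken[] {
  const tokens: LineToken[] = [];
  TOKEN_RE.lastIndex = 0; // the regex is global, so reset it for each line
  let m: RegExpExecArray | null;
  while ((m = TOKEN_RE.exec(line)) !== null) {
    // The first capture group that matched tells us the token kind.
    for (let i = 0; i < KINDS.length; i++) {
      const text = m[i + 1];
      if (text !== undefined) {
        tokens.push({ kind: KINDS[i], text, column: m.index });
        break;
      }
    }
  }
  return tokens;
}

const sampleTokens = tokenizeLine("PRINT A$;\"HI\" ' note");
console.log(sampleTokens.map((t) => t.kind + ":" + t.text).join(" | "));
```

Characters that match no group (whitespace, in this sketch) are simply skipped, because `exec()` scans forward to the next match.
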

In some tokenizers, like the one in Microsoft BASIC, each keyword supported by the language is matched individually,
so whitespace is not required around keywords.
For example, `FORI=ATOB` would be matched as `FOR I = A TO B`.
This was sometimes called "crunching."
We have a special case in the tokenizer to enable this for these dialects.
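
Here's a rough sketch of how keyword-by-keyword matching might work. It is not the project's actual special case: the keyword list is a tiny made-up sample, and `splitCrunched` and `crunchLine` are hypothetical names.

```typescript
// A tiny, hypothetical keyword list; a real dialect has many more.
const KEYWORDS = ["FOR", "TO", "NEXT", "STEP", "PRINT", "GOTO", "IF", "THEN"];

// Return the keyword that starts at `pos`, if any.
function keywordAt(word: string, pos: number): string | undefined {
  for (const k of KEYWORDS) {
    if (word.slice(pos, pos + k.length) === k) return k;
  }
  return undefined;
}

// Split a run such as "FORI" or "ATOB" into keywords and identifiers.
function splitCrunched(word: string): string[] {
  const parts: string[] = [];
  let pending = ""; // identifier characters seen since the last keyword
  let pos = 0;
  while (pos < word.length) {
    const kw = keywordAt(word, pos);
    if (kw !== undefined) {
      if (pending !== "") parts.push(pending);
      pending = "";
      parts.push(kw);
      pos += kw.length;
    } else {
      pending += word[pos];
      pos++;
    }
  }
  if (pending !== "") parts.push(pending);
  return parts;
}

function crunchLine(line: string): string[] {
  const out: string[] = [];
  // Identifier-like runs get split on keywords; numbers and other
  // non-space characters are passed through as they are.
  const runs = line.toUpperCase().match(/[A-Z][A-Z0-9]*[$]?|\d+|\S/g) || [];
  for (const run of runs) {
    if (/^[A-Z]/.test(run)) {
      for (const part of splitCrunched(run)) out.push(part);
    } else {
      out.push(run);
    }
  }
  return out;
}

console.log(crunchLine("FORI=ATOB").join(" ")); // FOR I = A TO B
```

The flip side of this scheme is that a variable name containing a keyword gets split too: in this sketch `TOTAL` comes out as `TO` followed by `TAL`.
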

The tokenizer also has special cases for `REM`, `DATA`, and `OPTION`, which require that tokens be left untouched
(and, in the case of `DATA`, that whitespace be preserved).

Since BASIC is a line-oriented language, the tokenizer operates on one line at a time,
and each line is then fed to the parser.
For block-oriented languages, we'd probably want to tokenize the entire file before the parsing stage.

## Parser

The parser is a hand-coded recursive descent parser.
Why?
Neither `yacc` nor `bison` existed when BASIC was invented, so the language was not designed for these tools.
In fact, BASIC is a little gnarly when you get into the details, so having a bit of control is nice,
and error messages can be more informative.
Both clang and gcc use recursive descent parsers, so it can't be that bad, right?

The program is parsed one line at a time.
After line tokenization, the tokens go into an array.
We can consume tokens (remove them from the list), peek at tokens (check the next token without removing it), and push back tokens (return them to the list).
We don't have to check for `null`; we will always get the EOL (end-of-line) empty-string token if we run out.
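
A minimal sketch of such a token list follows, assuming a simple token shape. `Tok`, `TokenStream`, and the `expect()` helper are illustrative names, not the parser's real API.

```typescript
interface Tok {
  str: string;
  column: number;
}

// The empty string marks end-of-line, so callers never see null.
const EOL: Tok = { str: "", column: -1 };

class TokenStream {
  private tokens: Tok[];

  constructor(tokens: Tok[]) {
    this.tokens = tokens.slice(); // copy, so the caller's array is untouched
  }

  // Look at the next token without removing it.
  peek(): Tok {
    return this.tokens.length > 0 ? this.tokens[0] : EOL;
  }

  // Remove and return the next token, or EOL if we ran out.
  consume(): Tok {
    const t = this.tokens.shift();
    return t !== undefined ? t : EOL;
  }

  // Return a token to the front of the list.
  pushback(tok: Tok): void {
    if (tok !== EOL) this.tokens.unshift(tok);
  }

  // Hypothetical helper: consume a token and complain if it isn't the one we want.
  expect(str: string): Tok {
    const t = this.consume();
    if (t.str !== str) {
      throw new Error(`Expected "${str}" but got "${t.str}" at column ${t.column}`);
    }
    return t;
  }
}

const demoStream = new TokenStream([
  { str: "LET", column: 0 },
  { str: "A", column: 4 },
]);
demoStream.expect("LET");
console.log(demoStream.peek().str); // A
```

Because running out of tokens yields the EOL token instead of `null`, a check like `peek().str === "THEN"` works anywhere in the line.
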

Expressions are parsed with an [operator-precedence parser](https://en.wikipedia.org/wiki/Operator-precedence_parser#Pseudocode), which isn't really that complicated.
We also infer types at this stage (number or string).
We have a list of function signatures, and we know that "$" means a string variable, so we can check types.
The tricky part is that `FNA(X)` is a user-defined function, while `INT(X)` is a built-in function, and `I1(X)` could be a dimensioned array.
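
Here's a small sketch of precedence climbing with number/string type inference. The node shapes, the precedence table, and the rule that `+` concatenates strings are all assumptions for illustration; function calls and arrays are left out.

```typescript
type ValueType = "number" | "string";

type Expr =
  | { kind: "literal"; value: number | string; valtype: ValueType }
  | { kind: "var"; ident: string; valtype: ValueType }
  | { kind: "binop"; op: string; left: Expr; right: Expr; valtype: ValueType };

// Hypothetical precedence table; a higher number binds tighter.
const PRECEDENCE: { [op: string]: number } = {
  "=": 1, "<": 1, ">": 1, "+": 2, "-": 2, "*": 3, "/": 3,
};

class ExprParser {
  private pos = 0;
  constructor(private toks: string[]) {}

  private peek(): string {
    return this.pos < this.toks.length ? this.toks[this.pos] : "";
  }
  private next(): string {
    const t = this.peek();
    if (this.pos < this.toks.length) this.pos++;
    return t;
  }

  private parsePrimary(): Expr {
    const t = this.next();
    if (t === "(") {
      const inner = this.parseExpr();
      if (this.next() !== ")") throw new Error('Expected ")"');
      return inner;
    }
    if (/^\d/.test(t)) return { kind: "literal", value: parseFloat(t), valtype: "number" };
    if (/^"/.test(t)) return { kind: "literal", value: t.slice(1, -1), valtype: "string" };
    // A trailing "$" marks a string variable.
    if (/^[A-Z]/i.test(t)) return { kind: "var", ident: t, valtype: /\$$/.test(t) ? "string" : "number" };
    throw new Error(`Unexpected token "${t}"`);
  }

  private combine(op: string, left: Expr, right: Expr): Expr {
    if (left.valtype !== right.valtype) throw new Error(`Type mismatch around "${op}"`);
    const relational = op === "=" || op === "<" || op === ">";
    if (left.valtype === "string" && !relational && op !== "+") {
      throw new Error(`"${op}" does not apply to strings`);
    }
    return { kind: "binop", op, left, right, valtype: relational ? "number" : left.valtype };
  }

  parseExpr(minPrec: number = 1): Expr {
    let left = this.parsePrimary();
    for (;;) {
      const op = this.peek();
      const isOp = Object.prototype.hasOwnProperty.call(PRECEDENCE, op);
      if (!isOp || PRECEDENCE[op] < minPrec) return left;
      this.next();
      // prec + 1 makes every operator left-associative.
      const right = this.parseExpr(PRECEDENCE[op] + 1);
      left = this.combine(op, left, right);
    }
  }
}

const tree = new ExprParser(["1", "+", "2", "*", "3"]).parseExpr();
console.log(JSON.stringify(tree));
```

Since every node carries a `valtype`, a mismatch like `A$ + 1` is caught at the operator that joins the two sides.
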

Tokens carry their source code location with them, so we can assign a precise source code location to each statement.
This is used for error messages and for debugging.

The compiler generates an AST (Abstract Syntax Tree) and not any kind of VM bytecode.
The top level of the AST is a list of statements, and an associated mapping of labels (line numbers) to statements.
AST nodes must refer to other nodes by index, not by reference, because the worker serializes the AST with `JSON.stringify()` to transfer it to the main thread.
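
To illustrate why indices survive the trip and object references don't, here's a small sketch. The `AstStmt` and `AstProgram` shapes are invented for this example and are much simpler than the real AST.

```typescript
interface AstStmt {
  command: string;
  label?: string;  // line number that starts this statement, if any
  target?: string; // label this statement jumps to (for GOTO)
}

interface AstProgram {
  stmts: AstStmt[];
  labels: { [label: string]: number }; // label -> index into stmts
}

function buildProgram(stmts: AstStmt[]): AstProgram {
  const labels: { [label: string]: number } = {};
  stmts.forEach((stmt, index) => {
    if (stmt.label !== undefined) labels[stmt.label] = index;
  });
  return { stmts, labels };
}

// Resolve a jump by going label -> index -> statement.
function jumpTarget(program: AstProgram, stmt: AstStmt): AstStmt | undefined {
  if (stmt.target === undefined) return undefined;
  const index = program.labels[stmt.target];
  return index === undefined ? undefined : program.stmts[index];
}

const original = buildProgram([
  { label: "10", command: "PRINT" },
  { label: "20", command: "GOTO", target: "10" },
]);

// Simulate the worker -> main thread transfer.
const received: AstProgram = JSON.parse(JSON.stringify(original));
const landed = jumpTarget(received, received.stmts[1]);
console.log(landed !== undefined ? landed.command : "no target");

// By contrast, object identity is lost: one shared object becomes two copies.
const sharedStmt = { command: "END" };
const roundTripped = JSON.parse(JSON.stringify({ a: sharedStmt, b: sharedStmt }));
console.log(roundTripped.a === roundTripped.b); // false
```

An index is just a number, so it means the same thing on both sides of the transfer.
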

## Runtime

The runtime interprets the AST generated by the compiler.
It compiles each statement (PRINT, LET, etc.) into JavaScript.
The method `expr2js()` converts expression trees to JavaScript, and `assign2js()` handles assignments like `LET`, `READ` and `INPUT`.
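
As a rough idea of what turning an expression tree into JavaScript can look like, here's a toy converter. It is not the real `expr2js()`: the `ExprTree` shape, the `exprToJS` name, and the `vars` lookup object are assumptions, and it only handles numbers.

```typescript
type ExprTree =
  | { kind: "num"; value: number }
  | { kind: "var"; ident: string }
  | { kind: "binop"; op: "+" | "-" | "*" | "/"; left: ExprTree; right: ExprTree };

// Emit JavaScript source text for an expression tree.
function exprToJS(e: ExprTree): string {
  switch (e.kind) {
    case "num":
      return JSON.stringify(e.value);
    case "var":
      // Variables live in a `vars` object supplied when the code runs.
      return `vars[${JSON.stringify(e.ident)}]`;
    case "binop":
      // Parenthesize everything so the tree's structure decides precedence.
      return `(${exprToJS(e.left)} ${e.op} ${exprToJS(e.right)})`;
  }
}

// A*2+1 as a tree
const sampleExpr: ExprTree = {
  kind: "binop",
  op: "+",
  left: {
    kind: "binop",
    op: "*",
    left: { kind: "var", ident: "A" },
    right: { kind: "num", value: 2 },
  },
  right: { kind: "num", value: 1 },
};

const jsSource = exprToJS(sampleExpr);
console.log(jsSource); // ((vars["A"] * 2) + 1)

// Turn the generated source into a callable function.
const compiled = new Function("vars", "return " + jsSource + ";") as (
  vars: { [ident: string]: number }
) => number;
console.log(compiled({ A: 5 })); // 11
```

The generated function can be reused each time the statement runs, so the tree only has to be walked once.
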

One statement is executed every time step.
There's a "program counter", which is the index of the next-to-run Statement node in the list.

Early BASICs were compiled languages, but the most popular BASICs for microcomputers were tokenized and interpreted.
There are subtle differences between the two.
For example, interpreted BASIC supports NEXT statements without a variable,
which always jump back to the most recent FOR even if you GOTO a different NEXT.
This requires the runtime to maintain a stack of FOR loops.
Compiled BASIC dialects will verify loop structures at compile time.
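
Here's a stripped-down sketch of a statement loop with a program counter and a FOR stack. The statement shapes and `runProgram` are hypothetical; STEP is fixed at 1, and the loop body always runs at least once, which is an assumption of this sketch and not a claim about any particular dialect.

```typescript
type LoopStmt =
  | { cmd: "FOR"; ident: string; from: number; to: number }
  | { cmd: "NEXT"; ident?: string }
  | { cmd: "PRINT"; ident: string };

interface ForFrame {
  ident: string;     // loop variable
  limit: number;     // final value
  bodyIndex: number; // index of the first statement in the loop body
}

function runProgram(stmts: LoopStmt[]): string[] {
  const vars: { [ident: string]: number } = {};
  const forStack: ForFrame[] = [];
  const output: string[] = [];
  let pc = 0; // program counter: index of the next statement to run

  while (pc < stmts.length) {
    const stmt = stmts[pc++];
    switch (stmt.cmd) {
      case "FOR":
        vars[stmt.ident] = stmt.from;
        forStack.push({ ident: stmt.ident, limit: stmt.to, bodyIndex: pc });
        break;
      case "NEXT": {
        if (stmt.ident !== undefined) {
          // NEXT with a variable: drop inner loops until we find that variable.
          while (forStack.length > 0 && forStack[forStack.length - 1].ident !== stmt.ident) {
            forStack.pop();
          }
        }
        // NEXT without a variable: whatever FOR is on top of the stack.
        if (forStack.length === 0) throw new Error("NEXT without FOR");
        const frame = forStack[forStack.length - 1];
        vars[frame.ident] += 1;
        if (vars[frame.ident] <= frame.limit) {
          pc = frame.bodyIndex; // jump back into the loop body
        } else {
          forStack.pop(); // loop finished
        }
        break;
      }
      case "PRINT":
        output.push(String(vars[stmt.ident]));
        break;
    }
  }
  return output;
}

console.log(
  runProgram([
    { cmd: "FOR", ident: "I", from: 1, to: 2 },
    { cmd: "FOR", ident: "J", from: 1, to: 2 },
    { cmd: "PRINT", ident: "J" },
    { cmd: "NEXT" },
    { cmd: "NEXT" },
  ]).join(",")
); // 1,2,1,2
```

Neither `NEXT` names a variable, so which loop each one closes is decided purely by the stack at run time.
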

For INPUT commands, the runtime calls the `input()` method, which returns a Promise.
The IDE overrides this method to show a text field to the user, and resolves the Promise when data is entered.
The runtime might call `input()` multiple times until valid data is entered.
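
The retry behavior can be sketched like this. Only the `input()` name comes from the description above; the class names, the `inputNumber()` helper, the string result type, and the retry prompt are assumptions made for the example.

```typescript
class SketchRuntime {
  // The host (an IDE, for example) is expected to override this method.
  input(prompt: string): Promise<string> {
    return Promise.reject(new Error("input() not overridden; prompt was: " + prompt));
  }

  // Keep asking until the response parses as a number.
  async inputNumber(prompt: string): Promise<number> {
    for (;;) {
      const text = (await this.input(prompt)).trim();
      const value = Number(text);
      if (text !== "" && !isNaN(value)) return value;
      prompt = "Please enter a number: "; // hypothetical retry prompt
    }
  }
}

// Stand-in for the IDE: answers come from a scripted list, not a text field.
class ScriptedRuntime extends SketchRuntime {
  calls = 0;
  constructor(private answers: string[]) {
    super();
  }
  input(_prompt: string): Promise<string> {
    this.calls++;
    const answer = this.answers.shift();
    return Promise.resolve(answer !== undefined ? answer : "0");
  }
}

const scripted = new ScriptedRuntime(["abc", "", "42"]);
scripted.inputNumber("? ").then((n) => {
  console.log(n, "after", scripted.calls, "calls");
});
```

Because `input()` returns a Promise, the runtime can wait for the user without blocking, and asking again is just another `await`.
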

The compiler and runtime are each about [1300 lines of TypeScript](https://github.com/sehugg/8bitworkshop/tree/master/src/common/basic),
excluding the definitions of the BASIC dialects.
They're tested with a [test suite](https://github.com/sehugg/nbs-ecma55-test)
and with a [coverage-guided fuzzer](https://github.com/fuzzitdev/jsfuzz).