Tokenizer Overview

Generally, the usage pattern is:

  1. Set up the Configuration.
  2. Read the tokens.
  3. Parse the tokens into a Program.
  4. Apply transformations, if applicable.

Code snippets

Queue<Token> tokens = TokenReader.tokenize(config.sourceFile);

The token list uses a loose, compiler-style notion of a token: numbers, end-of-line markers (they're significant), AppleSoft tokens, strings, comments, identifiers, etc.

Parser parser = new Parser(tokens);
Program program = parser.parse();

The Program is now the parsed version of the BASIC program. Various Visitors may be used to report, gather information, or manipulate the tree in various ways.

Configuration config = Configuration.builder()

The Configuration class also allows the BASIC start address (defaults to 0x801) and the maximum line length (in bytes; defaults to 255, but feel free to experiment) to be set. Some of the classes report output via the debug stream, which defaults to a simple null stream (no output); replace it with System.out or another PrintStream.

ByteVisitor byteVisitor = Visitors.byteVisitor(config);
byte[] programData = byteVisitor.dump(program);

Finally, the ByteVisitor will transform the program into the tokenized form.


The framework allows embedding of directives.


NOTE: It appears that DOS 3.3 rewrites the resulting application and messes up the linked list of lines. ProDOS does not.

$embed allows a binary to be embedded within the resulting application and, optionally, moved to a destination in memory. Please note that once the application is loaded on the Apple II, the program cannot be altered; doing so will crash the computer.


  • file=<string>, required. Specifies the file to load.
  • moveto=<addr>, optional. If provided, generates code to move binary to destination. Automatically CALLed.
  • var=<variable>, optional. If provided, address is assigned to variable specified.

Note that the current parser does not handle hex literals at all. To supply a hex address, provide it as a string that starts with a $ or 0x prefix.
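As a minimal sketch of how such an address string might be interpreted (this is not the bastools code; the AddressStrings class and parseAddress method are hypothetical, and the decimal fallback is an assumption):

```java
// Hypothetical helper, not part of bastools: turns an address string into an int.
final class AddressStrings {
    /** Accepts "$0260" or "0x0260" as hex; anything else is assumed to be decimal. */
    static int parseAddress(String text) {
        String s = text.trim();
        if (s.startsWith("$")) {
            return Integer.parseInt(s.substring(1), 16);
        }
        if (s.startsWith("0x") || s.startsWith("0X")) {
            return Integer.parseInt(s.substring(2), 16);
        }
        return Integer.parseInt(s); // assumption: plain digits mean decimal
    }

    public static void main(String[] args) {
        System.out.println(parseAddress("0x0260")); // prints 608
        System.out.println(parseAddress("$0260"));  // prints 608
    }
}
```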

Usage example:

5 $embed file="read.time.bin", moveto="0x0260"

The $embed directive must be last on the line (if there are comments, be sure to use the REMOVE_REM_STATEMENTS optimization).

From the circles-timing.bas sample, this is the beginning of the program:

0801:9A 09 00 00 8C 32 30 36 32 3A AB 31 00 A9 2B 85
     \___/ \___/ \____________/    \___/    \_______...
     Ptr, Line 0, CALL 2062,    :, GOTO 1,   Assembly code...     
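To make the layout concrete, the following sketch decodes the bytes shown above. It is illustrative only (the LineHeaderDemo class is hypothetical) and assumes the usual tokenized layout: a little-endian pointer to the next line, a little-endian line number, the statements, and a 0x00 terminator. The byte after the terminator lands at 2062, which matches the CALL 2062 in the line.

```java
// Illustrative only: decodes the first BASIC line from the dump above.
final class LineHeaderDemo {
    static final int LOAD_ADDRESS = 0x0801;
    static final int[] DUMP = {
        0x9A, 0x09, 0x00, 0x00, 0x8C, 0x32, 0x30, 0x36,
        0x32, 0x3A, 0xAB, 0x31, 0x00, 0xA9, 0x2B, 0x85
    };

    /** Reads a 16-bit little-endian word at the given offset. */
    static int word(int[] bytes, int offset) {
        return bytes[offset] | (bytes[offset + 1] << 8);
    }

    /** Offset of the 0x00 byte that terminates the first line. */
    static int endOfLine(int[] bytes) {
        int i = 4; // skip the next-line pointer and the line number
        while (bytes[i] != 0x00) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        int next = word(DUMP, 0);
        int line = word(DUMP, 2);
        int codeStart = LOAD_ADDRESS + endOfLine(DUMP) + 1;
        // prints: next line at $099A, line number 0, code at 2062
        System.out.printf("next line at $%04X, line number %d, code at %d%n", next, line, codeStart);
    }
}
```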

The move code is based on what Beagle Bros put into their Peeks, Pokes, and Pointers poster. (See Memory Move under the Useful Calls; the CALL -468 entry.)

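LDA #<embeddedStart
STA $3C              ; A1L: source start, low byte
LDA #>embeddedStart
STA $3D              ; A1H: source start, high byte
LDA #<embeddedEnd
STA $3E              ; A2L: source end, low byte
LDA #>embeddedEnd
STA $3F              ; A2H: source end, high byte
LDA #<targetAddress
STA $42              ; A4L: destination, low byte
LDA #>targetAddress
STA $43              ; A4H: destination, high byte
LDY #0               ; the Monitor MOVE routine expects Y = 0
JMP $FE2C            ; Monitor MOVE (the CALL -468 entry)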
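As a rough sketch of how such a stub could be assembled into bytes (this is not the bastools implementation; the MoveStub class and its build method are hypothetical), the Java below emits LDA/STA pairs for the Monitor MOVE parameters, assumed here to be A1 at $3C/$3D, A2 at $3E/$3F and A4 at $42/$43, and ends with a jump to $FE2C (CALL -468). Under those assumptions the stub is 29 bytes long, which is consistent with the dump above: code at 2062 ($080E) plus 29 bytes puts the embedded data at $082B, and the dump's code begins with A9 2B.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: builds 6502 machine code for a memory-move stub.
final class MoveStub {
    /**
     * start/end delimit the embedded binary (end is assumed inclusive, as with
     * the Monitor MOVE routine); target is the destination address.
     */
    static byte[] build(int start, int end, int target) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        loadAndStore(out, start & 0xFF, 0x3C);         // A1L
        loadAndStore(out, (start >> 8) & 0xFF, 0x3D);  // A1H
        loadAndStore(out, end & 0xFF, 0x3E);           // A2L
        loadAndStore(out, (end >> 8) & 0xFF, 0x3F);    // A2H
        loadAndStore(out, target & 0xFF, 0x42);        // A4L
        loadAndStore(out, (target >> 8) & 0xFF, 0x43); // A4H
        out.write(0xA0); out.write(0x00);              // LDY #0
        out.write(0x4C); out.write(0x2C); out.write(0xFE); // JMP $FE2C
        return out.toByteArray();
    }

    /** Emits LDA #value (A9 nn) followed by STA zeroPage (85 zz). */
    private static void loadAndStore(ByteArrayOutputStream out, int value, int zeroPage) {
        out.write(0xA9); out.write(value);
        out.write(0x85); out.write(zeroPage);
    }

    public static void main(String[] args) {
        // The end address here is made up for the demonstration.
        byte[] stub = build(0x082B, 0x0850, 0x0260);
        System.out.println(stub.length); // prints 29
    }
}
```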


$shape will generate a shape table based either on the source (src=) or binary (bin=) shape table provided. Source shape table generation is based on the shape table support in the st tool, whose documentation describes the source format in more detail.

Overall format is as follows:

$shape ( src="path" [ ,label=variable | ,assign=(varname1="label1" [,varname2="label2"]*) ]
       | bin="path" )
       [,poke=yes|no] [,address=variable] [,init=yes|no]

Shape from source

By using the src= option, the shape table is generated from shape source on the fly. For example, the following shape source will insert a shape named "mouse" into the BASIC program:

; extracted from NEW MOUSE

.bitmap mouse

Options on the source include:

  • label=variable indicates that each label is really a variable name; in the example, the variable name would be "MOUSE".
  • assign=(...) defines a mapping from the label in the source to the BASIC variable name. For example, assign=(m="mouse") defines the variable M to be the shape number for the mouse.

Shape from binary

By using the bin= option, an already existing binary shape table can be inserted into the code. There are no additional options available in this case.

General options

  • poke=yes|no (default=yes) will embed a POKE 232,<lowAddr>:POKE 233,<highAddr> into the line of code.
  • address=<variable>, if supplied, will assign the address to a variable; for example, address=AD will embed the variable AD into the line of code.
  • init=yes|no (default=yes) will embed a simple ROT=0:SCALE=1 into the line of code for simple shape initialization.
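The poke option amounts to splitting the shape table address into its low and high bytes. A minimal sketch, assuming a hypothetical ShapePokes helper that is not part of bastools:

```java
// Hypothetical helper: formats the POKEs that point Applesoft at a shape table.
final class ShapePokes {
    /** Splits a 16-bit address into low/high bytes for locations 232 and 233. */
    static String pokeStatement(int address) {
        int low = address & 0xFF;
        int high = (address >> 8) & 0xFF;
        return "POKE 232," + low + ":POKE 233," + high;
    }

    public static void main(String[] args) {
        // The address is made up for the demonstration.
        System.out.println(pokeStatement(0x0900)); // prints POKE 232,0:POKE 233,9
    }
}
```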


If embedding hexadecimal addresses into an application makes sense, the $hex directive allows that to be done in a rudimentary manner.


For example, the line

10 call $hex value="fc58"

is equivalent to

10 call -936
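The arithmetic behind that conversion can be sketched as follows. This is an assumption about the rule rather than the actual bastools code (the HexDirective class is hypothetical): values above 0x7FFF are written as the equivalent negative number by subtracting 65536, which reproduces the fc58 to -936 example.

```java
// Hypothetical sketch of the hex-to-decimal rule suggested by the example.
final class HexDirective {
    /** Converts a 16-bit hex string to the signed form used by CALL -936 and friends. */
    static int toSigned16(String hex) {
        int value = Integer.parseInt(hex, 16);
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("not a 16-bit value: " + hex);
        }
        return value > 0x7FFF ? value - 0x10000 : value;
    }

    public static void main(String[] args) {
        System.out.println(toSigned16("fc58")); // prints -936
        System.out.println(toSigned16("fe2c")); // prints -468
    }
}
```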


Optimizations are mechanisms to rewrite the Program, typically making the program smaller. Optimization itself is an enum which has a create method to set up the Visitor.

Current optimizations are:

  • Remove empty statements will remove all extra colons, for example when the application in question used : to indicate nesting, or when they are just accidents.
  • Remove REM statements will remove all comments.
  • Extract constant values will find all constant numerical references, insert a line 0 with assignments, and finally replace all the numbers with the appropriate variable name. The hypothesis is that the BASIC interpreter then parses each number only once (in the line 0 assignment) rather than every time it is used.
  • Merge lines will identify all lines that are not the target of a GOTO/GOSUB-type action and rewrite each such line by merging it with others. The concept involved is that the BASIC program is just a linked list and shortening the list will shorten the search path. The default max length in bytes is set to 255.
  • Renumber will renumber the application, beginning with line 0. This makes the decoding a tiny bit more efficient in that the number to decode will be smaller in the token stream.
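To illustrate the idea behind Merge lines, here is a toy sketch. It is not the bastools Visitor: the MergeLinesSketch class is hypothetical, it works on plain statement text, it measures length in characters rather than tokenized bytes, and it ignores statements such as IF or REM that would make a real merge unsafe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only: merges lines that are not GOTO/GOSUB targets.
final class MergeLinesSketch {
    /** A simplified line: a line number plus untokenized statement text. */
    static final class Line {
        final int number;
        final String text;

        Line(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /** Appends each non-target line to the previous line while it still fits. */
    static List<Line> merge(List<Line> lines, Set<Integer> targets, int maxLength) {
        List<Line> result = new ArrayList<>();
        for (Line line : lines) {
            if (!result.isEmpty() && !targets.contains(line.number)) {
                Line previous = result.get(result.size() - 1);
                String combined = previous.text + ":" + line.text;
                if (combined.length() <= maxLength) {
                    result.set(result.size() - 1, new Line(previous.number, combined));
                    continue;
                }
            }
            result.add(line); // first line, a jump target, or too long to merge
        }
        return result;
    }

    public static void main(String[] args) {
        List<Line> program = List.of(
                new Line(10, "A=1"), new Line(20, "B=2"),
                new Line(30, "PRINT A+B"), new Line(40, "GOTO 30"));
        for (Line line : merge(program, Set.of(30), 255)) {
            System.out.println(line.number + " " + line.text);
        }
        // prints:
        // 10 A=1:B=2
        // 30 PRINT A+B:GOTO 30
    }
}
```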

Sample use:

program = program.accept(Optimization.REMOVE_REM_STATEMENTS.create(config));