6502bench SourceGen: Code Generation & Assembly
SourceGen can generate an assembly source file that, when fed into the target assembler, will recreate the original data file exactly. Every assembler is different, so support must be added to SourceGen for each.
The generation / assembly dialog can be opened with File > Assemble.
Supported Assemblers
SourceGen currently supports the following cross-assemblers:
- 64tass v1.53.1515 or later
- ACME v0.96.4 or later
- cc65 v2.17 or v2.18
- Merlin 32 v1.0
Version-Specific Code Generation
Code generation must be tailored to the specific version of the assembler. This is most easily understood with an example.
If you write MVN $01,$02, the assembler is expected to output 54 02 01, with the arguments reversed. cc65 v2.17 doesn't do that; this is a bug that was fixed in a later version. So if you're generating code for v2.17, you want to create source code with the arguments the wrong way around.
Having version-dependent source code is a bad idea, so SourceGen just outputs raw hex bytes for MVN/MVP instructions. This yields the correct code for all versions of the assembler, but is ugly and annoying. So we want to output actual MVN/MVP instructions when producing code for newer versions of the assembler.
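The choice between raw bytes and a real instruction can be sketched as below. This is a minimal illustration, not SourceGen's actual generator: the function name, the version tuple, and the output syntax are assumptions, and the (2, 18) threshold is taken from the cc65 bug list later in this document.

```python
# Opcodes for the 65816 block-move instructions.
OPCODES = {"MVN": 0x54, "MVP": 0x44}

# Assumption: the argument-order bug is fixed from cc65 v2.18 onward.
FIXED_IN = (2, 18)

def format_block_move(mnemonic, src_bank, dst_bank, asm_version=None):
    """Return one line of source for MVN/MVP.

    asm_version is a tuple such as (2, 17), or None when no assembler is
    configured (treated as "latest version").
    """
    if asm_version is None or asm_version >= FIXED_IN:
        # Newer assembler: emit the instruction as the programmer would write it.
        return f"{mnemonic} ${src_bank:02x},${dst_bank:02x}"
    # Older assembler: emit raw bytes, destination bank first, as in "54 02 01".
    opcode = OPCODES[mnemonic]
    return f".byte ${opcode:02x},${dst_bank:02x},${src_bank:02x}"

print(format_block_move("MVN", 1, 2, (2, 17)))
print(format_block_move("MVN", 1, 2, (2, 18)))
```

The raw-byte form assembles identically on every version, which is why it is the safe fallback for the older assembler.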
When you configure a cross-assembler, SourceGen runs the executable with version query args, and extracts the version information from the output stream. This is used by the generator to ensure that the output will compile. If no assembler is configured, SourceGen will produce code optimized for the latest version of the assembler.
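Extracting a version number from the assembler's output might look something like the sketch below. The function name and the sample output strings are hypothetical; real assemblers print their versions in different formats, so an actual implementation would likely need per-assembler handling.

```python
import re

def parse_version(output):
    """Find the first dotted version number in the text.

    Returns a tuple of ints such as (2, 18) or (1, 53, 1515), or None if
    nothing that looks like a version is present.
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)

# Hypothetical sample banners, for illustration only.
print(parse_version("ca65 V2.18 - Git 1234abc"))
print(parse_version("64tass Turbo Assembler Macro V1.53.1515"))
```

Tuples compare element by element, so a check like `parse_version(text) >= (2, 18)` behaves the way a version comparison should.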
Generating Source Code
Cross-assemblers tend to generate additional files, either compiler intermediaries ("file.o") or metadata ("_FileInformation.txt"). Some generators may produce multiple source files, perhaps a link script or symbol definition header to go with the assembly source. To avoid spreading files across the filesystem, SourceGen does all of its work in the same directory where the project lives. The project must therefore have a directory before code can be generated, which is why you can't assemble a project until you've saved it for the first time.
The Generate and Assemble dialog has a drop-down list near the top that lets you pick which assembler to target. The name of the assembler will be shown with the detected version number. If the assembler executable isn't configured, "[latest version]" will be shown instead of a version number.
The Settings button will take you directly to the assembler configuration tab in the application settings dialog.
Hit the Generate button to generate the source code into a file on disk. The file will use the project name, with the ".dis65" replaced by "_<assembler>.S".
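The naming rule can be sketched as follows. The function name and the assembler identifier strings (e.g. "cc65") are assumptions made for illustration; the exact identifiers SourceGen uses aren't listed here.

```python
from pathlib import Path

def generated_source_path(project_path, assembler_id):
    """Replace the ".dis65" extension with "_<assembler>.S"."""
    project = Path(project_path)
    if project.suffix != ".dis65":
        raise ValueError("expected a .dis65 project file")
    # The output stays in the project's own directory.
    return project.with_name(f"{project.stem}_{assembler_id}.S")

print(generated_source_path("work/GAME.BIN.dis65", "cc65").name)
```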
The first 64KiB of each generated file will be shown in the preview window. If multiple files were generated, you can use the "preview file" drop-down to select between them. Line numbers are prepended to each line to make it easier to track down errors.
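A rough sketch of building such a preview is shown below. It works on a byte string rather than a file to stay self-contained, and the function name, the UTF-8 decoding, and the number formatting are assumptions, not a description of the real preview window.

```python
PREVIEW_LIMIT = 64 * 1024  # 64KiB

def build_preview(data, limit=PREVIEW_LIMIT):
    """Return the first `limit` bytes as text with line numbers prepended."""
    text = data[:limit].decode("utf-8", errors="replace")
    lines = text.splitlines()
    # Right-align the numbers so the source text lines up.
    width = len(str(len(lines)))
    return "\n".join(
        f"{number:>{width}}  {line}" for number, line in enumerate(lines, 1)
    )

print(build_preview(b"        .org $1000\nstart   lda #$00\n"))
```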
Label Localizer
The label localizer is an optional feature that automatically converts some labels to an assembler-specific less-than-global label format. Local labels may be reusable (e.g. using "]LOOP" for multiple consecutive loops is easier to understand than giving each one a unique label) or reduce the size of a generated link table. There are usually restrictions on local labels, e.g. references to them may not be allowed to cross a global label definition, which the localizer factors in automatically.
The localizer is somewhat experimental at this time, and can be disabled from the application settings.
Cross-Assembling Generated Code
After generating sources, if you have a cross-assembler executable configured, you can run it by clicking the "Run Assembler" button. The command-line output will be displayed, with stdout and stderr separated. (I'd prefer them to be interleaved, but that's not what the system provides.)
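For illustration, capturing the two streams separately can be done along these lines. The function name is hypothetical, and the demonstration uses the Python interpreter as a stand-in for an assembler executable.

```python
import subprocess
import sys

def run_assembler(argv, work_dir=None):
    """Run a command and return (exit_code, stdout_text, stderr_text).

    The two streams are captured into separate buffers, so their relative
    ordering is lost.
    """
    result = subprocess.run(argv, cwd=work_dir, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr

if __name__ == "__main__":
    # Stand-in command: prints one line to each stream and exits with zero.
    fake = [sys.executable, "-c",
            "import sys; print('assembled'); print('warning', file=sys.stderr)"]
    code, out, err = run_assembler(fake)
    print("exit code:", code)
    print("stdout:", out.strip())
    print("stderr:", err.strip())
```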
The output will show the assembler's exit code, which will be zero on success (though some assemblers return zero even when they fail). If it appeared to succeed, SourceGen will then compare the assembler's output to the original file, and report any differences.
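The comparison step amounts to a byte-for-byte check. A minimal sketch, with a hypothetical function name and report format, could be:

```python
def compare_outputs(original, assembled):
    """Return a list of human-readable problems; empty means identical."""
    problems = []
    if len(original) != len(assembled):
        problems.append(
            f"length mismatch: original {len(original)}, "
            f"assembled {len(assembled)}"
        )
    # Report only the first differing byte within the common length.
    for offset, (want, got) in enumerate(zip(original, assembled)):
        if want != got:
            problems.append(
                f"first difference at offset ${offset:04x}: "
                f"expected ${want:02x}, got ${got:02x}"
            )
            break
    return problems

print(compare_outputs(b"\x54\x02\x01", b"\x54\x01\x02"))
```

In this example the operand bytes of an MVN instruction are swapped, which is the kind of difference the cc65 v2.17 bug would produce.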
Failures here may be due to bugs in the cross-assembler or in SourceGen. However, SourceGen can generally work around assembler bugs, so any failure is an opportunity for improvement.
Assembler-Specific Bugs & Quirks
This is a list of bugs and quirky behavior in cross-assemblers that SourceGen works around when generating code.
Every assembler seems to have a different way of dealing with expressions. Most of them will let you group expressions with parentheses, but that doesn't always help. For example, PEA label >> 8 + 1 is perfectly valid, but writing PEA (label >> 8) + 1 will cause most assemblers to assume you're trying to use an alternate (and non-existent) form of PEA with indirect addressing, causing the assembler to halt with an error message. The code generator needs to understand expression syntax and operator precedence to generate correct code, but also needs to know how to handle the corner cases.
64tass
Code is generated for 64tass v1.53.1515 or later.
Bugs:
- Undocumented opcode SHA (ZP),Y ($93) is not supported; the assembler appears to be expecting SHA ABS,X instead.
- COP and WDM are not allowed to have operands.
Quirks:
- The underscore character ('_') is allowed as a character in labels, but when used as the first character in a label it indicates the label is local. If you create labels with leading underscores that are not local, the labels must be altered to start with some other character, and made unique.
- Labels starting with two underscores are "reserved". Trying to use them causes an error.
- By default, 64tass sets the first two bytes of the output file to the load address. The --nostart flag is used to suppress this.
- By default, 64tass is case-insensitive, but SourceGen treats labels as case-sensitive. The --case-sensitive flag must be passed to the assembler.
- If you set the --case-sensitive flag, all opcodes and operands must be lower-case. Most of the SourceGen options used to show things in upper case must be disabled.
- For 65816, selecting the bank byte is done with the back-quote ('`') rather than the caret ('^'). (There's a note in the docs to the effect that they plan to move to carets.)
- By default, the assembler assumes that the input is PETSCII, but doesn't convert characters in text strings. So PETSCII source files generate PETSCII strings, and ASCII source files generate ASCII strings. However, if you use the built-in "screen" encoding, you will get the wrong behavior if you compile an ASCII source without the "--ascii" command-line flag, because it expects to convert from PETSCII. To get the behavior expected of a cross-assembler, the recommended approach seems to be to pass "--ascii" and explicitly define an ASCII encoding for use with ASCII text strings.
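The leading-underscore quirk means some labels have to be renamed and kept unique. One possible approach is sketched below; the function name, the "X" prefix, and the numbering scheme are assumptions for illustration and are not taken from SourceGen.

```python
def fix_leading_underscores(labels, prefix="X"):
    """Map each label that starts with '_' to a new, unique name.

    Labels that don't start with an underscore are left out of the mapping.
    This also covers names with two leading underscores.
    """
    taken = set(labels)
    mapping = {}
    for label in labels:
        if not label.startswith("_"):
            continue
        candidate = prefix + label
        counter = 0
        # Keep trying until the new name doesn't clash with anything.
        while candidate in taken:
            counter += 1
            candidate = f"{prefix}{counter}{label}"
        taken.add(candidate)
        mapping[label] = candidate
    return mapping

print(fix_leading_underscores(["_start", "X_start", "loop", "__tmp"]))
```

Here "_start" can't simply become "X_start" because that name is already in use, so a counter is added.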
Notes:
- The "default text encoding" project property determines the text encoding for the entire file. For non-ASCII projects, a small encoding table is output at the top of the file. This works for C64 PETSCII and C64 screen codes, but not for high ASCII. This is done without passing "--ascii" on the command line. If the source file is converted to PETSCII, the encoding table should be removed.
ACME
Code is generated for ACME v0.96.4 or later.
Bugs:
- The "pseudo PC" is only 16 bits, so any 65816 code targeted to run outside bank zero cannot be assembled. SourceGen currently deals with this by outputting the entire file as a hex dump.
- Undocumented opcode $AB (LAX #imm) generates an error.
- WDM is not allowed to have an operand.
Quirks:
- The assembler shares some traits with one-pass assemblers. In particular, if you forward-reference a zero-page label, the reference generates a 16-bit absolute address instead of an 8-bit zero-page address. Unlike other one-pass assemblers, the width is "sticky", and backward references appearing later in the file also use absolute addressing even though the proper width is known at that point. This is worked around by using explicit "force zero page" annotations on all references to zero-page labels.
- Undocumented opcode ALR ($4b) uses mnemonic ASR instead.
- Does not allow the accumulator to be specified explicitly as an operand, e.g. you can't write LSR A.
- Syntax for MVN/MVP doesn't allow '#' before 8-bit operands.
- Officially, the preferred file extension for ACME source code is ".a", but this is already used on UNIX systems for static libraries (which means shell filename completion tends to ignore them). Since ".S" is pretty universally recognized as assembly source, code generated by SourceGen for ACME also uses ".S".
cc65
Code is generated for cc65 v2.17 or v2.18.
Bugs:
- PC relative branches don't wrap around at bank boundaries.
- [Fixed in v2.18] The arguments to MVN/MVP are reversed.
- [Fixed in v2.18] BRK <arg> is assembled to opcode $05 rather than $00.
- [Fixed in v2.18] WDM is not supported.
Quirks:
- Operator precedence is unusual. Consider label >> 8 - 16. cc65 puts shift higher than subtraction, whereas languages like C and assemblers like 64tass do it the other way around. So cc65 regards the expression as (label >> 8) - 16, while the more common interpretation would be label >> (8 - 16). (This is actually somewhat convenient, since none of the expressions SourceGen currently generates require parentheses.)
- Undocumented opcode SBX ($cb) uses the mnemonic AXS. All other opcodes match up with the "unintended opcodes" document.
- ca65 is implemented as a single-pass assembler, so label widths can't always be known in time. For example, if you use some zero-page labels, but they're defined via .ORG $0000 after the point where the labels are used, the assembler will already have generated them as absolute values. Width disambiguation must be applied to operands that wouldn't be ambiguous to a multi-pass assembler.
- The assembler is geared toward generating relocatable code with multiple segments (it is, after all, an assembler for a C compiler). A linker configuration script is expected to be provided for anything complex. SourceGen generates a custom config file for each project.
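The precedence difference in the first quirk can be made concrete with a small precedence-climbing sketch. It only shows how the same token stream groups under two precedence tables; the tables model just the two operators from the example and are assumptions for illustration, not either assembler's full grammar.

```python
import re

# Higher number binds tighter. Only ">>" and "-" are modeled.
CC65_STYLE = {">>": 2, "-": 1}  # shift binds tighter than subtraction
C_STYLE = {">>": 1, "-": 2}     # subtraction binds tighter than shift

def tokenize(text):
    """Split into identifiers/numbers and the two supported operators."""
    return re.findall(r"\w+|>>|-", text)

def group(tokens, precedence, min_prec=1):
    """Return a fully parenthesized string (left-associative operators).

    Consumes items from the front of the tokens list.
    """
    left = tokens.pop(0)
    while tokens and precedence[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = group(tokens, precedence, precedence[op] + 1)
        left = f"({left} {op} {right})"
    return left

expr = "label >> 8 - 16"
print(group(tokenize(expr), CC65_STYLE))
print(group(tokenize(expr), C_STYLE))
```

The first call groups the shift first and the second groups the subtraction first, matching the two interpretations described in the quirk.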
Merlin 32
Code is generated for Merlin 32 v1.0.
Bugs:
- PC relative branches don't wrap around at bank boundaries.
- For some failures, an exit code of zero is returned.
- Some DP indexed store instructions cause errors if the label isn't unambiguously DP (e.g. STX $00,X vs. STX $0000,X). This isn't a problem with project/platform symbols, which are output as two-digit hex values when possible, but causes failures when direct page locations are included in the project and given labels.
- The check for 64KiB overflow appears to happen before instructions that might be absolute or direct page are resolved and reduced in size. This makes it unlikely that a full 64KiB bank of code can be assembled.
Quirks:
- Operator precedence is unusual. Expressions are generally processed from left to right. The byte-selection operators have a lower precedence than all of the others, and so are always processed last.
- The byte selection operators ('<', '>', '^') are actually word-selection operators, yielding 16-bit values when wide registers are enabled on the 65816.
- Values loaded into registers are implicitly mod 256 or 65536. There is no need to explicitly mask an expression.
- The assembler tracks register widths when it sees SEP/REP instructions, but doesn't attempt to track the emulation flag. So if you issue a REP #$20 while in emulation mode, the assembler will incorrectly assume long registers. Ideally it would be possible to configure that off, but there's no way to do that, so instead we occasionally generate additional width directives.
- Non-unique local labels should cause an error, but don't.
- No undocumented opcodes are supported.
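The left-to-right evaluation described in the first two quirks can be illustrated with a toy evaluator. This is a simplified model, not Merlin 32's parser: it handles only decimal integers and the four arithmetic operators, it assumes '<', '>' and '^' select by shifting 0, 8 and 16 bits, and it leaves any masking to the register width as noted above.

```python
import re

# Assumed meaning of the selection prefixes: shift amount in bits.
SHIFT = {"<": 0, ">": 8, "^": 16}

def eval_left_to_right(expr):
    """Evaluate strictly left to right, applying any selection prefix last."""
    expr = expr.replace(" ", "")
    select = None
    if expr and expr[0] in SHIFT:
        select, expr = expr[0], expr[1:]
    tokens = re.findall(r"\d+|[-+*/]", expr)
    value = int(tokens[0])
    # Pair each operator with the operand that follows it.
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        number = int(operand)
        if op == "+":
            value += number
        elif op == "-":
            value -= number
        elif op == "*":
            value *= number
        else:
            value //= number
    if select is not None:
        value >>= SHIFT[select]
    return value

print(eval_left_to_right("2+3*4"))      # 20, not 14
print(eval_left_to_right(">4096+256"))  # 17: the sum is computed first
```

With conventional precedence the first expression would be 14, and if the selection operator bound tightly the second would be 272.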