mirror of
https://github.com/fadden/6502bench.git
synced 2025-01-23 04:30:48 +00:00
2a2aadffec
Update documentation. Add some information about OMF relocation data as well. Fix bug in B=K handling.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<head>
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
<title>Intro - 6502bench SourceGen</title>
|
|
</head>
|
|
|
|
<body>
|
|
<div id="content">
|
|
<h1>6502bench SourceGen: Intro</h1>
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
<h2><a name="overview">Overview</a></h2>
|
|
|
|
<p>SourceGen converts 6502/65C02/65816 machine-language programs to
|
|
assembly-language source.</p>
|
|
|
|
<p>SourceGen has two purposes. The first is to be a really nice
|
|
disassembler for the 6502 and related CPUs. Code tracing with status
|
|
flag tracking makes it easier to separate the code from the data,
|
|
automatic formatting of character strings and filled-data areas helps
|
|
get the data regions sorted out, and modern IDE-style features like
|
|
cross-reference generation and color-highlighted bookmarks help
|
|
navigate the code while trying to figure out what it does. A
|
|
disassembler should help you understand the code, not just dump the
|
|
instructions to a text file.</p>
|
|
<p>The computer I built back in 2014 has a 4GHz CPU and 8GB of RAM. I
|
|
figured we should put the power of modern computing hardware to good use.</p>
|
|
|
|
<p>The second purpose is to facilitate sharing and collaboration. Most
|
|
disassemblers generate output for a specific assembler, or in a way that's
|
|
generic enough to match most any assembler; either way, you're left with
|
|
a text file in somebody's idea of the "correct" format. SourceGen keeps
|
|
everything in an assembler-neutral format, and provides numerous options
|
|
for customizing the display, so that multiple people viewing the same
|
|
project can each do so with the conventions they are accustomed to.
|
|
Code and data operands can be formatted in various numeric formats or
|
|
as symbols.
|
|
The project file uses a text format that is fairly diff-friendly, so
|
|
sharing projects through git works reasonably well. If you want source
|
|
code you can assemble, SourceGen will generate code optimized for the
|
|
assembler of your choice.</p>
|
|
|
|
<p>The sharing and collaboration ideas only work if the formatting
|
|
capabilities within SourceGen are sufficiently flexible. If you need to
|
|
generate assembly source and tweak it a bunch to express the intent of
|
|
the original code, then passing a SourceGen project around won't work.
|
|
This sort of thing is a bit outside the bounds of what a typical
|
|
disassembler does, so it remains to be seen whether SourceGen succeeds at
|
|
what it's trying to do, and also whether what it's trying to do is
|
|
something that people actually want.</p>
|
|
|
|
<p>You can get started by watching the
|
|
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and playing with the
|
|
<a href="tutorials.html">tutorials</a>.</p>
|
|
|
|
|
|
<h2><a name="fundamental-concepts">Fundamental Concepts</a></h2>
|
|
|
|
<p>The next few sections present some general concepts and terminology. The
|
|
rest of the documentation assumes you've read and understood this.</p>
|
|
<p>It will be helpful if you already understand something about the 6502
|
|
instruction set and assembly-language programming, but disassembling
|
|
other programs is actually a pretty good way to learn how to code in
|
|
assembly. You will need to be familiar with hexadecimal numbers and
|
|
general programming concepts to make sense of this, however.</p>
|
|
|
|
<h2><a name="begin">About 6502 Code</a></h2>
|
|
|
|
<p>For brevity's sake, "6502 code" should be taken to mean "code for
|
|
the 6502 CPU or any of its derivatives, including but not limited to
|
|
the 65C02 and 65816". So let's talk about 6502 code.</p>
|
|
|
|
<p>Code usually arrives in a big binary blob. Some of it will be
|
|
instructions, some of it will be data, some will be empty space used
|
|
for variable storage. Part of the challenge of disassembly is
|
|
identifying which parts of the file contain which.</p>
|
|
|
|
<p>Much of the code you'll find for the 6502 was written by humans,
|
|
rather than generated by a compiler, which means it won't conform to a
|
|
standard set of conventions. However, most programmers will use
|
|
subroutines, which can be identified and analyzed in isolation. Subroutines
|
|
are often interspersed with variable storage, referred to as a "stash".
|
|
Variables and constants may be single-byte or multi-byte, the latter
|
|
typically in little-endian byte order.</p>
|
|
|
|
<p>Much of the data in a typical program is read-only, often in the
|
|
form of graphics or character string data. Graphics can be difficult
|
|
to recognize automatically, but strings can be identified with a
|
|
reasonable degree of confidence. Address tables, which are a collection
|
|
of addresses to other things, are also fairly common.</p>
|
|
|
|
<p>A simple disassembler would start at the top of the file and just
|
|
start converting bytes to instructions. Unfortunately there's no reliable
|
|
way to tell the difference between instructions, data, and variable
|
|
stashes. When the converter hits data bytes it'll start generating
|
|
instructions that won't make sense. You'll have another problem when the
|
|
data ends and code resumes: 6502 instructions are variable-length, so if
|
|
the last byte of the data area looks like the opcode of a three-byte
instruction, the first two bytes of the code that follows will be gobbled
up as its operand.</p>
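<p>The misalignment problem is easy to reproduce. The Python sketch below is
illustrative only: the opcode table is a hand-picked subset of the 6502
instruction set, and <code>linear_sweep</code> is a hypothetical helper, not
something SourceGen provides. It sweeps over one data byte followed by two
real instructions.</p>

```python
# Instruction names and lengths for a handful of 6502 opcodes (illustrative subset).
OPCODES = {
    0x18: ("CLC", 1),
    0x4C: ("JMP", 3),
    0x60: ("RTS", 1),
    0xA9: ("LDA #", 2),
}

def linear_sweep(data):
    """Decode bytes from the start, treating everything as instructions."""
    out = []
    offset = 0
    while len(data) > offset:
        name, length = OPCODES.get(data[offset], ("???", 1))
        out.append((offset, name))
        offset += length
    return out

# One data byte ($4C, which looks like JMP) followed by LDA #$01 / RTS.
blob = bytes([0x4C, 0xA9, 0x01, 0x60])
print(linear_sweep(blob))      # the LDA is swallowed as the JMP operand
print(linear_sweep(blob[1:]))  # starting at the real code decodes correctly
```

<p>The first call reports a JMP at offset 0 and an RTS at offset 3; the LDA
never shows up because its two bytes were consumed as an operand.</p>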
|
|
|
|
<p>To make things even more difficult (sometimes deliberately), programmers
|
|
will sometimes use a trick where they "embed" an instruction
|
|
inside another instruction. This allows code to branch to two different
|
|
entry points, one of which will set a flag or load a register, and then
|
|
continue on to common code.</p>
|
|
|
|
<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
|
|
The called subroutine pulls the caller's address off the stack, uses it to
|
|
access the parameters, then pushes the address back on after modifying it to
|
|
point to an address past the inline data. This can be very confusing
|
|
for the disassembler, which will try to interpret the inline data as
|
|
instructions.</p>
|
|
|
|
<p>Sometimes code is loaded at one location, then moved to another and
|
|
executed there. If you're disassembling an executing program you don't
|
|
have to worry about this, but if you're disassembling the binary from the
|
|
loadable file on disk then you need to track the address changes. The
|
|
address is communicated to the assembler with a "pseudo-opcode", usually
|
|
something like "ORG". Other pseudo-op directives are used to define external
|
|
symbols and (for 65816 code) register widths.</p>
|
|
|
|
<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
|
|
range from $0000 to $ffff. (I'm going to write hex values with a
|
|
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
|
|
that's what 6502 systems commonly used.) The 65816 has a 24-bit address
|
|
space, but it's not contiguous -- a branch that extends past the end of a
64KiB "bank" will wrap around to the start of that bank. For instructions
with 16-bit operands, the bank is supplied by the program bank register
for code addresses and by the data bank register for data addresses.
|
|
The disassembler can't always discern the value of the data bank
|
|
register through static analysis, so some user input may be required.</p>
|
|
|
|
<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
|
|
in it. Some of the flags determine whether a conditional branch is taken
|
|
or not, which is important because some branches appear to be conditional
|
|
but actually are always or never taken in practice. The disassembler needs
|
|
to be able to figure this out so that it doesn't try to disassemble the
|
|
bytes that follow an always-taken branch.
|
|
A more significant concern is the M and X flags found on the 65802/65816,
|
|
which determine the width of the registers and of immediate load
|
|
instructions. If you don't know what state the flags are in, you can't
|
|
know whether <code>LDA #value</code> is two bytes or three, and the
|
|
disassembly of the instruction stream will come out wrong.</p>
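<p>To make the width problem concrete, here is a small Python sketch. The
function name is made up for this example; the only facts it relies on are
that <code>LDA #value</code> is opcode $A9 and that M=1 selects an 8-bit
accumulator.</p>

```python
def lda_imm_length(m_flag):
    """Length in bytes of LDA #value (opcode $A9) on the 65816.

    m_flag is 1 for an 8-bit accumulator, 0 for 16-bit, None if unknown.
    """
    if m_flag is None:
        raise ValueError("M flag unknown; instruction length is ambiguous")
    return 2 if m_flag == 1 else 3

# A9 34 12 60: with M=1 this is LDA #$34 followed by opcode $12;
# with M=0 it is LDA #$1234 followed by RTS ($60).
stream = bytes([0xA9, 0x34, 0x12, 0x60])
for m in (1, 0):
    nxt = lda_imm_length(m)
    print(f"M={m}: next opcode is ${stream[nxt]:02x}")
```

<p>The same four bytes produce two different instruction streams, which is
why the flag state has to be known before decoding can continue.</p>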
|
|
|
|
<p>Some addresses correspond to memory-mapped I/O, rather than RAM or ROM.
|
|
Accessing the address can have side effects, like changing between text
|
|
and graphics modes. Sometimes reading and writing have different effects.
|
|
For example, on later models of the Apple II, reading from
|
|
$C000 returns the most recently hit key, while writing to $C000 changes
|
|
how 80-column display memory is mapped.</p>
|
|
<p>On a few systems, such as the Atari 2600, RAM, ROM, and registers can
|
|
appear at multiple locations, "mirrored" across the address space.</p>
|
|
|
|
<h3><a name="charenc">Character Encoding</a></h3>
|
|
|
|
<p>The American Standard Code for Information Interchange (ASCII) was
|
|
developed in the 1960s, and became widely used as the means for representing
|
|
text data on a computer. It's compatible with Unicode, in that the
|
|
binary representation of an ASCII string is exactly the same when
|
|
expressed as a Unicode string with UTF-8 encoding.</p>
|
|
<p>Not all 6502-based computers used ASCII, notably those from Commodore
|
|
International (e.g. PET, VIC-20, 64, 128), which used variants
|
|
collectively known as "PETSCII". PETSCII had most of the same symbols,
|
|
but rearranged them, and added a number of graphical symbols. This was
|
|
further complicated by the use of two different character sets, one of
|
|
which dropped lower-case letters in favor of additional symbols, and
|
|
the use of a separate encoding for characters stored in the text frame
|
|
buffer ("screen codes").</p>
|
|
<p>Apple II computers were based on ASCII, but tended to store bytes
|
|
with the high bit set rather than clear. This is known as "high ASCII".</p>
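<p>Converting between plain and high ASCII is just a matter of bit 7. A
minimal Python sketch (the helper names are invented for this example):</p>

```python
def high_ascii_to_str(data):
    """Decode Apple II "high ASCII" by clearing bit 7 of each byte."""
    return bytes(b % 128 for b in data).decode("ascii")

def str_to_high_ascii(text):
    """Encode ASCII text with bit 7 set on every byte."""
    return bytes(b + 128 for b in text.encode("ascii"))

raw = bytes([0xC8, 0xC5, 0xCC, 0xCC, 0xCF])
print(high_ascii_to_str(raw))
```

<p>The five bytes shown decode to "HELLO".</p>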
|
|
|
|
<p>SourceGen allows you to specify that a string is encoded with ASCII,
|
|
High ASCII, C64 PETSCII, or C64 Screen Codes. Because the goal is to
|
|
generate assembly sources for cross-assemblers, the C64 character
|
|
support is limited to the set that overlaps with ASCII.</p>
|
|
<p>For the most part only printable characters are accepted in strings,
|
|
but certain control characters are also allowed. The characters for
|
|
bell ($07), linefeed ($0a), and carriage return ($0d) are recognized as
|
|
string data, and in C64 PETSCII a number of text color and formatting
|
|
control codes are also allowed.</p>
|
|
|
|
|
|
<h2><a name="sgintro">How SourceGen Works</a></h2>
|
|
|
|
<p>SourceGen employs a partial emulation technique that traces the flow
|
|
of execution. Most of what a given instruction does isn't important;
|
|
only its effect on the flow of execution matters.</p>
|
|
|
|
<p>The code tracing has to start somewhere, so SourceGen uses "code entry
|
|
point hints" to identify places where execution may begin. By default,
|
|
a hint is placed at the start of the file. From there, the tracing process
|
|
walks through the code, pursuing all branches. In many cases, if you
|
|
mark all external entry points, SourceGen will automatically find all
|
|
executable code and separate it from variable storage and data areas.</p>
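<p>The idea can be sketched in a few lines of Python. This is a toy, not
SourceGen's analyzer: it knows five opcodes, assumes the file loads at
$1000, and ignores status flags entirely. The names <code>OPS</code>,
<code>ORG</code>, and <code>trace</code> are made up for the example.</p>

```python
ORG = 0x1000  # load address assumed for this example

# opcode -> (length, kind); a tiny illustrative subset of the 6502 set
OPS = {
    0x18: (1, "next"),    # CLC
    0x20: (3, "call"),    # JSR abs
    0x4C: (3, "jump"),    # JMP abs
    0x60: (1, "stop"),    # RTS
    0xD0: (2, "branch"),  # BNE rel
}

def trace(data, entry_offsets):
    """Return the sorted offsets of every instruction reachable from the entries."""
    code = set()
    pending = list(entry_offsets)
    while pending:
        off = pending.pop()
        if off in code or off not in range(len(data)):
            continue  # already visited, or outside the file
        if data[off] not in OPS:
            continue  # unknown opcode: stop following this path
        length, kind = OPS[data[off]]
        if off + length > len(data):
            continue  # instruction runs off the end of the file
        code.add(off)
        if kind in ("jump", "call"):
            target = data[off + 1] + data[off + 2] * 256  # little-endian
            pending.append(target - ORG)
        elif kind == "branch":
            disp = data[off + 1]
            if disp >= 128:
                disp -= 256  # signed 8-bit displacement
            pending.append(off + 2 + disp)
        if kind in ("next", "call", "branch"):
            pending.append(off + length)  # execution can fall through
    return sorted(code)

# JSR $1006 / RTS / two data bytes / CLC / BNE $1006 / RTS
data = bytes([0x20, 0x06, 0x10, 0x60, 0x4C, 0x4C, 0x18, 0xD0, 0xFD, 0x60])
print(trace(data, [0]))
```

<p>Starting from offset 0, the tracer finds instructions at offsets 0, 3,
6, 7, and 9. Offsets 4 and 5 are never reached, so they would be treated as
data even though $4C looks like a JMP opcode.</p>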
|
|
|
|
<p>As noted earlier, tracking the processor status flags can make the
|
|
analysis more accurate. Identifying situations where a branch instruction
|
|
is always or never taken avoids mis-categorizing a data region as code.
|
|
On the 65816, it's absolutely crucial to track the M/X flags, since those
|
|
affect the width of instructions. SourceGen tracks the value of the
|
|
processor flags at every instruction, blending sets of flags together when
|
|
multiple paths of execution converge.</p>
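<p>One way to picture the blending step is with three-valued flags: 0, 1,
or "don't know". The Python sketch below assumes that representation; it is
a guess at the general approach, not a description of SourceGen's internal
data structures.</p>

```python
def merge_flags(a, b):
    """Blend two flag states where execution paths converge.

    Each state maps a flag name to 0, 1, or None (indeterminate).
    A flag keeps its value only if both paths agree on it.
    """
    return {name: a[name] if a[name] == b[name] else None for name in a}

path1 = {"C": 0, "Z": 1, "M": 1}
path2 = {"C": 1, "Z": 1, "M": 1}
print(merge_flags(path1, path2))
```

<p>Here the carry flag becomes indeterminate, while Z and M stay known
because both paths arrive with the same value.</p>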
|
|
|
|
<p>Once instructions and data have been separated, the instruction operands
|
|
can be examined. Branches, loads, and stores that reference an address
|
|
that falls inside the address space covered by the file can be replaced
|
|
with a symbol. Operands that refer to addresses outside the file, such
|
|
as ROM or operating system routines, can be replaced with a symbol defined
|
|
by an equate directive.</p>
|
|
|
|
<p>(For more details on how this works, see the
<a href="analysis.html">analysis appendix</a>.)</p>
|
|
|
|
|
|
<h3><a name="scripts">Extension Scripts</a></h3>
|
|
|
|
<p>Extension scripts are C# source files that are compiled and
|
|
executed by SourceGen. They can be added to a project from SourceGen's
|
|
runtime data directory, or can live in the directory next to the project
|
|
file.</p>
|
|
<p>In the current implementation, scripts are only called to examine
|
|
JSR, JSL, and BRK instructions. They can format nearby bytes as inline
|
|
data, or apply symbols to operands.</p>
|
|
|
|
<p>To reduce the chances of a script causing problems, all scripts are
|
|
executed in a sandbox with severely restricted access. Notably, nothing
|
|
in the sandbox can access files, except to read files from the PluginDll
|
|
directory.</p>
|
|
<p>The PluginDll directory lives next to the SourceGen executable, and
|
|
contains all of the compiled script DLLs, as well as two pre-built
|
|
application DLLs that plugins are allowed access to. The contents
|
|
are persistent, to avoid recompiling the scripts every time SourceGen
|
|
is launched, but may be manually deleted without harm.</p>
|
|
<p>More details can be found in the
|
|
<a href="advanced.html#extension-scripts">advanced topics</a> section.</p>
|
|
|
|
|
|
<h3><a name="hints">Analyzer Hints</a></h3>
|
|
|
|
<p>Sometimes SourceGen can't automatically find the start or end of an
|
|
instruction stream, or gets confused by inline data. These situations
|
|
can be resolved by adding an appropriate hint.</p>
|
|
|
|
<p><b>Code entry point hints</b> tell the analyzer to add the offset
|
|
to the list of instruction start points. Suppose you've got a code
|
|
library that begins with jump vectors, like this:</p>
|
|
<pre>
|
|
1000: 4c0910 JMP $1009
|
|
1003: 4cef10 JMP $10ef
|
|
1006: 4c3012 JMP $1230
|
|
1009: 18 CLC
|
|
</pre>
|
|
|
|
<p>When opened with SourceGen, it will look like this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
|
|
.DD1 $4c
|
|
.DD1 $ef
|
|
.DD1 $10
|
|
.DD1 $4c
|
|
.DD1 $30
|
|
.DD1 $12
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>SourceGen doesn't see any code that jumps to $1003 or $1006, so it
|
|
assumes those are data. Further, the functions at those addresses may
|
|
also be considered data unless some bit of code reachable from L1009
|
|
calls into them. If you add a code hint to $1003 and $1006,
|
|
you'll get better results:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
JMP L10ef
|
|
JMP L1230
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>Be careful that you only add hints to the instruction opcode. If
|
|
you applied hints to the full range of bytes from $1003 to $1008, you would
|
|
end up with this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
JMP ▼ L10ef
|
|
BPL ▼ L1053
|
|
JMP ▼ L1230
|
|
BMI L101b
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>The exact set of instructions shown depends on your CPU configuration.
|
|
The problem is that the bytes in the middle of the instruction have
|
|
been marked as entry points, and SourceGen is treating them as
|
|
embedded instructions. $EF and $12 aren't valid 6502 opcodes, so
|
|
they're being ignored, but $10 is BPL and $30 is BMI. Because hinting
|
|
multiple consecutive bytes is rarely useful, SourceGen only applies code
|
|
hints to the first byte in a selected line.</p>
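<p>If you're wondering where L1053 and L101b came from, the branch targets
can be worked out by hand. This Python sketch (with a made-up helper name)
does the arithmetic for the two embedded instructions in the listing
above.</p>

```python
def branch_target(addr, operand):
    """Target of a 6502 relative branch whose opcode is at addr."""
    disp = operand - 256 if operand >= 128 else operand  # signed 8-bit
    return (addr + 2 + disp) % 0x10000  # wrap within the 64KiB space

code = bytes.fromhex("4c09104cef104c301218")  # the bytes at $1000-$1009
org = 0x1000
for addr in (0x1005, 0x1007):
    opcode, operand = code[addr - org], code[addr - org + 1]
    print(f"${addr:04x}: opcode ${opcode:02x} -> L{branch_target(addr, operand):04x}")
```

<p>The byte at $1005 is $10 (BPL) with operand $4c, giving $1053; the byte
at $1007 is $30 (BMI) with operand $12, giving $101b.</p>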
|
|
|
|
<p><b>Data hints</b> tell the analyzer when it should stop. For example,
|
|
suppose address $ff00 is known to always be nonzero, and the code uses
|
|
that fact to get a branch-always on the 6502:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA $ff00
|
|
BNE L1010
|
|
BRK $11
|
|
</pre>
|
|
|
|
<p>By placing a data hint on the BRK, you're telling the analyzer that
|
|
it should stop the current path of execution. (Note that this example
|
|
would actually be better solved by setting a status flag override on
|
|
the BNE that sets Z=0, so the code tracer will know it's a branch-always
|
|
and do the right thing.) It's only necessary to place a hint on the
|
|
very first (opcode) byte. Placing a data hint in the middle of what
|
|
SourceGen believes to be an instruction will have no effect.</p>
|
|
<p>As with code hints, only the first byte in each selected line will
|
|
be hinted.</p>
|
|
|
|
<p><b>Inline data hints</b> identify bytes as being part of the
|
|
instruction stream, but not instructions. A simple example of this
|
|
is the ProDOS 8 call interface on the Apple II, which looks like this:</p>
|
|
<pre>
|
|
JSR $bf00
|
|
.DD1 $function
|
|
.DD2 $address
|
|
BCS BAD
|
|
</pre>
|
|
|
|
<p>The three bytes following the <code>JSR $bf00</code> should be hinted
|
|
as inline data, so that the code analyzer skips them and continues the
|
|
analysis at the <code>BCS</code>. Because you need to hint <i>every</i> byte
|
|
of inline data, all bytes in a selected line will receive hints.</p>
|
|
<p>If code branches into a region that is marked as inline data, the
|
|
branch will be ignored.</p>
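<p>The effect of the inline data hint on the ProDOS example can be sketched
in Python. The function and constant names are invented, and only JSR is
handled; the point is just that tracing resumes three bytes later than it
otherwise would.</p>

```python
MLI_ENTRY = 0xBF00   # ProDOS 8 call entry point, from the example above
INLINE_LEN = 3       # one function byte plus a two-byte address

def next_code_offset(data, off):
    """Offset where tracing resumes after the JSR at off."""
    if data[off] != 0x20:  # JSR abs
        raise ValueError("only JSR is handled in this sketch")
    target = data[off + 1] + data[off + 2] * 256
    if target == MLI_ENTRY:
        return off + 3 + INLINE_LEN  # skip the inline parameters
    return off + 3

# JSR $bf00 / function byte / two-byte address / BCS ...
call = bytes([0x20, 0x00, 0xBF, 0xC8, 0x00, 0x03, 0xB0, 0x10])
print(next_code_offset(call, 0))
```

<p>For the bytes shown, analysis continues at offset 6, which holds the
BCS opcode ($b0).</p>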
|
|
|
|
|
|
<h2><a name="sgconcepts">SourceGen Concepts</a></h2>
|
|
|
|
<p>As you work on a disassembled file, formatting operands and adding
|
|
comments, everything you do is saved in the project file as "meta data".
|
|
None of the data from the file being disassembled is included. This
|
|
should allow project files to be shared without violating the copyright
|
|
of the work being disassembled. (This will vary by region. Also, note
|
|
that the mere act of disassembling a piece of software may be illegal in
|
|
some cases.)</p>
|
|
|
|
<p>To avoid mix-ups where the wrong data file is used, the file's length
|
|
and CRC are stored in the project file. SourceGen will refuse to open a
|
|
project if the data file's length and CRC don't match.</p>
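<p>A check along these lines is simple to write. The text above doesn't say
which CRC is used, so this Python sketch assumes the common CRC-32 as
computed by <code>zlib.crc32</code>; the function names are hypothetical.</p>

```python
import zlib

def fingerprint(data):
    """Return a (length, CRC-32) pair that identifies a data file."""
    return len(data), zlib.crc32(data)

def matches(data, stored_length, stored_crc):
    """True if the data file is the one the project was created for."""
    return fingerprint(data) == (stored_length, stored_crc)

original = b"123456789"
length, crc = fingerprint(original)
print(length, f"{crc:08x}")
print(matches(b"123456780", length, crc))
```

<p>Changing even one byte of the data file changes the CRC, so the second
check fails.</p>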
|
|
|
|
<p>Most of the data in the project file is associated with a file offset.
|
|
When you create a comment, you aren't associating it with line 53, you're
|
|
associating it with the 127th byte in the file. This ensures that, as the
|
|
project evolves, the comment you wrote is always connected to the
|
|
same instruction or data item. This also means you can't have two
|
|
comments on the same line -- each offset only has room for one. By
|
|
convention, file offsets are always shown as a six-digit hexadecimal value
|
|
with a leading '+', e.g. "+0012ab". This makes it easy to distinguish
|
|
between an address and an offset.</p>
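<p>The two notations are easy to produce. A Python sketch, with helper names
chosen for this example:</p>

```python
def format_offset(offset):
    """Format a file offset as '+' plus six lowercase hex digits."""
    return f"+{offset:06x}"

def format_address(addr):
    """Format a 16-bit address with a leading '$'."""
    return f"${addr:04x}"

print(format_offset(0x12AB), format_address(0x12AB))
```

<p>The same value comes out as "+0012ab" when it's an offset and "$12ab"
when it's an address.</p>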
|
|
|
|
<p>Instruction and data operands can be formatted in various ways. The
|
|
formatting choice is associated with the first offset of the item. For
|
|
instructions the number of bytes in the operand is determined by the opcode
|
|
(and, on the 65816, the M/X status flags). For data items the length
|
|
can be a single byte or an entire file. Operand formats are not allowed
|
|
to overlap.</p>
|
|
|
|
<p>When an instruction or data operand references an address, we call
|
|
it a <b>numeric reference</b>. When the target address has a label, and
|
|
the operand uses that symbol, we call that a <b>symbolic reference</b>.
|
|
SourceGen tries to establish symbolic references whenever possible,
|
|
so that the generated assembly source doesn't refer to hard-coded
|
|
locations within the program. Labels are generated automatically for
|
|
the targets of numeric references.</p>
|
|
|
|
<p>As your understanding of the disassembled code develops, you will want
|
|
to add comments explaining it. SourceGen projects have three kinds of
|
|
comments:</p>
|
|
<ol>
|
|
<li>End-of-line comments. As the name implies, these appear at the
|
|
end of a line, to the right of the opcode or operand.</li>
|
|
<li>Long comments, also known as multi-line comments. These get a
|
|
line all to themselves, and may span multiple lines.</li>
|
|
<li>Notes. Like long comments, these get a line to themselves. Unlike
|
|
long comments, these do not appear in generated assembly code. They
|
|
are a way for you to leave notes to yourself, perhaps "don't forget
|
|
to figure this out" or "this is the cool part".</li>
|
|
</ol>
|
|
<p>Every file offset can have one of each.</p>
|
|
|
|
<p>Labels and comments may disappear if you associate them with a file
|
|
offset that is in the middle of a multi-byte instruction or data item.
|
|
For example, suppose you put a long comment at offset +000010, and then
|
|
mark a 50-byte region starting at offset +000008 as an ASCII string. The
|
|
comment won't be deleted, but won't be displayed either. The same thing
|
|
can happen to labels. SourceGen will try to prevent this from happening
|
|
by splitting formatted data into sub-regions at label boundaries.</p>
|
|
|
|
|
|
<h2><a name="about-symbols">All About Symbols</a></h2>
|
|
|
|
<p>A symbol has two basic parts, a label and a value. The label is a short
|
|
ASCII string; the value may be an 8-to-24-bit address or a 32-bit numeric
|
|
constant. Symbols can be defined in different ways, and applied in
|
|
different ways.</p>
|
|
|
|
<p>The label syntax is restricted to a format that should be compatible
|
|
with most assemblers:</p>
|
|
<ul>
|
|
<li>2-32 characters long.</li>
|
|
<li>Starts with a letter or underscore.</li>
|
|
<li>Composed of ASCII letters, numbers, and the underscore.</li>
|
|
</ul>
|
|
<p>Label comparisons are case-sensitive, as is customary for programming
|
|
languages.</p>
|
|
<p>Sometimes the purpose of a subroutine or variable isn't immediately
|
|
clear, but you can take a reasonable guess. You can document your
|
|
uncertainty by adding a question mark ('?') to the end of the label.
|
|
This isn't really part of the label, so it won't appear in the assembled
|
|
output, and you don't have to include it when searching for a symbol.</p>
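<p>The syntax rules and the trailing question mark can be expressed as a
short Python check. This is a sketch of the rules as listed above, not the
validation code SourceGen actually uses; <code>split_label</code> is a
made-up name.</p>

```python
import re

# A letter or underscore, then 1-31 more letters, digits, or underscores.
LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{1,31}")

def split_label(text):
    """Split 'NAME?' into (label, uncertain); raise if the label is invalid."""
    uncertain = text.endswith("?")
    label = text[:-1] if uncertain else text
    if LABEL_RE.fullmatch(label) is None:
        raise ValueError(f"invalid label: {label!r}")
    return label, uncertain

print(split_label("draw_sprite?"))
```

<p>The example prints the bare label along with a flag recording that it
was marked as a guess.</p>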
|
|
<p>Some assemblers restrict the set of valid labels further. For example,
|
|
64tass uses a leading underscore to indicate a local label, and reserves
|
|
a double leading underscore (e.g. <code>__label</code>) for its own
|
|
purposes. In such cases, the label will be modified to comply with the
|
|
target assembler syntax.</p>
|
|
|
|
<p>Operands may use parts of symbols. For example, if you have a label
|
|
<code>MYSTRING</code>, you can write:</p>
|
|
<pre>
|
|
MYSTRING .STR "hello"
|
|
LDA #&lt;MYSTRING
STA $00
LDA #&gt;MYSTRING
|
|
STA $01
|
|
</pre>
|
|
<p>See <a href="#symbol-parts">Parts and Adjustments</a> for more details.</p>
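<p>For a 16-bit address, the low-byte and high-byte parts used in the
example work out as follows. This Python sketch assumes, purely for
illustration, that <code>MYSTRING</code> sits at $1234.</p>

```python
def low_byte(value):
    """Bits 0-7 of the value (the low-byte part of a symbol)."""
    return value % 256

def high_byte(value):
    """Bits 8-15 of the value (the high-byte part of a symbol)."""
    return value // 256 % 256

MYSTRING = 0x1234  # address assumed for this example
print(f"${low_byte(MYSTRING):02x} ${high_byte(MYSTRING):02x}")
```

<p>After the four instructions run, $00 holds $34 and $01 holds $12, which
is the little-endian pointer to the string.</p>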
|
|
|
|
<p>Symbols that represent a memory address within a project are treated
|
|
differently from those outside a project. We refer to these as internal
|
|
and external addresses, respectively.</p>
|
|
|
|
|
|
<h3><a name="connecting-operands">Connecting Operands with Labels</a></h3>
|
|
|
|
<p>Suppose you have the following code:</p>
|
|
<pre>
|
|
LDA $1234
|
|
JSR $2345
|
|
</pre>
|
|
<p>If we put that in a source file, it will assemble correctly.
|
|
However, if those addresses are part of the file, the code may break if
|
|
changes are made and things assemble to different addresses. It would
|
|
be better to generate code that references labels, e.g.:</p>
|
|
<pre>
|
|
LDA my_data
|
|
JSR nifty_func
|
|
</pre>
|
|
<p>SourceGen tries to establish labels for address operands automatically.
|
|
How this works depends on whether the operand's address is inside the file or
|
|
external, and whether there are existing labels at or near the target
|
|
address. The details are explored in the next few sections.</p>
|
|
<p>On the 65816 this process is trickier, because addresses are 24 bits
|
|
instead of 16. For a control-transfer instruction like <code>JSR</code>,
|
|
the high 8 bits come from the Program Bank Register (K). For a data-access
|
|
instruction like <code>LDA</code>, the high 8 bits come from the Data
|
|
Bank Register (B). The PBR value is determined by the address at which
|
|
the code is executing, so it's easy to determine. The DBR value can be
|
|
set arbitrarily. Sometimes it's easy to figure out, sometimes it has
|
|
to be specified manually.</p>
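<p>In other words, the full address is the bank register glued onto the
front of the 16-bit operand. A Python sketch, with arbitrary example values
for the two registers:</p>

```python
def full_address(operand16, is_code, pbr, dbr):
    """Form a 24-bit address from a 16-bit operand and a bank register."""
    bank = pbr if is_code else dbr
    return bank * 0x10000 + operand16

pbr, dbr = 0x02, 0x7E  # example register values, chosen arbitrarily
print(f"JSR $2345 -> ${full_address(0x2345, True, pbr, dbr):06x}")
print(f"LDA $1234 -> ${full_address(0x1234, False, pbr, dbr):06x}")
```

<p>With these values the JSR goes to $022345 and the LDA reads from
$7e1234. If the data bank value is wrong, every data label in the bank
will be wrong too.</p>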
|
|
|
|
|
|
<h3><a name="internal-address-symbols">Internal Address Symbols</a></h3>
|
|
|
|
<p>Symbols that represent an address inside the file being disassembled
|
|
are referred to as <i>internal</i>. They come in two varieties.</p>
|
|
|
|
<p><b>User labels</b> are labels added to instructions or data by the user.
|
|
The editor will try to prevent you from creating a label that has the same
|
|
name as another symbol, but if you manage to do so, the user label takes
|
|
precedence over symbols from other sources. User labels may be tagged
|
|
as non-unique local, unique local, global, or global and exported. Local
|
|
vs. global is important for the label localizer, while exported symbols
|
|
can be pulled directly into other projects.</p>
|
|
|
|
<p><b>Auto labels</b> are automatically generated labels placed on
|
|
instructions or data offsets that are the target of operands. They're
|
|
formed by appending the hexadecimal address to the letter "L", with
|
|
additional characters added if some other symbol has already defined
|
|
that label. Options can be set that change the "L" to a character or
|
|
characters based on how the label is referenced, e.g. "B" for branch targets.
|
|
Auto labels are only added where they are needed, and are removed when
|
|
no longer necessary. Because auto labels may be renamed or vanish, the
|
|
editor will try to prevent you from referring to them explicitly when
|
|
editing operands.</p>
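<p>A label generator in this style might look like the Python sketch below.
The "_0", "_1" suffix scheme is an assumption made for the example; the
text above only says that additional characters are added when the name is
already taken.</p>

```python
def auto_label(addr, taken, prefix="L"):
    """Generate an auto label for addr that does not collide with taken."""
    base = f"{prefix}{addr:04x}"
    label = base
    n = 0
    while label in taken:
        label = f"{base}_{n}"  # suffix format is hypothetical
        n += 1
    return label

print(auto_label(0x10EF, set()))
print(auto_label(0x10EF, {"L10ef"}))
print(auto_label(0x1009, set(), prefix="B"))
```

<p>The third call shows the effect of swapping the prefix for a branch
target.</p>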
|
|
|
|
|
|
<h3><a name="external-address-symbols">External Address Symbols</a></h3>
|
|
|
|
<p>Symbols that represent an address outside the file being disassembled
|
|
are referred to as <i>external</i>. These may be ROM entry points,
|
|
data buffers, zero-page variables, or a number of other things. Because
|
|
the memory addresses they appear at aren't within the bounds of the file,
|
|
we can't simply put an address label on them. Three different mechanisms
|
|
exist for defining them. If an instruction or data operand refers to
|
|
an address outside the file bounds, SourceGen looks for a symbol with
|
|
a matching address value.</p>
|
|
|
|
<p><b>Platform symbols</b> are defined in platform symbol files. These
|
|
are named with a ".sym65" extension, and have a fairly straightforward
|
|
name/value syntax. Several files for popular platforms come with SourceGen
|
|
and live in the <code>RuntimeData</code> directory. You can also create your
|
|
own, but they have to live in the same directory as the project file.</p>
|
|
|
|
<p>Platform symbols can be addresses or constants. Addresses are
|
|
limited to 24-bit values, and are matched automatically. Constants may
|
|
be 32-bit values, but must be specified manually.</p>
|
|
|
|
<p>If two platform symbols have the same label, only the most recently read
|
|
one is kept. If two platform symbols have different labels but the
|
|
same value, both symbols will be kept, but the one in the file loaded
|
|
last will take priority when doing a lookup by address. If symbols with
|
|
the same value are defined in the same file, the one whose label appears
|
|
first alphabetically takes priority.</p>
|
|
|
|
<p>Platform address symbols have an optional width. This can be used
|
|
to define multi-byte items, such as two-byte pointers or 256-byte stacks.
|
|
If no width is specified, a default value of 1 is used. Widths are ignored
|
|
for constants.
|
|
Overlapping symbols are resolved as described earlier, with symbols loaded
|
|
later taking priority over previously-loaded symbols. In addition,
|
|
symbols defined closer to the target address take priority, so if you put
|
|
a 4-byte symbol in the middle of a 256-byte symbol, the 4-byte symbol will
|
|
be visible because the start point is closer to the addresses it covers
|
|
than the start of the 256-byte range.</p>
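<p>Here is one way the width and priority rules could fit together,
sketched in Python. The <code>Sym</code> record and its
<code>load_order</code> field are invented for the example, and the choice
to compare start addresses first and load order second is an assumption
about how the two rules combine.</p>

```python
from collections import namedtuple

# load_order: higher numbers were loaded later (a field assumed for this sketch)
Sym = namedtuple("Sym", "label addr width load_order")

def lookup(symbols, target):
    """Pick the symbol that covers target, or None if nothing does."""
    covering = [s for s in symbols if target - s.addr in range(s.width)]
    if not covering:
        return None
    # Closest start address first; among equals, the most recently loaded.
    best = max(covering, key=lambda s: (s.addr, s.load_order))
    offset = target - best.addr
    return best.label if offset == 0 else f"{best.label}+{offset}"

syms = [Sym("STACK", 0x0100, 256, 0), Sym("FRAME", 0x0180, 4, 0)]
print(lookup(syms, 0x0181), lookup(syms, 0x0190))
```

<p>An access to $0181 resolves to the 4-byte symbol, while $0190 falls
outside it and resolves relative to the 256-byte symbol.</p>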
|
|
|
|
<p>Platform symbols can be designated for reading, writing, or both.
|
|
Normally you'd want both, but if an address is a memory-mapped I/O
|
|
location that has different behavior for reads and writes, you'd want
|
|
to define two different symbols, and have the correct one applied
|
|
based on the access type.</p>
|
|
|
|
<p><b>Project symbols</b> behave like platform symbols, but they are
|
|
defined in the project file itself, through the Project Properties editor.
|
|
The editor will try to prevent you from creating two symbols with the same
|
|
name. If two symbols have the same value, the one whose label comes
|
|
first alphabetically is used.</p>
|
|
|
|
<p>Project symbols always have precedence over platform symbols, allowing
|
|
you to redefine symbols within a project. (You can "hide" a platform
|
|
symbol by creating a project symbol constant with the same name. Use a
|
|
value like $ffffffff or $deadbeef so you'll know why it's there.)</p>
|
|
|
|
<p><b>Local variables</b> are redefinable symbols that are organized
|
|
into tables. They're used to specify labels for zero-page addresses
|
|
and 65816 stack-relative instructions. These are explained in more
|
|
detail in the next section.</p>
|
|
|
|
|
|
<h4><a name="local-vars">How Local Variables Work</a></h4>
|
|
|
|
<p>Local variables are applied to instructions that have zero
|
|
page operands (<code>op ZP</code>, <code>op (ZP),Y</code>, etc.), or
|
|
65816 stack relative operands
|
|
(<code>op OFF,S</code> or <code>op (OFF,S),Y</code>). While they must be
|
|
unique relative to other kinds of labels, they don't have to be unique
|
|
with respect to earlier variable definitions. So you can define
|
|
<code>TMP .EQ $10</code>, and a few lines later define
|
|
<code>TMP .EQ $20</code>. This is handy because zero-page addresses are
|
|
often used in different ways by different parts of the program. For
|
|
example:</p>
|
|
<pre>
|
|
LDA ($00),Y
|
|
INC $02
|
|
... elsewhere ...
|
|
DEC $00
|
|
STA ($01),Y
|
|
</pre>
|
|
<p>If we had given <code>$00</code> the label <code>PTR</code> and
|
|
<code>$02</code> the label <code>COUNT</code> globally,
|
|
the second pair of instructions would look all wrong. With local
|
|
variable tables you can set <code>PTR=$00 COUNT=$02</code> for the first chunk,
|
|
and <code>COUNT=$00 PTR=$01</code> for the second chunk.</p>
|
|
|
|
<p>Local variables have a value and a width. If we create a pair of
|
|
variable definitions like this:</p>
|
|
<pre>
|
|
PTR .eq $00 ;2 bytes
|
|
COUNT .eq $02 ;1 byte
|
|
</pre>
|
|
<p>Then this:</p>
|
|
<pre>
|
|
STA $00
|
|
STX $01
|
|
LDY $02
|
|
</pre>
|
|
<p>Would become:</p>
|
|
<pre>
|
|
STA PTR
|
|
STX PTR+1
|
|
LDY COUNT
|
|
</pre>
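<p>The substitution shown above amounts to a range lookup. A Python sketch,
using a hypothetical <code>format_operand</code> helper and a table that
maps each label to an (address, width) pair:</p>

```python
def format_operand(addr, variables):
    """Render a zero-page operand using a table of (address, width) variables."""
    for label, (start, width) in variables.items():
        if addr - start in range(width):
            offset = addr - start
            return label if offset == 0 else f"{label}+{offset}"
    return f"${addr:02x}"  # no variable covers this address

variables = {"PTR": (0x00, 2), "COUNT": (0x02, 1)}
for addr in (0x00, 0x01, 0x02, 0x03):
    print(format_operand(addr, variables))
```

<p>Address $03 isn't covered by either variable, so it stays in hex.</p>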
|
|
|
|
<p>The scope of a variable definition starts at the point where it is
|
|
defined, and stops when its definition is erased. There are three
|
|
ways for a table to erase an earlier definition:</p>
|
|
<ol>
|
|
<li>Create a new definition with the same name.</li>
|
|
<li>Create a new definition that has an overlapping value. For
|
|
example, if you have a two-byte variable <code>PTR = $00</code>,
|
|
and define a one-byte variable <code>COUNT = $01</code>, the
|
|
definition for <code>PTR</code> will be cleared because its second
|
|
byte overlaps.</li>
|
|
<li>Tables have a "clear previous" flag that erases all previous
|
|
definitions. This doesn't usually cause anything to be generated in the
|
|
assembly sources; instead, it just causes SourceGen to stop using
|
|
that label.</li>
|
|
</ol>
|
|
<p>As you might expect, you're not allowed to have duplicate labels or
|
|
overlapping values in an individual table.</p>
|
|
<p>If a platform/project symbol has the same value as a local variable,
|
|
the local variable is used. If the local variable definition is cleared,
|
|
use of the platform/project symbol will resume.</p>
|
|
<p>Not all assemblers support redefinable variables. In those cases,
|
|
the symbol names will be modified to be unique (e.g. the second definition
|
|
of <code>PTR</code> becomes <code>PTR_1</code>), and variables will have
|
|
global scope.</p>
|
|
|
|
|
|
<h3><a name="unique-local-global">Unique vs. Non-Unique and Local vs. Global</a></h3>
|
|
|
|
<p>Most assemblers have a notion of "local" labels, which have a scope
|
|
that is book-ended by global labels. These are handy for generic branch
|
|
target names like "loop" or "notzero" that you might want to use in
|
|
multiple places.  The exact definition of local label scope varies
|
|
between assemblers, so labels that you want to be local might have to
|
|
be promoted to global (and probably renamed).</p>
|
|
<p>SourceGen has a similar concept with a slight twist: they're called
|
|
non-unique labels, because the goal is to be able to use the same
|
|
label in more than one place. Whether or not they actually turn out
|
|
to be local is a decision deferred to assembly source generation time.
|
|
(You can also declare a label to be a unique local if you like; the
|
|
auto-generated labels like "L1234" do this.)</p>
|
|
<p>When you're writing code for an assembler, it has to be unambiguous,
|
|
because the assembler can't guess at what the output should be. For a
|
|
disassembler, the output is known, so a greater degree of ambiguity is
|
|
tolerable. Instead of throwing errors and refusing to continue, the
|
|
source generator can modify the output until it works.  For example:</p>
|
|
<pre>
|
|
@LOOP LDX #$02
|
|
@LOOP DEX
|
|
BNE @LOOP
|
|
DEY
|
|
BNE @LOOP
|
|
</pre>
|
|
<p>This would confuse an assembler. SourceGen already knows which @LOOP
|
|
is being branched to, so it can just rename one of them to "@LOOP1".</p>
|
|
<p>One situation where non-unique labels cause difficulty is with
|
|
weak symbolic references (see next section). For example, suppose
|
|
the above code then did this:</p>
|
|
<pre>
|
|
         LDA     #&lt;@LOOP
|
|
</pre>
|
|
<p>While it's possible to make an educated guess at which @LOOP was
|
|
meant, it's easy to get wrong. In situations like this, it's best to
|
|
give the labels different names.</p>
|
|
|
|
|
|
<h3><a name="weak-refs">Weak Symbolic References</a></h3>
|
|
|
|
<p>Symbolic references in operands are "weak references". If the named
|
|
symbol exists, the reference is used. If the symbol can't be found, the
|
|
operand is formatted in hex instead. They're called "weak" because
|
|
failing to resolve the reference isn't considered an error.</p>
|
|
|
|
<p>It's important to know this when editing a project. Consider the
|
|
following trivial chunk of code:</p>
|
|
|
|
<pre>
|
|
1000: 4c0310 JMP $1003
|
|
1003: ea NOP
|
|
</pre>
|
|
|
|
<p>When you load it into SourceGen, it will be formatted like this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1003
|
|
L1003 NOP
|
|
</pre>
|
|
|
|
<p>The analyzer found the JMP operand, and created an auto label for
|
|
address $1003. It then created a weak reference to "L1003" in the JMP
|
|
operand.</p>
|
|
|
|
<p>If you edit the JMP instruction's operand to use the symbol "FOO", the
|
|
results are probably not what you want:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP $1003
|
|
NOP
|
|
</pre>
|
|
|
|
<p>This happened because you added a weak reference to "FOO" in the operand,
|
|
but the label doesn't exist. The operand is formatted as hex. Because
|
|
there's no longer a reference to L1003, SourceGen removed the auto-label
|
|
as well.</p>
|
|
|
|
<p>If you set the label "FOO" on the NOP instruction, you'll see what you
|
|
probably wanted:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP FOO
|
|
FOO NOP
|
|
</pre>
|
|
|
|
<p>You don't actually need the explicit reference in the JMP instruction.
|
|
If you edit the JMP operand and set it back to "Default", the code will
|
|
still look the same. This is because SourceGen identified the numeric
|
|
reference, and automatically added a symbolic reference to the label on
|
|
the NOP instruction.</p>
|
|
|
|
<p>However, suppose you didn't actually want FOO as the operand label.
|
|
You can create a project symbol, BAR with the value $1003, and then edit
|
|
the operand to reference BAR instead. Your code would then look like:</p>
|
|
<pre>
|
|
BAR .EQ $1003
|
|
.ORG $1000
|
|
JMP BAR
|
|
FOO NOP
|
|
</pre>
|
|
|
|
<p>If you change the value of BAR in the project symbol file, the operand
|
|
will continue to refer to it, but with an adjustment. For example, if
|
|
you changed BAR from $1003 to $1007, the code would become:</p>
|
|
<pre>
|
|
BAR .EQ $1007
|
|
.ORG $1000
|
|
JMP BAR-4
|
|
FOO NOP
|
|
</pre>
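<p>The behavior shown in this section can be summarized with a small
Python sketch. It assumes the symbol table is a plain dictionary and uses
a made-up function name; the real implementation may differ.</p>
<pre>
```python
# Illustrative only: "symbols" stands in for the project's symbol table.
def format_weak_reference(operand_value, symbol_name, symbols):
    """Format an operand that carries a weak reference to symbol_name."""
    if symbol_name not in symbols:
        return f"${operand_value:04x}"      # unresolved: fall back to hex
    adjustment = operand_value - symbols[symbol_name]
    if adjustment == 0:
        return symbol_name
    return f"{symbol_name}{adjustment:+d}"  # signed adjustment, e.g. BAR-4

print(format_weak_reference(0x1003, "FOO", {}))
print(format_weak_reference(0x1003, "BAR", {"BAR": 0x1003}))
print(format_weak_reference(0x1003, "BAR", {"BAR": 0x1007}))
```
</pre>
<p>The three calls correspond to the missing "FOO" label, the original
BAR definition, and the BAR definition after its value was changed.</p>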
|
|
|
|
<p>If you rename a label, all references to that label are updated. For
|
|
numeric references that happens implicitly. For explicit operand
|
|
references, the weak references are updated individually. (Modern IDEs
|
|
call this "refactoring".)</p>
|
|
<p>If you remove a label, all of the numeric references to it will
|
|
reference something else, probably a new auto label. Weak references
|
|
to the symbol will break and be formatted as hex, but will not be
|
|
removed. Similarly, removing symbols from a platform or project file
|
|
will break the reference but won't modify the operands.</p>
|
|
|
|
<h3><a name="symbol-parts">Parts and Adjustments</a></h3>
|
|
|
|
<p>Sometimes you want to use part of a label, or adjust the value slightly.
|
|
(I use "adjustment" rather than "offset" to avoid confusing it with file
|
|
offsets.) Consider the following example:</p>
|
|
<pre>
|
|
1000: a910 LDA #$10
|
|
1002: 48 PHA
|
|
1003: a906 LDA #$06
|
|
1005: 48 PHA
|
|
1006: 60 RTS
|
|
1007: 4c3aff JMP $ff3a
|
|
</pre>
|
|
|
|
<p>This pushes the address of the JMP instruction ($1007) onto the stack,
|
|
and jumps to it with the RTS instruction. However, RTS requires the
|
|
address of the byte before the target instruction, so we actually push
|
|
$1006.</p>
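<p>The arithmetic is easy to check by hand, but a short Python sketch makes
it explicit. The function name is invented for this example.</p>
<pre>
```python
def rts_push_bytes(target):
    """Return the (high, low) bytes to push so that RTS lands on target."""
    # RTS adds one to the address it pulls, so push target minus one,
    # wrapped to 16 bits.
    return divmod((target - 1) % 0x10000, 0x100)

high, low = rts_push_bytes(0x1007)
print(f"${high:02x} ${low:02x}")
```
</pre>
<p>For a target of $1007 this gives the two immediate values loaded by the
LDA instructions, high byte first.</p>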
|
|
|
|
<p>The disassembler won't know that address $1007 is code because nothing
|
|
appears to reference it. After adding a code hint at $1007, the project
|
|
looks like this:</p>
|
|
<pre>
|
|
LDA #$10
|
|
PHA
|
|
LDA #$06
|
|
PHA
|
|
RTS
|
|
|
|
JMP $ff3a
|
|
</pre>
|
|
|
|
<p>We set a label called "NEXT" on the JMP instruction, and then edit
|
|
the two LDA instructions to reference the high and low parts, yielding:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #>NEXT
|
|
PHA
|
|
         LDA     #&lt;NEXT-1
|
|
PHA
|
|
RTS
|
|
|
|
NEXT JMP $ff3a
|
|
</pre>
|
|
|
|
<p>SourceGen will adjust label values by whatever amount is required to
|
|
generate the original value. If the adjustment seems wrong, make sure
|
|
you're selecting the right part of the symbol.</p>
|
|
|
|
<p>Different assemblers use different syntaxes to form expressions. This
|
|
is particularly noticeable in 65816 code. You can adjust how it appears
|
|
on-screen from the app settings.</p>
|
|
|
|
<h3><a name="nearby-targets">Automatic Use of Nearby Targets</a></h3>
|
|
|
|
<p>Sometimes you want to use a symbol that doesn't match up with the
|
|
operand. SourceGen tries to anticipate situations where that might be
|
|
the case, and apply adjustments for you.</p>
|
|
|
|
<p>Suppose you have the following:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #$00
|
|
STA L1010
|
|
LDA #$20
|
|
STA L1011
|
|
LDA #$e1
|
|
STA L1012
|
|
RTS
|
|
|
|
L1010 .DD1 $00
|
|
L1011 .DD1 $00
|
|
L1012 .DD1 $00
|
|
</pre>
|
|
|
|
<p>Showing stores to three different labeled addresses is fine, but
|
|
the code is actually setting up a single 24-bit address. For clarity,
|
|
you'd like the output to reflect the fact that it's a single, multi-byte
|
|
variable. So, if you set a label at $1010, SourceGen removes the
|
|
nearby auto labels, and sets the numeric references to use your label:</p>
|
|
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #$00
|
|
STA DATA
|
|
LDA #$20
|
|
STA DATA+1
|
|
LDA #$e1
|
|
STA DATA+2
|
|
RTS
|
|
|
|
DATA .DD1 $00
|
|
.DD1 $00
|
|
.DD1 $00
|
|
</pre>
|
|
|
|
<p>If you decide that you really wanted each store to have its own
|
|
label, you can set labels on the other two addresses. SourceGen won't
|
|
search for alternate labels if the numeric reference target has a
|
|
user-defined label.</p>
|
|
|
|
<p>This is also used for self-modifying code. For example:</p>
|
|
<pre>
|
|
1000: a9ff LDA #$ff
|
|
1002: 8d0610 STA $1006
|
|
1005: 4900 EOR #$00
|
|
</pre>
|
|
|
|
<p>The above changes the <code>EOR #$00</code> instruction to
|
|
<code>EOR #$ff</code>. The operand target is $1006, but we can't
|
|
put a label there because it's in the middle of the instruction. So
|
|
SourceGen puts a label at $1005 and adjusts it:</p>
|
|
<pre>
|
|
LDA #$ff
|
|
STA L1005+1
|
|
L1005 EOR #$00
|
|
</pre>
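<p>A simplified model of the nearby-target search is sketched below in
Python. The size of the search window and the function name are
assumptions for illustration; SourceGen's actual rules are more involved
(for example, it treats user-defined labels differently from auto
labels).</p>
<pre>
```python
# Hypothetical sketch: "labels" maps an address to a label name, and
# max_back is an assumed search window, not SourceGen's actual limit.
def nearby_target(target, labels, max_back=2):
    """Find a label at or shortly before target and format the operand."""
    for back in range(max_back + 1):
        name = labels.get(target - back)
        if name is not None:
            return name if back == 0 else f"{name}+{back}"
    return f"${target:04x}"

print(nearby_target(0x1012, {0x1010: "DATA"}))
print(nearby_target(0x1006, {0x1005: "L1005"}))
```
</pre>
<p>The first call reproduces the multi-byte variable case, and the second
reproduces the self-modifying code case.</p>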
|
|
|
|
<p>If you really don't like the way this works, you can disable the
|
|
search for nearby targets entirely from the
|
|
<a href="settings.html#project-properties">project properties</a>.
|
|
Self-modifying code will always be adjusted because of the limitation
|
|
on mid-instruction labels.</p>
|
|
|
|
|
|
<h2><a name="width-disambiguation">Width Disambiguation</a></h2>
|
|
|
|
<p>It's possible to interpret certain instructions in multiple ways.
|
|
For example, "LDA $0000" might be an absolute load from a 16-bit
|
|
address, or it might be a direct page load from an 8-bit address.
|
|
Humans can infer from the fact that it was written with a 4-digit address
|
|
that it's meant to be absolute, but assemblers often treat operands
|
|
purely as numbers, and would just see "LDA 0". Common practice is to
|
|
use the shortest instruction possible.</p>
|
|
<p>Every assembler seems to address the problem in a slightly different
|
|
way. Some use opcode suffixes, others use operand prefixes, some
|
|
allow both. You can configure how they appear in the
|
|
<a href="settings.html#app-settings">application settings</a>.</p>
|
|
<p>SourceGen will only add width disambiguators to opcodes or operands when
|
|
they are needed, with one exception: the opcode suffix for long
|
|
(24-bit address) operations is always applied. This is done because some
|
|
assemblers require it, insisting on "LDAL" rather than "LDA" for an
|
|
absolute long load, and because it can make 65816 code easier to read.</p>
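<p>The "only when needed" test can be pictured as a comparison between
the operand width that was actually encoded and the shortest width that
could hold the value. The Python below is a sketch of that idea with an
invented function name; it ignores the always-applied long suffix
described above.</p>
<pre>
```python
# Illustrative check, not tied to any particular assembler's syntax.
def needs_width_disambiguator(operand_value, operand_bytes):
    """True if the encoded operand is wider than its value requires."""
    if operand_value in range(0x100):
        shortest = 1
    elif operand_value in range(0x10000):
        shortest = 2
    else:
        shortest = 3
    return operand_bytes != shortest

print(needs_width_disambiguator(0x0000, 2))   # absolute load from $0000
print(needs_width_disambiguator(0x0000, 1))   # direct page load from $00
```
</pre>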
|
|
|
|
|
|
<h2><a name="pseudo-ops">Data and Directive Pseudo-Opcodes</a></h2>
|
|
|
|
<p>The on-screen code list shows assembler directives that are similar
|
|
to what the various cross-assemblers provide. The actual directives
|
|
generated for a given assembler may match exactly or be totally different.
|
|
The idea is to represent the concept behind the directive, then let the
|
|
code generator figure out the implementation details.</p>
|
|
|
|
<p>There are seven assembler directives that appear in the code list:</p>
|
|
<ul>
|
|
<li>.EQ - defines a symbol's value. These are generated automatically
|
|
when an operand that matches a platform or project symbol is found.</li>
|
|
<li>.VAR - defines a local variable. These are generated for
|
|
local variable tables.</li>
|
|
<li>.ORG - changes the target address.</li>
|
|
<li>.RWID - specifies the width of the accumulator and index registers
|
|
(65816 only). Note this doesn't change the actual width, just tells
|
|
the assembler that the width has changed.</li>
|
|
<li>.DBANK - specifies what value the Data Bank Register holds
|
|
(65816 only). Used when matching operands to labels.</li>
|
|
<li>.JUNK - indicates that the data in a range of bytes is irrelevant.
|
|
(When generating sources, this will become .FILL or .BULK
|
|
depending on the contents of the memory region and the assembler's
|
|
capabilities.)</li>
|
|
<li>.ALIGN - a special case of .JUNK that indicates the irrelevant
|
|
bytes exist to force alignment to a memory boundary (usually a
|
|
256-byte page). Depending on the memory contents, it may be possible
|
|
to output this as an assembler-specific alignment directive.</li>
|
|
</ul>
|
|
|
|
<p>Every data item is represented by a pseudo-op. Some of them may
|
|
represent hundreds of bytes and span multiple lines.</p>
|
|
<ul>
|
|
<li>.DD1, .DD2, .DD3, .DD4 - basic "define data" op. A 1-4 byte
|
|
little-endian value.</li>
|
|
<li>.DBD2, .DBD3, .DBD4 - "define big-endian data". 2-4 bytes of
|
|
big-endian data. (The 3- and 4-byte versions are not currently
|
|
available in the UI, since they're very unusual and few assemblers
|
|
support them.)</li>
|
|
<li>.BULK - data packed in as compact a form as the assembler allows.
|
|
Useful for chunks of graphics data.</li>
|
|
<li>.FILL - a series of identical bytes. The operand
|
|
has two parts, the byte count followed by the byte value.</li>
|
|
</ul>
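<p>To make the difference between the little-endian and big-endian forms
concrete, here is a minimal Python sketch. The function name is made up
for the example.</p>
<pre>
```python
def define_data(value, width, big_endian=False):
    """Bytes for a .DDn (little-endian) or .DBDn (big-endian) data item."""
    return value.to_bytes(width, "big" if big_endian else "little")

print(define_data(0x1234, 2).hex(" "))                    # like .DD2 $1234
print(define_data(0x1234, 2, big_endian=True).hex(" "))   # like .DBD2 $1234
```
</pre>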
|
|
|
|
<p>In addition, several pseudo-ops are defined for string constants:</p>
|
|
<ul>
|
|
<li>.STR - basic character string.</li>
|
|
<li>.RSTR - string in reverse order.</li>
|
|
<li>.ZSTR - null-terminated string.</li>
|
|
<li>.DSTR - Dextral Character Inverted string. The high bit of the
|
|
last byte is flipped.</li>
|
|
<li>.L1STR - string prefixed with a length byte.</li>
|
|
<li>.L2STR - string prefixed with a length word.</li>
|
|
</ul>
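<p>The byte layouts behind these string types can be sketched in Python.
This is an illustration only: it assumes plain ASCII text, and it assumes
the two-byte length used by .L2STR is stored little-endian.</p>
<pre>
```python
# Sketch of the byte layouts described in the list above.
def encode_string(kind, text):
    data = bytearray(text.encode("ascii"))
    if kind == "RSTR":
        data.reverse()                        # characters in reverse order
    elif kind == "ZSTR":
        data.append(0x00)                     # null terminator
    elif kind == "DSTR" and data:
        data[-1] ^= 0x80                      # flip high bit of last byte
    elif kind == "L1STR":
        data = bytearray([len(data)]) + data  # one-byte length prefix
    elif kind == "L2STR":
        # two-byte length prefix, assumed little-endian
        data = bytearray(len(data).to_bytes(2, "little")) + data
    return bytes(data)

for kind in ("STR", "RSTR", "ZSTR", "DSTR", "L1STR", "L2STR"):
    print(f".{kind:6}", encode_string(kind, "HI").hex(" "))
```
</pre>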
|
|
|
|
<p>You can configure the pseudo-ops to look more like what your
|
|
favorite assembler uses in the
|
|
<a href="settings.html#appset-pseudoop">Pseudo-Op</a> tab in the
|
|
application settings.</p>
|
|
|
|
<p>String constants start and end with delimiter characters, typically
|
|
single or double quotes. You can configure the delimiters differently
|
|
for each character encoding, so that it's obvious whether the text is
|
|
in ASCII or PETSCII. See the
|
|
<a href="settings.html#appset-textdelim">Text Delimiters</a> tab in
|
|
the application settings.</p>
|
|
|
|
|
|
</div>
|
|
|
|
<div id="footer">
|
|
<p><a href="index.html">Back to index</a></p>
|
|
</div>
|
|
</body>
|
|
<!-- Copyright 2018 faddenSoft -->
|
|
</html>
|