mirror of
https://github.com/fadden/6502bench.git
synced 2024-12-01 22:50:35 +00:00
44c140a8d0
It's possible to define multiple project symbols with the same address. The way to resolve the ambiguity is to explicitly reference the desired symbol from the operand. This was the default behavior of the "create project symbol" shortcut in the previous version. It's rarely necessary, and it can get ugly if you rename a project symbol, because we don't refactor operands in that case.
837 lines
35 KiB
HTML
837 lines
35 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<head>
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
<title>Intro - 6502bench SourceGen</title>
|
|
</head>
|
|
|
|
<body>
|
|
<div id="content">
|
|
<h1>6502bench SourceGen: Intro</h1>
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
<h2><a name="overview">Overview</a></h2>
|
|
|
|
<p>SourceGen converts 6502/65C02/65816 machine-language programs to
|
|
assembly-language source.</p>
|
|
|
|
<p>SourceGen has two purposes. The first is to be a really nice
|
|
disassembler for the 6502 and related CPUs. Code tracing with status
|
|
flag tracking makes it easier to separate the code from the data,
|
|
automatic formatting of character strings and filled-data areas helps
|
|
get the data regions sorted out, and modern IDE-style features like
|
|
cross-reference generation and color-highlighted bookmarks help
|
|
navigate the code while trying to figure out what it does. A
|
|
disassembler should help you understand the code, not just dump the
|
|
instructions to a text file.</p>
|
|
<p>The computer I built in 2014 has a 4GHz CPU and 8GB of RAM.
|
|
I figured we should put that kind of power to good use.</p>
|
|
|
|
<p>The second purpose is to facilitate sharing and collaboration. Most
|
|
disassemblers generate output for a specific assembler, or in a way that's
|
|
generic enough to match most any assembler; either way, you're left with
|
|
a text file in somebody's idea of the "correct" format. SourceGen keeps
|
|
everything in an assembler-neutral format, and provides numerous options
|
|
for customizing the display, so that multiple people viewing the same
|
|
project can each do so with the conventions they are accustomed to.
|
|
Code and data operands can be formatted in various numeric formats or
|
|
as symbols.
|
|
The project file uses a text format that is fairly diff-friendly, so
|
|
sharing projects through git works reasonably well. If you want source
|
|
code you can assemble, SourceGen will generate code optimized for the
|
|
assembler of your choice.</p>
|
|
|
|
<p>The sharing and collaboration ideas only work if the formatting
|
|
capabilities within SourceGen are sufficiently flexible. If you need to
|
|
generate assembly source and tweak it a bunch to express the intent of
|
|
the original code, then passing a SourceGen project around won't work.
|
|
This sort of thing is a bit outside the bounds of what a typical
|
|
disassembler does, so it remains to be seen whether SourceGen succeeds at
|
|
what it's trying to do, and also whether what it's trying to do is
|
|
something that people actually want.</p>
|
|
|
|
<p>You can get started by watching the
|
|
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and playing with the
|
|
<a href="tutorials.html">tutorials</a>.</p>
|
|
|
|
|
|
<h2><a name="fundamental-concepts">Fundamental Concepts</a></h2>
|
|
|
|
<p>The next few sections present some general concepts and terminology. The
|
|
rest of the documentation assumes you've read and understood this. It will
|
|
be helpful if you already understand something about the 6502 instruction
|
|
set and assembly-language programming, but disassembling other programs is
|
|
actually a pretty good way to learn how to code in assembly.</p>
|
|
|
|
<h2><a name="begin">About 6502 Code</a></h2>
|
|
|
|
<p>For brevity's sake, "6502 code" should be taken to mean "code for
|
|
the 6502 CPU or any of its derivatives, including but not limited to
|
|
the 65C02 and 65816". So let's talk about 6502 code.</p>
|
|
|
|
<p>Code usually arrives in a big binary blob. Some of it will be
|
|
instructions, some of it will be data, some will be empty space used
|
|
for variable storage. Part of the challenge of disassembly is
|
|
identifying which parts of the file contain which.</p>
|
|
|
|
<p>Much of the code you'll find for the 6502 was written by humans,
|
|
rather than generated by a compiler, which means it won't conform to a
|
|
standard set of conventions. However, most programmers will use
|
|
subroutines, which can be identified and analyzed in isolation. Subroutines
|
|
are often interspersed with variable storage, referred to as a "stash".
|
|
Variables and constants may be single-byte or multi-byte, the latter
|
|
typically in little-endian byte order.</p>
|
|
|
|
<p>Much of the data in a typical program is read-only, often in the
|
|
form of graphics or character string data. Graphics can be difficult
|
|
to recognize automatically, but strings can be identified with a
|
|
reasonable degree of confidence. Address tables, which are a collection
|
|
of addresses to other things, are also fairly common.</p>
|
|
|
|
<p>A simple disassembler would start at the top of the file and just
|
|
start converting bytes to instructions. Unfortunately there's no reliable
|
|
way to tell the difference between instructions, data, and variable
|
|
stashes. When the converter hits data bytes it'll start generating
|
|
instructions that won't make sense. You'll have another problem when the
|
|
data ends and code resumes: 6502 instructions are variable-length, so if
|
|
the last byte of the data area appears to be a three-byte instruction,
|
|
the first two bytes of the next instruction area will be gobbled up.</p>
|
|
|
|
<p>To make things even more difficult (sometimes deliberately), programmers
|
|
will sometimes use a trick where they "embed" an instruction
|
|
inside another instruction. This allows code to branch to two different
|
|
entry points, one of which will set a flag or load a register, and then
|
|
continue on to common code.</p>
|
|
|
|
<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
|
|
The called subroutine pulls the caller's address off the stack, uses it to
|
|
access the parameters, then pushes the address back on after modifying it to
|
|
point to an address past the inline data. This can be very confusing
|
|
for the disassembler, which will try to interpret the inline data as
|
|
instructions.</p>
|
|
|
|
<p>Sometimes code is loaded at one location, then moved to another and
|
|
executed there. If you're disassembling an executing program you don't
|
|
have to worry about this, but if you're disassembling the binary from the
|
|
loadable file on disk then you need to track the address changes. The
|
|
address is communicated to the assembler with a "pseudo-opcode", usually
|
|
something like "ORG". Other pseudo-op directives are used to define external
|
|
symbols and (for 65816 code) register widths.</p>
|
|
|
|
<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
|
|
range from $0000 to $ffff. (I'm going to write hex values with a
|
|
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
|
|
that's what 6502 systems commonly used.) The 65816 has a 24-bit address
|
|
space, but it's not contiguous -- a branch that extends past the end will
|
|
wrap around to the start of the 64KiB "bank". For 16-bit instruction
|
|
operands, the bank is identified for instruction and data addresses
|
|
by the program bank register and the data bank register, respectively.
|
|
The disassembler can't generally know the contents of the data bank
|
|
register, which makes life a bit more interesting.</p>
|
|
|
|
<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
|
|
in it. Some of the flags determine whether a conditional branch is taken
|
|
or not, which is important because some branches appear to be conditional
|
|
but actually are always or never taken in practice. The disassembler needs
|
|
to be able to figure this out so that it doesn't try to disassemble the
|
|
bytes that follow an always-taken branch.
|
|
A more significant concern is the M and X flags found on the 65802/65816,
|
|
which determine the width of the registers and of immediate load
|
|
instructions. If you don't know what state the flags are in, you can't
|
|
know whether <code>LDA #value</code> is two bytes or three, and the
|
|
disassembly of the instruction stream will come out wrong.</p>
|
|
|
|
<h3><a name="charenc">Character Encoding</a></h3>
|
|
|
|
<p>The American Standard Code for Information Interchange (ASCII) was
|
|
developed in the 1960s, and became widely used as the means for representing
|
|
text data on a computer. It's compatible with Unicode, in that the
|
|
binary representation of an ASCII string is exactly the same when
|
|
expressed as a Unicode string with UTF-8 encoding.</p>
|
|
<p>Not all 6502-based computers used ASCII, notably those from Commodore
|
|
International (e.g. PET, VIC-20, 64, 128), which used variants
|
|
collectively known as "PETSCII". PETSCII had most of the same symbols,
|
|
but rearranged them, and added a number of graphical symbols. This was
|
|
further complicated by the use of two different character sets, one of
|
|
which dropped lower-case letters in favor of additional symbols, and
|
|
the use of a separate encoding for characters stored in the text frame
|
|
buffer ("screen codes").</p>
|
|
<p>Apple II computers were based on ASCII, but tended to store bytes
|
|
with the high bit set rather than clear. This is known as "high ASCII".</p>
|
|
|
|
<p>SourceGen allows you to specify that a string is encoded with ASCII,
|
|
High ASCII, C64 PETSCII, or C64 Screen Codes. Because the goal is to
|
|
generate assembly sources for cross-assemblers, the C64 character
|
|
support is limited to the set that overlaps with ASCII.</p>
|
|
<p>For the most part only printable characters are accepted in strings,
|
|
but certain control characters are also allowed. The characters for
|
|
bell ($07), linefeed ($0a), and carriage return ($0d) are recognized as
|
|
string data, and in C64 PETSCII a number of text color and formatting
|
|
control codes are also allowed.</p>
|
|
|
|
|
|
<h2><a name="sgintro">How SourceGen Works</a></h2>
|
|
|
|
<p>SourceGen employs a partial emulation technique that traces the flow
|
|
of execution. Most of what a given instruction does isn't important;
|
|
only its effect on the flow of execution matters.</p>
|
|
|
|
<p>The code tracing has to start somewhere, so SourceGen uses "code entry
|
|
point hints" to identify places where execution may begin. By default,
|
|
a hint is placed at the start of the file. From there, the tracing process
|
|
walks through the code, pursuing all branches. In many cases, if you
|
|
mark all external entry points, SourceGen will automatically find all
|
|
executable code and separate it from variable storage and data areas.</p>
|
|
|
|
<p>As noted earlier, tracking the processor status flags can make the
|
|
analysis more accurate. Identifying situations where a branch instruction
|
|
is always or never taken avoids mis-categorizing a data region as code.
|
|
On the 65816, it's absolutely crucial to track the M/X flags, since those
|
|
affect the width of instructions. SourceGen tracks the value of the
|
|
processor flags at every instruction, blending sets of flags together when
|
|
multiple paths of execution converge.</p>
|
|
|
|
<p>Once instructions and data have been separated, the instruction operands
|
|
can be examined. Branches, loads, and stores that reference an address
|
|
that falls inside the address space covered by the file can be replaced
|
|
with a symbol. Operands that refer to addresses outside the file, such
|
|
as ROM or operating system routines, can be replaced with a symbol defined
|
|
by an equate directive.</p>
|
|
|
|
(For more details on how this works, see the
|
|
<a href="analysis.html">analysis appendix</a>.)
|
|
|
|
|
|
<h3><a name="scripts">Extension Scripts</a></h3>
|
|
|
|
<p>Extension scripts are C# source files that are compiled and
|
|
executed by SourceGen. They can be added to a project from SourceGen's
|
|
runtime data directory, or can live in the directory next to the project
|
|
file.</p>
|
|
<p>In the current implementation, scripts are only called to examine
|
|
JSR, JSL, and BRK instructions. They can format nearby bytes as inline
|
|
data, or apply symbols to operands.</p>
|
|
|
|
<p>To reduce the chances of a script causing problems, all scripts are
|
|
executed in a sandbox with severely restricted access. Notably, nothing
|
|
in the sandbox can access files, except to read files from the PluginDll
|
|
directory.</p>
|
|
<p>The PluginDll directory lives next to the SourceGen executable, and
|
|
contains all of the compiled script DLLs, as well as two pre-built
|
|
application DLLs that plugins are allowed access to. The contents
|
|
are persistent, to avoid recompiling the scripts every time SourceGen
|
|
is launched, but may be manually deleted without harm.</p>
|
|
<p>More details can be found in the
|
|
<a href="advanced.html#extension-scripts">advanced topics</a> section.</p>
|
|
|
|
|
|
<h3><a name="hints">Analyzer Hints</a></h3>
|
|
|
|
<p>Sometimes SourceGen can't automatically find the start or end of an
|
|
instruction stream, or gets confused by inline data. These situations
|
|
can be resolved by adding an appropriate hint.</p>
|
|
|
|
<p><b>Code entry point hints</b> tell the analyzer to add the offset
|
|
to the list of instruction start points. Suppose you've got a code
|
|
library that begins with jump vectors, like this:</p>
|
|
<pre>
|
|
1000: 4c0910 JMP $1009
|
|
1003: 4cef10 JMP $10ef
|
|
1006: 4c3012 JMP $1230
|
|
1009: 18 CLC
|
|
</pre>
|
|
|
|
<p>When opened with SourceGen, it will look like this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
|
|
.DD1 $4c
|
|
.DD1 $ef
|
|
.DD1 $10
|
|
.DD1 $4c
|
|
.DD1 $30
|
|
.DD1 $12
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>SourceGen doesn't see any code that jumps to $1003 or $1006, so it
|
|
assumes those are data. Further, the functions at those addresses may
|
|
also be considered data unless some bit of code reachable from L1009
|
|
calls into them. If you add a code hint to $1003 and $1006,
|
|
you'll get better results:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
JMP L10ef
|
|
JMP L1230
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>Be careful that you only add hints to the instruction opcode. If
|
|
you applied hints to the full range of bytes from $1003 to $1008, you would
|
|
end up with this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1009
|
|
JMP ▼ L10ef
|
|
BPL ▼ L1053
|
|
JMP ▼ L1230
|
|
BMI L101b
|
|
L1009 CLC
|
|
</pre>
|
|
|
|
<p>The exact set of instructions shown depends on your CPU configuration.
|
|
The problem is that the bytes in the middle of the instruction have
|
|
been marked as entry points, and SourceGen is treating them as
|
|
embedded instructions. $EF and $12 aren't valid 6502 opcodes, so
|
|
they're being ignored, but $10 is BPL and $30 is BMI. Because hinting
|
|
multiple consecutive bytes is rarely useful, SourceGen only applies code
|
|
hints to the first byte in a selected line.</p>
|
|
|
|
<p><b>Data hints</b> tell the analyzer when it should stop. For example,
|
|
suppose address $ff00 is known to always be nonzero, and the code uses
|
|
that fact to get a branch-always on the 6502:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA $ff00
|
|
BNE L1010
|
|
BRK $11
|
|
</pre>
|
|
|
|
<p>By placing a data hint on the BRK, you're telling the analyzer that
|
|
it should stop the current path of execution. (Note that this example
|
|
would actually be better solved by setting a status flag override on
|
|
the BNE that sets Z=0, so the code tracer will know it's a branch-always
|
|
and do the right thing.) It's only necessary to place a hint on the
|
|
very first (opcode) byte. Placing a data hint in the middle of what
|
|
SourceGen believes to be instruction will have no effect.</p>
|
|
<p>As with code hints, only the first byte in each selected line will
|
|
be hinted.</p>
|
|
|
|
<p><b>Inline data hints</b> identify bytes as being part of the
|
|
instruction stream, but not instructions. A simple example of this
|
|
is the ProDOS 8 call interface on the Apple II, which looks like this:</p>
|
|
<pre>
|
|
JSR $bf00
|
|
.DD1 $function
|
|
.DD2 $address
|
|
BCS BAD
|
|
</pre>
|
|
|
|
<p>The three bytes following the <code>JSR $bf00</code> should be hinted
|
|
as inline data, so that the code analyzer skips them and continues the
|
|
analysis at the <code>BCS</code>. Because you need to hint <i>every</i> byte
|
|
of inline data, all bytes in a selected line will receive hints.</p>
|
|
<p>If code branches into a region that is marked as inline data, the
|
|
branch will be ignored.</p>
|
|
|
|
|
|
<h2><a name="sgconcepts">SourceGen Concepts</a></h2>
|
|
|
|
<p>As you work on a disassembled file, formatting operands and adding
|
|
comments, everything you do is saved in the project file as "meta data".
|
|
None of the data from the file being disassembled is included. This
|
|
should allow project files to be shared without violating the copyright
|
|
of the work being disassembled. (This will vary by region. Also, note
|
|
that the mere act of disassembling a piece of software may be illegal in
|
|
some cases.)</p>
|
|
|
|
<p>To avoid mix-ups where the wrong data file is used, the file's length
|
|
and CRC are stored in the project file. SourceGen will refuse to open a
|
|
project if the data file's length and CRC don't match.</p>
|
|
|
|
<p>Most of the data in the project file is associated with a file offset.
|
|
When you create a comment, you aren't associating it with line 53, you're
|
|
associating it with the 127th byte in the file. This ensures that, as the
|
|
project evolves, the comment you wrote is always connected to the
|
|
same instruction or data item. This also means you can't have two
|
|
comments on the same line -- each offset only has room for one. By
|
|
convention, file offsets are always shown as a six-digit hexadecimal value
|
|
with a leading '+', e.g. "+0012ab". This makes it easy to distinguish
|
|
between an address and a offset.</p>
|
|
|
|
<p>Instruction and data operands can be formatted in various ways. The
|
|
formatting choice is associated with the first offset of the item. For
|
|
instructions the number of bytes in the operand is determined by the opcode
|
|
(and, on the 65816, the M/X status flags). For data items the length
|
|
can be a single byte or an entire file. Operand formats are not allowed
|
|
to overlap.</p>
|
|
|
|
<p>When an instruction or data operand references an address, we call
|
|
it a <b>numeric reference</b>. When the target address has a label, and
|
|
the operand uses that symbol, we call that a <b>symbolic reference</b>.
|
|
SourceGen tries to establish symbolic references whenever possible,
|
|
so that the generated assembly source doesn't refer to hard-coded
|
|
locations within the program. Labels are generated automatically for
|
|
the targets of numeric references.</p>
|
|
|
|
<p>As your understanding of the disassembled code develops, you will want
|
|
to add comments explaining it. SourceGen projects have three kinds of
|
|
comments:</p>
|
|
<ol>
|
|
<li>End-of-line comments. As the name implies, these appear at the
|
|
end of a line, to the right of the opcode or operand.</li>
|
|
<li>Long comments, also known as multi-line comments. These get a
|
|
line all to themselves, and may span multiple lines.</li>
|
|
<li>Notes. Like long comments, these get a line to themselves. Unlike
|
|
long comments, these do not appear in generated assembly code. They
|
|
are a way for you to leave notes to yourself, perhaps "don't forget
|
|
to figure this out" or "this is the cool part".</li>
|
|
</ol>
|
|
<p>Every file offset can have one of each.</p>
|
|
|
|
<p>Labels and comments may disappear if you associate them with a file
|
|
offset that is in the middle of a multi-byte instruction or data item.
|
|
For example, suppose you put a long comment at offset +000010, and then
|
|
mark a 50-byte region starting at offset +000008 as an ASCII string. The
|
|
comment won't be deleted, but won't be displayed either. The same thing
|
|
can happen to labels. SourceGen will try to prevent this from happening
|
|
by splitting formatted data into sub-regions at label boundaries.</p>
|
|
|
|
|
|
<h2><a name="about-symbols">All About Symbols</a></h2>
|
|
|
|
<p>A symbol has two parts, a label and a value. The label is a short
|
|
ASCII string; the value may be an 8-to-24-bit address or a numeric
|
|
constant. Symbols can be defined in different ways, and applied in
|
|
different ways.</p>
|
|
|
|
<p>The label syntax is restricted to a format that should be compatible
|
|
with most assemblers:</p>
|
|
<ul>
|
|
<li>2-32 characters long.</li>
|
|
<li>Starts with a letter or underscore.</li>
|
|
<li>Comprised of ASCII letters, numbers, and the underscore.</li>
|
|
</ul>
|
|
<p>Label comparisons are case-sensitive, as is customary for programming
|
|
languages.</p>
|
|
<p>Some assemblers restrict the set of valid labels further. For example,
|
|
64tass uses a leading underscore to indicate a local label, and reserves
|
|
a double leading underscore (e.g. <code>__label</code>) for its own
|
|
purposes. In such cases, the label will be modified to comply with the
|
|
target assembler syntax.</p>
|
|
|
|
<p>Operands may use parts of symbols. For example, if you have a label
|
|
<code>MYSTRING</code>, you can write:</p>
|
|
<pre>
|
|
MYSTRING .STR "hello"
|
|
LDA #<MYSTRING
|
|
STA $00
|
|
LDA #>MYSTRING
|
|
STA $01
|
|
</pre>
|
|
<p>See <a href="#symbol-parts">Parts and Adjustments</a> for more details.</p>
|
|
|
|
<h4><a name="symbol-types">Symbol Types</a></h4>
|
|
|
|
<p><b>Platform symbols</b> are defined in platform symbol files. These
|
|
are named with a ".sym65" extension, and have a fairly straightforward
|
|
name/value syntax. Several files for popular platforms come with SourceGen
|
|
and live in the <code>RuntimeData</code> directory. You can also create your
|
|
own, but they have to live in the same directory as the project file.</p>
|
|
|
|
<p>Platform symbols can be addresses or constants. If an instruction
|
|
or data operand references an address outside the scope of the data
|
|
file, SourceGen looks for a symbol with a matching address value. If
|
|
it finds one, it automatically uses that symbol. Symbolic constants
|
|
can be used the same way, but are not matched automatically. This makes
|
|
them useful for things like operating system function numbers.</p>
|
|
|
|
<p>If two platform symbols have the same value, the one whose label comes
|
|
first alphabetically is used. If two platform symbols have the same label,
|
|
the most recently read one is kept.</p>
|
|
|
|
<p><b>Project symbols</b> behave like platform symbols, but they are
|
|
defined in the project file itself. The editor will prevent you from
|
|
creating two symbols with the same name. If two symbols have the same
|
|
value, the one whose label comes first alphabetically is used.</p>
|
|
|
|
<p>Project symbols always have precedence over platform symbols, allowing
|
|
you to redefine symbols within a project. (You can "hide" a platform
|
|
symbol by creating a project symbol with the same name and an unused
|
|
value, such as $ffffffff.)</p>
|
|
|
|
<p><b>Local variables</b> are redefinable symbols that are organized
|
|
into tables. They're used to specify labels for zero-page and 65816
|
|
stack-relative instructions.</p>
|
|
|
|
<p><b>User labels</b> are labels added to instructions or data by the user.
|
|
The editor won't allow you to add a label that conflicts, but if you
|
|
manage to do so, the user label takes precedence over project and platform
|
|
symbols. User labels may be tagged as local, global, or global and
|
|
exported. Local vs. global is important for the label localizer, while
|
|
exported symbols can be pulled directly into other projects.</p>
|
|
|
|
<p><b>Auto labels</b> are automatically generated labels placed on
|
|
instructions or data offsets that are the target of operands. They're
|
|
formed by appending the hexadecimal address to the letter "L", with
|
|
additional characters added if some other symbol has already defined
|
|
that label. Options can be set that change the "L" to a character or
|
|
characters based on how the label is referenced, e.g. "B" for branch targets.
|
|
Auto labels are only added where they are needed, and are removed when
|
|
no longer necessary. Because auto labels may be renamed or vanish, the
|
|
editor will try to prevent you from referring to them when editing
|
|
operands.</p>
|
|
|
|
|
|
<h3><a name="local-vars">Local Variables</a></h3>
|
|
|
|
<p>Local variables are applied to instructions that have zero
|
|
page operands (<code>op ZP</code>, <code>op (ZP),Y</code>, etc.), or
|
|
65816 stack relative operands
|
|
(<code>op OFF,S</code> or <code>op (OFF,S),Y</code>). While they must be
|
|
unique relative to other kinds of labels, they don't have to be unique
|
|
with respect to earlier variable definitions. So you can define
|
|
<code>TMP .EQ $10</code>, and a few lines later define
|
|
<code>TMP .EQ $20</code>. This is handy because zero-page addresses are
|
|
often used in different ways by different parts of the program. For
|
|
example:</p>
|
|
<pre>
|
|
LDA ($00),Y
|
|
INC $02
|
|
... later ...
|
|
DEC $00
|
|
STA ($01),Y
|
|
</pre>
|
|
<p>If we had given <code>$00</code> the label <code>PTR</code> and
|
|
<code>$02</code> the label <code>COUNT</code> globally,
|
|
the second pair of instructions would look all wrong. With local
|
|
variable tables you can set <code>PTR=$00 COUNT=$02</code> for the first chunk,
|
|
and <code>COUNT=$00 PTR=$01</code> for the second chunk.</p>
|
|
|
|
<p>Local variables have a value and a width. If we create a pair of
|
|
variable definitions like this:</p>
|
|
<pre>
|
|
PTR .eq $00 ;2 bytes
|
|
COUNT .eq $02 ;1 byte
|
|
</pre>
|
|
<p>Then this:</p>
|
|
<pre>
|
|
STA $00
|
|
STX $01
|
|
LDY $02
|
|
</pre>
|
|
<p>Would become:</p>
|
|
<pre>
|
|
STA PTR
|
|
STX PTR+1
|
|
LDY COUNT
|
|
</pre>
|
|
|
|
<p>The scope of a variable definition starts at the point where it is
|
|
defined, and stops when its definition is erased. There are three
|
|
ways for a table to erase an earlier definition:</p>
|
|
<ol>
|
|
<li>Create a new definition with the same name.</li>
|
|
<li>Create a new definition that has an overlapping value. For
|
|
example, if you have a two-byte variable <code>PTR = $00</code>,
|
|
and define a one-byte variable <code>COUNT = $01</code>, the
|
|
definition for <code>PTR</code> will be cleared because its second
|
|
byte overlaps.</li>
|
|
<li>Tables have a "clear previous" flag that erases all previous
|
|
definitions. This doesn't usually cause anything to be generated in the
|
|
assembly sources; instead, it just causes SourceGen to stop using
|
|
that label.</li>
|
|
</ol>
|
|
<p>As you might expect, you're not allowed to have duplicate labels or
|
|
overlapping values in an individual table.</p>
|
|
<p>If a platform/project symbol has the same value as a local variable,
|
|
the local variable is used. If the local variable definition is cleared,
|
|
use of the platform/project symbol will resume.</p>
|
|
<p>Not all assemblers support redefinable variables. In those cases,
|
|
the symbol names will be modified to be unique (e.g. the second definition
|
|
of <code>PTR</code> becomes <code>PTR_1</code>), and variables will have
|
|
global scope.</p>
|
|
|
|
|
|
<h3><a name="weak-refs">Weak References</a></h3>
|
|
|
|
<p>Symbolic references in operands are "weak references". If the named
|
|
symbol exists, the reference is used. If the symbol can't be found, the
|
|
operand is formatted in hex instead.</p>
|
|
|
|
<p>It's important to know this when editing a project. Consider the
|
|
following trivial chunk of code:</p>
|
|
|
|
<pre>
|
|
1000: 4c0310 JMP $1003
|
|
1003: ea NOP
|
|
</pre>
|
|
|
|
<p>When you load it into SourceGen, it will be formatted like this:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1003
|
|
L1003 NOP
|
|
</pre>
|
|
|
|
<p>The analyzer found the JMP operand, and created an auto label for
|
|
address $1003. It then created a weak reference to "L1003" in the JMP
|
|
operand.</p>
|
|
|
|
<p>If you edit the JMP instruction's operand to use the symbol "FOO", the
|
|
results are probably not what you want:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP $1003
|
|
NOP
|
|
</pre>
|
|
|
|
<p>This happened because you added a weak reference to "FOO" in the operand,
|
|
but the label doesn't exist. The operand is formatted as hex. Because
|
|
there's no longer a reference to L1003, SourceGen removed the auto-label
|
|
as well.</p>
|
|
|
|
<p>If you set the label "FOO" on the NOP instruction, you'll see what you
|
|
probably wanted:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP FOO
|
|
FOO NOP
|
|
</pre>
|
|
|
|
<p>You don't actually need the explicit reference in the JMP instruction.
|
|
If you edit the JMP operand and set it back to "Default", the code will
|
|
still look the same. This is because SourceGen identified the numeric
|
|
reference, and automatically added a symbolic reference to the label on
|
|
the NOP instruction.</p>
|
|
|
|
<p>However, suppose you didn't actually want FOO as the operand label.
|
|
You can create a project symbol, BAR with the value $1003, and then edit
|
|
the operand to reference BAR instead. Your code would then look like:</p>
|
|
<pre>
|
|
BAR .EQ $1003
|
|
.ORG $1000
|
|
JMP BAR
|
|
FOO NOP
|
|
</pre>
|
|
|
|
<p>If you change the value of BAR in the project symbol file, the operand
|
|
will continue to refer to it, but with an adjustment. For example, if
|
|
you changed BAR from $1003 to $1007, the code would become:</p>
|
|
<pre>
|
|
BAR .EQ $1007
|
|
.ORG $1000
|
|
JMP BAR-4
|
|
FOO NOP
|
|
</pre>
|
|
|
|
<p>If you rename a label, all references to that label are updated. For
|
|
numeric references that happens implicitly. For explicit operand
|
|
references, the weak references are updated individually. (Modern IDEs
|
|
call this "refactoring".)</p>
|
|
<p>If you remove a label, all of the numeric references to it will
|
|
reference something else, probably a new auto label. Weak references
|
|
to the symbol will break and be formatted as hex, but will not be
|
|
removed. Similarly, removing symbols from a platform or project file
|
|
will break the reference but won't modify the operands.</p>
|
|
|
|
<h3><a name="symbol-parts">Parts and Adjustments</a></h3>
|
|
|
|
<p>Sometimes you want to use part of a label, or adjust the value slightly.
|
|
(I use "adjustment" rather than "offset" to avoid confusing it with file
|
|
offsets.) Consider the following example:</p>
|
|
<pre>
|
|
1000: a910 LDA #$10
|
|
1002: 48 PHA
|
|
1003: a906 LDA #$06
|
|
1005: 48 PHA
|
|
1006: 60 RTS
|
|
1007: 4c3aff JMP $ff3a
|
|
</pre>
|
|
|
|
<p>This pushes the address of the JMP instruction ($1007) onto the stack,
|
|
and jumps to it with the RTS instruction. However, RTS requires the
|
|
address of the byte before the target instruction, so we actually push
|
|
$1006.</p>
|
|
|
|
<p>The disassembler won't know that offset $1007 is code because nothing
|
|
appears to reference it. After adding a code hint at $1007, the project
|
|
looks like this:</p>
|
|
<pre>
|
|
LDA #$10
|
|
PHA
|
|
LDA #$06
|
|
PHA
|
|
RTS
|
|
|
|
JMP $ff3a
|
|
</pre>
|
|
|
|
<p>We set a label called "NEXT" on the JMP instruction, and then edit
|
|
the two LDA instructions to reference the high and low parts, yielding:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #>NEXT
|
|
PHA
|
|
LDA #<NEXT-1
|
|
PHA
|
|
RTS
|
|
|
|
NEXT JMP $ff3a
|
|
</pre>
|
|
|
|
<p>SourceGen will adjust label values by whatever amount is required to
|
|
generate the original value. If the adjustment seems wrong, make sure
|
|
you're selecting the right part of the symbol.</p>
|
|
|
|
<p>Different assemblers use different syntaxes to form expressions. This
|
|
is particularly noticeable in 65816 code. You can adjust how it appears
|
|
on-screen from the app settings.</p>
|
|
|
|
<h3><a name="nearby-targets">Automatic Use of Nearby Targets</a></h3>
|
|
|
|
<p>Sometimes you want to use a symbol that doesn't match up with the
|
|
operand. SourceGen tries to anticipate situations where that might be
|
|
the case, and apply adjustments for you.</p>
|
|
|
|
<p>Suppose you have the following:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #$00
|
|
STA L1010
|
|
LDA #$20
|
|
STA L1011
|
|
LDA #$e1
|
|
STA L1012
|
|
RTS
|
|
|
|
L1010 .DD1 $00
|
|
L1011 .DD1 $00
|
|
L1012 .DD1 $00
|
|
</pre>
|
|
|
|
<p>Showing stores to three different labeled addresses is fine, but
|
|
the code is actually setting up a single 24-bit address. For clarity,
|
|
you'd like the output to reflect the fact that it's a single, multi-byte
|
|
variable. So, if you set a label at $1010, SourceGen removes the
|
|
nearby auto labels, and sets the numeric references to use your label:</p>
|
|
|
|
<pre>
|
|
.ORG $1000
|
|
LDA #$00
|
|
STA DATA
|
|
LDA #$20
|
|
STA DATA+1
|
|
LDA #$e1
|
|
STA DATA+2
|
|
RTS
|
|
|
|
DATA .DD1 $00
|
|
.DD1 $00
|
|
.DD1 $00
|
|
</pre>
|
|
|
|
<p>If you decide that you really wanted each store to have its own
|
|
label, you can set labels on the other two addresses. SourceGen won't
|
|
search for alternate labels if the numeric reference target has a
|
|
user-defined label.</p>
|
|
|
|
<p>This is also used for self-modifying code. For example:</p>
|
|
<pre>
|
|
1000: a9ff LDA #$ff
|
|
1002: 8d0610 STA $1006
|
|
1005: 4900 EOR #$00
|
|
</pre>
|
|
|
|
<p>The above changes the <code>EOR #$00</code> instruction to
|
|
<code>EOR #$ff</code>. The operand target is $1006, but we can't
|
|
put a label there because it's in the middle of the instruction. So
|
|
SourceGen puts a label at $1005 and adjusts it:</p>
|
|
<pre>
|
|
LDA #$ff
|
|
STA L1005+1
|
|
L1005 EOR #$00
|
|
</pre>
|
|
|
|
<p>If you really don't like the way this works, you can disable the
|
|
search for nearby targets entirely from the
|
|
<a href="settings.html#project-properties">project properties</a>.
|
|
Self-modifying code will always be adjusted because of the limitation
|
|
on mid-instruction labels.</p>
|
|
|
|
|
|
<h2><a name="width-disambiguation">Width Disambiguation</a></h2>
|
|
|
|
<p>It's possible to interpret certain instructions in multiple ways.
|
|
For example, "LDA $0000" might be an absolute load from a 16-bit
|
|
address, or it might be a direct page load from an 8-bit address.
|
|
Humans can infer from the fact that it was written with a 4-digit address
|
|
that it's meant to be absolute, but assemblers often treat operands
|
|
purely as numbers, and would just see "LDA 0". Common practice is to
|
|
use the shortest instruction possible.</p>
|
|
<p>Every assembler seems to address the problem in a slightly different
|
|
way. Some use opcode suffixes, others use operand prefixes, some
|
|
allow both. You can configure how they appear in the
|
|
<a href="settings.html#app-settings">application settings</a>.</p>
|
|
<p>SourceGen will only add width disambiguators to opcodes or operands when
|
|
they are needed, with one exception: the opcode suffix for long
|
|
(24-bit address) operations is always applied. This is done because some
|
|
assemblers require it, insisting on "LDAL" rather than "LDA" for an
|
|
absolute long load, and because it can make 65816 code easier to read.</p>
|
|
|
|
|
|
<h2><a name="pseudo-ops">Data and Directive Pseudo-Opcodes</a></h2>
|
|
|
|
<p>There are only four assembler directives that appear in the code list:</p>
|
|
<ul>
|
|
<li>.EQ - defines a symbol's value. These are generated automatically
|
|
when an operand that matches a platform or project symbol is found.</li>
|
|
<li>.VAR - defines a local variable. These are generated for
|
|
local variable tables.</li>
|
|
<li>.ORG - changes the target address.</li>
|
|
<li>.RWID - specifies the width of the accumulator and index registers
|
|
(65816 only). Note this doesn't change the actual width, just tells
|
|
the assembler that the width has changed.</li>
|
|
</ul>
|
|
|
|
<p>Every data item is represented by a pseudo-op. Some of them may
|
|
represent hundreds of bytes and span multiple lines.</p>
|
|
<ul>
|
|
<li>.DD1, .DD2, .DD3, .DD4 - basic "define data" op. A 1-4 byte
|
|
little-endian value.</li>
|
|
<li>.DBD2, .DBD3, .DBD4 - "define big-endian data". 2-4 bytes of
|
|
big-endian data.</li>
|
|
<li>.BULK - data packed in as compact a form as the assembler allows.
|
|
Useful for chunks of graphics data.</li>
|
|
<li>.FILL - a series of identical bytes. The operand
|
|
has two parts, the byte count followed by the byte value.</li>
|
|
</ul>
|
|
|
|
<p>In addition, several pseudo-ops are defined for string constants:</p>
|
|
<ul>
|
|
<li>.STR - basic character string.</li>
|
|
<li>.RSTR - string in reverse order.</li>
|
|
<li>.ZSTR - null-terminated string.</li>
|
|
<li>.DSTR - Dextral Character Inverted string. The high bit of the
|
|
last byte is flipped.</li>
|
|
<li>.L1STR - string prefixed with a length byte.</li>
|
|
<li>.L2STR - string prefixed with a length word.</li>
|
|
</ul>
|
|
|
|
<p>You can configure the pseudo-operands to look more like what your
|
|
favorite assembler uses in the
|
|
<a href="settings.html#appset-pseudoop">Pseudo-Op</a> tab in the
|
|
application settings.</p>
|
|
|
|
<p>String constants start and end with delimiter characters, typically
|
|
single or double quotes. You can configure the delimiters differently
|
|
for each character encoding, so that it's obvious whether the text is
|
|
in ASCII or PETSCII. See the
|
|
<a href="settings.html#appset-textdelim">Text Delimiters</a> tab in
|
|
the application settings.</p>
|
|
|
|
|
|
</div>
|
|
|
|
<div id="footer">
|
|
<p><a href="index.html">Back to index</a></p>
|
|
</div>
|
|
</body>
|
|
<!-- Copyright 2018 faddenSoft -->
|
|
</html>
|