2018-09-28 17:05:11 +00:00
|
|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
|
|
|
|
<head>
|
|
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
|
|
<title>Intro - 6502bench SourceGen</title>
|
|
|
|
</head>
|
|
|
|
|
|
|
|
<body>
|
|
|
|
<div id="content">
|
|
|
|
<h1>6502bench SourceGen: Intro</h1>
|
|
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
|
|
|
|
<h2><a name="overview">Overview</a></h2>
|
|
|
|
|
|
|
|
<p>SourceGen converts 6502/65C02/65816 machine-language programs to
|
|
|
|
assembly-language source.</p>
|
|
|
|
|
|
|
|
<p>SourceGen has two purposes. The first is to be a really nice
|
|
|
|
disassembler for the 6502 and related CPUs. Code tracing with status
|
|
|
|
flag tracking makes it easier to separate the code from the data,
|
|
|
|
automatic formatting of ASCII strings and filled-data areas helps
|
|
|
|
get the data regions sorted out, and modern IDE-style features like
|
|
|
|
cross-reference generation and color-highlighted bookmarks help
|
|
|
|
navigate the code while trying to figure out what it does. A
|
|
|
|
disassembler should help you understand the code, not just dump the
|
|
|
|
instructions to a text file.</p>
|
|
|
|
<p>The computer I built in 2014 has a 4GHz CPU and 8GB of RAM.
|
2018-10-04 01:03:04 +00:00
|
|
|
I figured we should put that kind of power to good use.</p>
|
|
|
|
|
|
|
|
<p>The second purpose is to facilitate sharing and collaboration. Most
|
|
|
|
disassemblers generate output for a specific assembler, or in a way that's
|
|
|
|
generic enough to match most any assembler; either way, you're left with
|
|
|
|
a text file in somebody's idea of the "correct" format. SourceGen keeps
|
|
|
|
everything in an assembler-neutral format, and provides numerous options
|
|
|
|
for customizing the display, so that multiple people viewing the same
|
|
|
|
project can each do so with the conventions they are accustomed to.
|
|
|
|
Code and data operands can be formatted in various numeric formats or
|
|
|
|
as symbols.
|
|
|
|
The project file uses a text format that is fairly diff-friendly, so
|
|
|
|
sharing projects through git works reasonably well. If you want source
|
|
|
|
code you can assemble, SourceGen will generate code optimized for the
|
|
|
|
assembler of your choice.</p>
|
|
|
|
|
|
|
|
<p>The sharing and collaboration ideas only work if the formatting
|
|
|
|
capabilities within SourceGen are sufficiently flexible. If you need to
|
|
|
|
generate assembly source and tweak it a bunch to express the intent of
|
|
|
|
the original code, then passing a SourceGen project around won't work.
|
|
|
|
This sort of thing is a bit outside the bounds of what a typical
|
|
|
|
disassembler does, so it remains to be seen whether SourceGen succeeds at
|
|
|
|
what it's trying to do, and also whether what it's trying to do is
|
|
|
|
something that people actually want.</p>
|
|
|
|
|
|
|
|
<p>You can get started by watching the
|
|
|
|
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and playing with the
|
|
|
|
<a href="tutorials.html">tutorials</a>.</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h2><a name="fundamental-concepts">Fundamental Concepts</a></h2>
|
|
|
|
|
|
|
|
<p>The next few sections present some general concepts and terminology. The
|
|
|
|
rest of the documentation assumes you've read and understood this. It will
|
|
|
|
be helpful if you already understand something about the 6502 instruction
|
|
|
|
set and assembly-language programming, but disassembling other programs is
|
|
|
|
actually a pretty good way to learn how to code in assembly.</p>
|
|
|
|
|
|
|
|
<h2><a name="begin">About 6502 Code</a></h2>
|
|
|
|
|
|
|
|
<p>For brevity's sake, "6502 code" should be taken to mean "code for
|
|
|
|
the 6502 CPU or any of its derivatives, including but not limited to
|
|
|
|
the 65C02 and 65816". So let's talk about 6502 code.</p>
|
|
|
|
|
|
|
|
<p>Code usually arrives in a big binary blob. Some of it will be
|
|
|
|
instructions, some of it will be data, some will be empty space used
|
|
|
|
for variable storage. Part of the challenge of disassembly is
|
|
|
|
identifying which parts of the file contain which.</p>
|
|
|
|
|
|
|
|
<p>Much of the code you'll find for the 6502 was written by humans,
|
|
|
|
rather than generated by a compiler, which means it won't conform to a
|
|
|
|
standard set of conventions. However, most programmers will use
|
|
|
|
subroutines, which can be identified and analyzed in isolation. Subroutines
|
|
|
|
are often interspersed with variable storage, referred to as a "stash".
|
|
|
|
Variables may be single-byte or multi-byte, the latter typically
|
|
|
|
in little-endian byte order.</p>
|
|
|
|
|
|
|
|
<p>Much of the data in a typical program is read-only, often in the
|
|
|
|
form of graphics or ASCII string data. Graphics can be difficult
|
|
|
|
to recognize automatically, but strings can be identified with a
|
|
|
|
reasonable degree of confidence. Address tables, which are a collection
|
|
|
|
of addresses to other things, are also fairly common.</p>
|
|
|
|
|
|
|
|
<p>A simple disassembler would start at the top of the file and just
|
|
|
|
start converting bytes to instructions. Unfortunately there's no reliable
|
|
|
|
way to tell the difference between instructions, data, and variable
|
|
|
|
stashes. When the converter hits data bytes it'll start generating
|
|
|
|
instructions that won't make sense. You'll have another problem when the
|
|
|
|
data ends and code resumes: 6502 instructions are variable-length, so if
|
|
|
|
the last byte of the data area looks like the opcode of a three-byte instruction,
|
|
|
|
the first two bytes of the next instruction area will be gobbled up.</p>
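<p>The resynchronization problem can be sketched in a few lines of Python.
This is only an illustration: the <code>LENGTHS</code> table is a tiny,
hand-picked subset of the 6502 opcode map, and <code>linear_sweep</code> is a
hypothetical helper, not part of SourceGen.</p>

```python
# Instruction lengths for a handful of 6502 opcodes (illustrative subset only).
LENGTHS = {0x4C: 3, 0xA9: 2, 0x8D: 3, 0x18: 1, 0x60: 1}

def linear_sweep(data, start=0):
    """Return the offsets a naive top-to-bottom disassembler treats as opcodes."""
    offsets = []
    pos = start
    while pos < len(data):
        offsets.append(pos)
        pos += LENGTHS.get(data[pos], 1)  # unknown bytes are treated as 1-byte
    return offsets

# One data byte ($4C, which looks like JMP) followed by real code:
#   +1: LDA #$00    +3: CLC    +4: RTS
blob = bytes([0x4C, 0xA9, 0x00, 0x18, 0x60])
naive = linear_sweep(blob)      # starts on the data byte, swallows the LDA
exact = linear_sweep(blob, 1)   # starts where the code really begins
print(naive, exact)
```

<p>Starting on the data byte, the sweep never sees the instruction at +1; it
only falls back into step at +3.</p>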
|
|
|
|
|
|
|
|
<p>Some programmers will use a trick where they "embed" an instruction
|
|
|
|
inside another instruction. This allows code to branch to two different
|
|
|
|
entry points, one of which will set a flag or load a register, and then
|
|
|
|
continue on to common code.</p>
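<p>One well-known form of this trick hides a two-byte load inside the operand
of a three-byte BIT instruction. The Python sketch below decodes the same bytes
from both entry points; the <code>OPS</code> table and <code>decode</code>
helper are assumptions made for the example, and only cover the opcodes it
uses.</p>

```python
# opcode -> (description, length); illustrative subset of the 6502 opcode map.
OPS = {0xA9: ("LDA #imm", 2), 0x2C: ("BIT abs", 3), 0x85: ("STA zp", 2)}

def decode(data, pos):
    """Decode sequentially from pos, returning (offset, description) pairs."""
    out = []
    while pos < len(data):
        name, length = OPS[data[pos]]
        out.append((pos, name))
        pos += length
    return out

# +0: LDA #$00   +2: BIT $01A9 (its operand bytes are "LDA #$01")   +5: STA $10
code = bytes([0xA9, 0x00, 0x2C, 0xA9, 0x01, 0x85, 0x10])
first_entry = decode(code, 0)    # falls through the BIT, keeping A = $00
second_entry = decode(code, 3)   # enters inside the BIT operand, A = $01
print(first_entry, second_entry)
```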
|
|
|
|
|
|
|
|
<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
|
|
|
|
The called subroutine pulls the return address off the stack, uses it to
access the parameters, then pushes the address back on after modifying it to
point past the inline data. This can be very confusing
|
|
|
|
for the disassembler, which will try to interpret the inline data as
|
|
|
|
instructions.</p>
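<p>The address arithmetic behind inline data can be modeled as follows. This
is a sketch, not SourceGen code: <code>inline_params</code> is a hypothetical
helper, and the example bytes are made up. It relies on the fact that JSR
pushes the address of its own last byte, and that RTS adds one to whatever
address it pulls.</p>

```python
def inline_params(memory, jsr_addr, count):
    """Model a subroutine that consumes `count` inline bytes placed after a JSR."""
    ret = jsr_addr + 2                            # address pushed by the 3-byte JSR
    params = bytes(memory[ret + 1 : ret + 1 + count])
    resume = (ret + count) + 1                    # RTS adds 1 to the adjusted address
    return params, resume

memory = bytearray(0x10000)
# $2000: JSR $bf00, then three inline bytes, then the next instruction at $2006
memory[0x2000:0x2006] = bytes([0x20, 0x00, 0xBF, 0xC8, 0x34, 0x12])
params, resume = inline_params(memory, 0x2000, 3)
print(params.hex(), hex(resume))
```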
|
|
|
|
|
|
|
|
<p>Sometimes code is loaded at one location, then moved to another and
|
|
|
|
executed there. If you're disassembling an executing program you don't
|
|
|
|
have to worry about this, but if you're disassembling the binary from the
|
|
|
|
loadable file on disk then you need to track the address changes. The
|
|
|
|
address is communicated to the assembler with a "pseudo-opcode", usually
|
|
|
|
something like "ORG". Other pseudo-op directives are used to define external
|
|
|
|
symbols and (for 65816 code) register widths.</p>
|
|
|
|
|
|
|
|
<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
|
|
|
|
range from $0000 to $ffff. (I'm going to write hex values with a
|
|
|
|
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
|
|
|
|
that's what 6502 systems commonly used.) The 65816 has a 24-bit address
|
|
|
|
space, but it's not contiguous -- a branch that extends past the end will
|
|
|
|
wrap around to the start of the 64KiB "bank". For 16-bit instruction
|
|
|
|
operands, the bank is identified for instruction and data addresses
|
|
|
|
by the program bank register and the data bank register, respectively.
|
|
|
|
The disassembler can't generally know the contents of the data bank
|
|
|
|
register, which makes life a bit more interesting.</p>
|
|
|
|
|
|
|
|
<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
|
|
|
|
in it. Some of the flags determine whether a conditional branch is taken
|
|
|
|
or not, which is important because some branches appear to be conditional
|
|
|
|
but actually are always or never taken in practice. The disassembler needs
|
|
|
|
to be able to figure this out so that it doesn't try to disassemble the
|
|
|
|
bytes that follow an always-taken branch.
|
|
|
|
A more significant concern is the M and X flags found on the 65802/65816,
|
|
|
|
which determine the width of the registers and of immediate load
|
|
|
|
instructions. If you don't know what state the flags are in, you can't
|
|
|
|
know whether <code>LDA #value</code> is two bytes or three, and the
|
|
|
|
disassembly of the instruction stream will come out wrong.</p>
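<p>As a rough sketch of the width rule for 65816 native mode, the Python
function below returns the length of an immediate-mode instruction given the
M and X flags. The function name and the way flags are passed are assumptions
for the example; the opcode values are the standard immediate forms of
ORA/AND/EOR/ADC/BIT/LDA/CMP/SBC and LDY/LDX/CPY/CPX.</p>

```python
# Immediate-mode opcodes whose operand width follows the M flag (accumulator)
# or the X flag (index registers) on the 65816.
ACC_IMMEDIATE = {0x09, 0x29, 0x49, 0x69, 0x89, 0xA9, 0xC9, 0xE9}
IDX_IMMEDIATE = {0xA0, 0xA2, 0xC0, 0xE0}

def immediate_length(opcode, m_flag, x_flag):
    """Total instruction length in bytes; a flag value of 1 selects 8-bit width."""
    if opcode in ACC_IMMEDIATE:
        return 2 if m_flag else 3
    if opcode in IDX_IMMEDIATE:
        return 2 if x_flag else 3
    raise ValueError("not a width-dependent immediate opcode")

print(immediate_length(0xA9, m_flag=1, x_flag=1))  # LDA #value, 8-bit accumulator
print(immediate_length(0xA9, m_flag=0, x_flag=1))  # LDA #value, 16-bit accumulator
```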
|
|
|
|
|
|
|
|
|
|
|
|
<h2><a name="sgintro">How SourceGen Works</a></h2>
|
|
|
|
|
|
|
|
<p>SourceGen employs a partial emulation technique that traces the flow
|
|
|
|
of execution. Most of what a given instruction does isn't important;
|
|
|
|
only its effect on the flow of execution matters.</p>
|
|
|
|
|
|
|
|
<p>The code tracing has to start somewhere, so SourceGen uses "code entry
|
|
|
|
point hints" to identify places where execution may begin. By default,
|
|
|
|
a hint is placed at the start of the file. From there, the tracing process
|
|
|
|
walks through the code, pursuing all branches. In many cases, if you
|
|
|
|
mark all external entry points, SourceGen will automatically find all
|
|
|
|
executable code and separate it from variable storage and data areas.</p>
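<p>The general idea of tracing from entry points can be sketched with a
worklist. This is a simplified illustration, not SourceGen's analyzer: the
<code>OPS</code> table covers five opcodes, conditional branches and status
flags are ignored, and <code>trace</code> is a hypothetical name.</p>

```python
# opcode -> (length, kind); illustrative subset of the 6502 instruction set.
OPS = {
    0x4C: (3, "jmp"),   # JMP abs: execution continues only at the target
    0x20: (3, "jsr"),   # JSR abs: trace the target and the following instruction
    0x60: (1, "stop"),  # RTS: this path of execution ends
    0x18: (1, "next"),  # CLC
    0xEA: (1, "next"),  # NOP
}

def trace(data, org, entry_offsets):
    """Return sorted offsets of every instruction start reachable from the entries."""
    found = set()
    work = list(entry_offsets)
    while work:
        off = work.pop()
        if off in found or not 0 <= off < len(data):
            continue
        op = OPS.get(data[off])
        if op is None or off + op[0] > len(data):
            continue                      # unknown opcode or truncated instruction
        found.add(off)
        length, kind = op
        if kind in ("jmp", "jsr"):
            target = data[off + 1] | (data[off + 2] << 8)  # little-endian operand
            work.append(target - org)
        if kind in ("next", "jsr"):
            work.append(off + length)
    return sorted(found)

# $1000: JMP $1009  $1003: JMP $100b  $1006: JMP $100c
# $1009: CLC  $100a: RTS  $100b: RTS  $100c: RTS
data = bytes([0x4C, 0x09, 0x10, 0x4C, 0x0B, 0x10, 0x4C, 0x0C, 0x10,
              0x18, 0x60, 0x60, 0x60])
default_only = trace(data, 0x1000, [0])
all_entries = trace(data, 0x1000, [0, 3, 6])
print(default_only, all_entries)
```

<p>With only the default entry point, the second and third jump vectors and
their targets are never visited, so they would be left as data.</p>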
|
|
|
|
|
|
|
|
<p>As noted earlier, tracking the processor status flags can make the
|
|
|
|
analysis more accurate. Identifying situations where a branch instruction
|
|
|
|
is always or never taken avoids mis-categorizing a data region as code.
|
|
|
|
On the 65816, it's absolutely crucial to track the M/X flags, since those
|
|
|
|
affect the width of instructions. SourceGen tracks the value of the
|
|
|
|
processor flags at every instruction, blending sets of flags together when
|
|
|
|
multiple paths of execution converge.</p>
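<p>One plausible way to blend flag sets, shown here as an assumption rather
than a description of SourceGen's internals, is to treat each flag as 0, 1, or
unknown, and to keep a value only when both incoming paths agree on it.</p>

```python
def blend(flags_a, flags_b):
    """Merge two flag states; each flag is 0, 1, or None (unknown)."""
    return {name: flags_a[name] if flags_a[name] == flags_b[name] else None
            for name in flags_a}

path1 = {"C": 1, "Z": 0, "M": 1}   # flags along one path into an instruction
path2 = {"C": 1, "Z": 1, "M": 1}   # flags along another path to the same place
merged = blend(path1, path2)
print(merged)
```

<p>Here C and M remain known after the merge, while Z becomes unknown, so a
later BEQ or BNE has to be treated as a genuine two-way branch.</p>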
|
|
|
|
|
|
|
|
<p>Once instructions and data have been separated, the instruction operands
|
|
|
|
can be examined. Branches, loads, and stores that reference an address
|
|
|
|
that falls inside the address space covered by the file can be replaced
|
|
|
|
with a symbol. Operands that refer to addresses outside the file, such
|
|
|
|
as ROM or operating system routines, can be replaced with a symbol defined
|
|
|
|
by an equate directive.</p>
|
|
|
|
|
|
|
|
<p>(For more details on how this works, see the
<a href="analysis.html">analysis appendix</a>.)</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h3><a name="scripts">Extension Scripts</a></h3>
|
|
|
|
|
|
|
|
<p>Extension scripts are C# source files that are compiled and
|
|
|
|
executed by SourceGen. They can be added to a project from SourceGen's
|
|
|
|
runtime data directory, or can live in the directory next to the project
|
|
|
|
file.</p>
|
|
|
|
<p>In the current implementation, scripts are only called to examine
|
|
|
|
JSR/JSL instructions. They can format nearby bytes as inline data, or
|
|
|
|
apply symbols to operands.</p>
|
|
|
|
|
|
|
|
<p>To reduce the chances of a script causing problems, all scripts are
|
|
|
|
executed in a sandbox with severely restricted access. Notably, nothing
|
|
|
|
in the sandbox can access files, except to read files from the PluginDll
|
|
|
|
directory.</p>
|
|
|
|
<p>The PluginDll directory lives next to the SourceGen executable, and
|
|
|
|
contains all of the compiled script DLLs, as well as two pre-built
|
|
|
|
application DLLs that plugins are allowed access to. The contents
|
|
|
|
are persistent, to avoid recompiling the scripts every time SourceGen
|
|
|
|
is launched, but may be manually deleted without harm.</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h3><a name="hints">Analyzer Hints</a></h3>
|
|
|
|
|
|
|
|
<p>Sometimes SourceGen can't automatically find the start or end of an
|
|
|
|
instruction stream, or gets confused by inline data. These situations
|
|
|
|
can be resolved by adding an appropriate hint.</p>
|
|
|
|
|
|
|
|
<p><b>Code entry point hints</b> tell the analyzer to add the offset
|
|
|
|
to the list of instruction start points. Suppose you've got a code
|
|
|
|
library that begins with jump vectors, like this:</p>
|
|
|
|
<pre>
|
|
|
|
1000: 4c0910 JMP $1009
|
|
|
|
1003: 4cef10 JMP $10ef
|
|
|
|
1006: 4c3012 JMP $1230
|
|
|
|
1009: 18 CLC
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>When opened with SourceGen, it will look like this:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP L1009
|
|
|
|
|
|
|
|
.DD1 $4c
|
|
|
|
.DD1 $ef
|
|
|
|
.DD1 $10
|
|
|
|
.DD1 $4c
|
|
|
|
.DD1 $30
|
|
|
|
.DD1 $12
|
|
|
|
L1009 CLC
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>SourceGen doesn't see any code that jumps to $1003 or $1006, so it
|
|
|
|
assumes those are data. Further, the functions at those addresses may
|
|
|
|
also be considered data unless some bit of code reachable from L1009
|
|
|
|
calls into them. If you add a code hint to $1003 and $1006,
|
|
|
|
you'll get better results:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP L1009
|
|
|
|
JMP L10ef
|
|
|
|
JMP L1230
|
|
|
|
L1009 CLC
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>Be careful that you only add hints to the instruction opcode. If
|
2018-10-08 04:51:15 +00:00
|
|
|
you applied hints to the full range of bytes from $1003 to $1008, you would
|
|
|
|
end up with this:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP L1009
|
|
|
|
JMP ⏩ L10ef
|
|
|
|
BPL ⏩ L1053
|
|
|
|
JMP ⏩ L1230
|
|
|
|
BMI L101b
|
|
|
|
L1009 CLC
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>The exact set of instructions shown depends on your CPU configuration.
|
|
|
|
The problem is that the bytes in the middle of the instruction have
|
|
|
|
been marked as entry points, and SourceGen is treating them as
|
|
|
|
embedded instructions. $EF and $12 aren't valid 6502 opcodes, so
|
|
|
|
they're being ignored, but $10 is BPL and $30 is BMI. Because hinting
|
|
|
|
multiple consecutive bytes is rarely useful, SourceGen only applies code
|
|
|
|
hints to the first byte in a selected line.</p>
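<p>The stray branch targets come straight from 6502 relative addressing: the
operand is a signed 8-bit displacement from the address of the next
instruction. A small Python check (the helper name is made up for this
example):</p>

```python
def branch_target(addr, operand):
    """Target of a 2-byte relative branch located at addr with the given operand byte."""
    offset = operand - 0x100 if operand >= 0x80 else operand  # sign-extend
    return (addr + 2 + offset) & 0xFFFF

# $1005: 10 4c  is read as BPL with operand $4c
# $1007: 30 12  is read as BMI with operand $12
bpl = branch_target(0x1005, 0x4C)
bmi = branch_target(0x1007, 0x12)
print(f"L{bpl:04x} L{bmi:04x}")
```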
|
|
|
|
|
|
|
|
<p><b>Data hints</b> tell the analyzer when it should stop. For example,
|
|
|
|
suppose address $ff00 is known to always be nonzero, and the code uses
|
|
|
|
that fact to get a branch-always on the 6502:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
LDA $ff00
|
|
|
|
BNE L1010
|
|
|
|
BRK $11
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>By placing a data hint on the BRK, you're telling the analyzer that
|
|
|
|
it should stop the current path of execution. (Note that this example
|
|
|
|
would actually be better solved by setting a status flag override on
|
|
|
|
the BNE that sets Z=0, so the code tracer will know it's a branch-always
|
|
|
|
and do the right thing.) It's only necessary to place a hint on the
|
|
|
|
very first (opcode) byte. Placing a data hint in the middle of what
SourceGen believes to be an instruction will have no effect.</p>
|
|
|
|
<p>As with code hints, only the first byte in each selected line will
|
|
|
|
be hinted.</p>
|
|
|
|
|
|
|
|
<p><b>Inline data hints</b> identify bytes as being part of the
|
|
|
|
instruction stream, but not instructions. A simple example of this
|
|
|
|
is the ProDOS 8 call interface on the Apple II, which looks like this:</p>
|
|
|
|
<pre>
|
|
|
|
JSR $bf00
|
|
|
|
.DD1 $function
|
|
|
|
.DD2 $address
|
|
|
|
BCS BAD
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>The three bytes following the <code>JSR $bf00</code> should be hinted
|
|
|
|
as inline data, so that the code analyzer skips them and continues the
|
|
|
|
analysis at the <code>BCS</code>. Because you need to hint <i>every</i> byte
|
|
|
|
of inline data, all bytes in a selected line will receive hints.</p>
|
|
|
|
<p>If code branches into a region that is marked as inline data, the
|
|
|
|
branch will be ignored.</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h2><a name="sgconcepts">SourceGen Concepts</a></h2>
|
|
|
|
|
|
|
|
<p>As you work on a disassembled file, formatting operands and adding
|
|
|
|
comments, everything you do is saved in the project file as "metadata".
|
|
|
|
None of the data from the file being disassembled is included. This
|
|
|
|
should allow project files to be shared without violating the copyright
|
|
|
|
of the work being disassembled. (This will vary by region. Also, note
|
|
|
|
that the mere act of disassembling a piece of software may be illegal in
|
|
|
|
some cases.)</p>
|
|
|
|
|
|
|
|
<p>To avoid mix-ups where the wrong data file is used, the file's length
|
|
|
|
and CRC are stored in the project file. SourceGen will refuse to open a
|
|
|
|
project if the data file's length and CRC don't match.</p>
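<p>A check of this kind might look like the Python sketch below. The text
above doesn't say which CRC is used, so CRC-32 (via <code>zlib.crc32</code>) is
an assumption, and <code>matches_project</code> is a hypothetical name.</p>

```python
import zlib

def matches_project(data, expected_length, expected_crc):
    """True if the data file has the length and CRC recorded in the project."""
    return len(data) == expected_length and zlib.crc32(data) == expected_crc

data = bytes([0x4C, 0x03, 0x10, 0xEA])        # JMP $1003 / NOP
recorded_length = len(data)
recorded_crc = zlib.crc32(data)
same_file = matches_project(data, recorded_length, recorded_crc)
wrong_file = matches_project(data + b"\x00", recorded_length, recorded_crc)
print(same_file, wrong_file)
```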
|
|
|
|
|
|
|
|
<p>Most of the data in the project file is associated with a file offset.
|
|
|
|
When you create a comment, you aren't associating it with line 53, you're
|
|
|
|
associating it with the 127th byte in the file. This ensures that, as the
|
|
|
|
project evolves, the comment you wrote is always connected to the
|
|
|
|
same instruction or data item. This also means you can't have two
|
|
|
|
comments on the same line -- each offset only has room for one. By
|
|
|
|
convention, file offsets are always shown as a six-digit hexadecimal value
|
|
|
|
with a leading '+', e.g. "+0012ab". This makes it easy to distinguish
|
|
|
|
between an address and an offset.</p>
|
|
|
|
|
|
|
|
<p>Instruction and data operands can be formatted in various ways. The
|
|
|
|
formatting choice is associated with the first offset of the item. For
|
|
|
|
instructions the number of bytes in the operand is determined by the opcode
|
|
|
|
(and, on the 65816, the M/X status flags). For data items the length
|
|
|
|
can be a single byte or an entire file. Operand formats are not allowed
|
|
|
|
to overlap.</p>
|
|
|
|
|
|
|
|
<p>When an instruction or data operand references an address, we call
|
|
|
|
it a <b>numeric reference</b>. When the target address has a label, and
|
|
|
|
the operand uses that symbol, we call that a <b>symbolic reference</b>.
|
|
|
|
SourceGen tries to establish symbolic references whenever possible,
|
|
|
|
so that the generated assembly source doesn't refer to hard-coded
|
|
|
|
locations within the program. Labels are generated automatically for
|
|
|
|
the targets of numeric references.</p>
|
|
|
|
|
|
|
|
<p>As your understanding of the disassembled code develops, you will want
|
|
|
|
to add comments explaining it. SourceGen projects have three kinds of
|
|
|
|
comments:</p>
|
|
|
|
<ol>
|
|
|
|
<li>End-of-line comments. As the name implies, these appear at the
|
|
|
|
end of a line, to the right of the opcode or operand.
|
|
|
|
<li>Long comments, also known as multi-line comments. These get a
|
|
|
|
line all to themselves, and may span multiple lines.
|
|
|
|
<li>Notes. Like long comments, these get a line to themselves. Unlike
|
|
|
|
long comments, these do not appear in generated assembly code. They
|
|
|
|
are a way for you to leave notes to yourself, perhaps "don't forget
|
|
|
|
to figure this out" or "this is the cool part".
|
|
|
|
</ol>
|
|
|
|
<p>Every file offset can have one of each.</p>
|
|
|
|
|
|
|
|
<p>Labels and comments may disappear if you associate them with a file
|
|
|
|
offset that is in the middle of a multi-byte instruction or data item.
|
|
|
|
For example, suppose you put a long comment at offset +000010, and then
|
|
|
|
mark a 50-byte region starting at offset +000008 as an ASCII string. The
|
|
|
|
comment won't be deleted, but won't be displayed either. The same thing
|
|
|
|
can happen to labels. SourceGen will try to prevent this from happening
|
|
|
|
by splitting formatted data into sub-regions at label boundaries.</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h2><a name="about-symbols">All About Symbols</a></h2>
|
|
|
|
|
|
|
|
<p>A symbol has two parts, a label and a value. The label is a short
|
|
|
|
ASCII string; the value may be an 8-to-24-bit address or a numeric
|
|
|
|
constant. Symbols can be defined in different ways, and applied in
|
|
|
|
different ways.</p>
|
|
|
|
|
|
|
|
<p>The label syntax is restricted to a format that should be compatible
|
|
|
|
with most assemblers:</p>
|
|
|
|
<ul>
|
|
|
|
<li>2-32 characters long.
|
|
|
|
<li>Starts with a letter or underscore.
|
|
|
|
<li>Composed of ASCII letters, numbers, and the underscore.
|
|
|
|
</ul>
|
|
|
|
<p>Label comparisons are case-sensitive, as is customary for programming
|
|
|
|
languages.</p>
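<p>The label rules listed above translate directly into a regular expression.
The Python below is a sketch of such a check; the function name is made up
for the example.</p>

```python
import re

# First character: letter or underscore; then 1-31 more letters, digits, or
# underscores, for a total length of 2-32.
LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{1,31}")

def is_valid_label(label):
    """True if the label follows the syntax rules listed above."""
    return LABEL_RE.fullmatch(label) is not None

for candidate in ("FOO", "_tmp", "L1009", "X", "1ABC", "MY-LABEL"):
    print(candidate, is_valid_label(candidate))
```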
|
|
|
|
|
|
|
|
<p><b>Platform symbols</b> are defined in platform symbol files. These
|
|
|
|
are named with a ".sym65" extension, and have a fairly straightforward
|
|
|
|
name/value syntax. Several files for popular platforms come with SourceGen
|
|
|
|
and live in the <code>RuntimeData</code> directory. You can also create your
|
|
|
|
own, but they have to live in the same directory as the project file.</p>
|
|
|
|
|
|
|
|
<p>Platform symbols can be addresses or constants. If an instruction
|
|
|
|
or data operand references an address outside the scope of the data
|
|
|
|
file, SourceGen looks for a symbol with a matching address value. If
|
|
|
|
it finds one, it automatically uses that symbol. Symbolic constants
|
|
|
|
can be used the same way, but are not matched automatically. This makes
|
|
|
|
them useful for things like operating system function numbers.</p>
|
|
|
|
|
|
|
|
<p>If two platform symbols have the same value, the one whose label comes
|
|
|
|
first alphabetically is used. If two platform symbols have the same label,
|
|
|
|
the most recently read one is kept.</p>
|
|
|
|
|
|
|
|
<p><b>Project symbols</b> behave like platform symbols, but they are
|
|
|
|
defined in the project file itself. The editor will prevent you from
|
|
|
|
creating two symbols with the same name. If two symbols have the same
|
|
|
|
value, the one whose label comes first alphabetically is used.</p>
|
|
|
|
|
|
|
|
<p>Project symbols always have precedence over platform symbols, allowing
|
|
|
|
you to redefine symbols within a project. (You can "hide" a platform
|
|
|
|
symbol by creating a project symbol with the same name and an unused
|
|
|
|
value, such as $ffffffff.)</p>
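<p>The precedence rules for address symbols can be summarized in a short
Python sketch. The dictionaries and the <code>lookup</code> function are
hypothetical, and Python's default string ordering stands in for
"alphabetically", which is an assumption about how ties are broken.</p>

```python
def lookup(address, project, platform):
    """Pick the label for an external address: project symbols win over platform
    symbols, and the alphabetically first label wins among equal values."""
    for table in (project, platform):
        names = sorted(label for label, value in table.items() if value == address)
        if names:
            return names[0]
    return None

platform = {"COUT": 0xFDED, "CHAROUT": 0xFDED}   # two labels for one address
print(lookup(0xFDED, {}, platform))
print(lookup(0xFDED, {"PRINT": 0xFDED}, platform))
```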
|
|
|
|
|
|
|
|
<p><b>User labels</b> are labels added to instructions or data by the user.
|
|
|
|
The editor won't allow you to add a label that conflicts, but if you
|
|
|
|
manage to do so, the user label takes precedence over project and platform
|
|
|
|
symbols. User labels may be tagged as local, global, or global and
|
|
|
|
exported. Local vs. global is important for the label localizer, while
|
|
|
|
exported symbols can be pulled directly into other projects.</p>
|
|
|
|
|
|
|
|
<p><b>Auto labels</b> are automatically generated labels placed on
|
|
|
|
instructions or data offsets that are the target of operands. They're
|
|
|
|
formed by appending the hexadecimal address to the letter "L", with
|
|
|
|
additional characters added if some other symbol has already defined
|
|
|
|
that label. Auto labels are only added where they are needed. Because
|
|
|
|
auto labels may be redefined or disappear, the editor will try to prevent
|
|
|
|
you from referring to them when editing operands.</p>
|
|
|
|
|
|
|
|
<p>Operands may use parts of symbols. For example, if you have a label
|
|
|
|
<code>MYSTRING</code>, you can write:</p>
|
|
|
|
<pre>
|
|
|
|
MYSTRING .STR "hello"
|
|
|
|
LDA #<MYSTRING
|
|
|
|
STA $00
|
|
|
|
LDA #>MYSTRING
|
|
|
|
STA $01
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>The format editor allows you to choose which part of the symbol's
|
|
|
|
value to use. If the value doesn't match exactly, an adjustment will
|
|
|
|
be applied.</p>
|
|
|
|
|
|
|
|
<h3><a name="weak-refs">Weak References</a></h3>
|
|
|
|
|
|
|
|
<p>Symbolic references in operands are "weak references". If the named
|
|
|
|
symbol exists, the reference is used. If the symbol can't be found, the
|
|
|
|
operand is formatted in hex instead.</p>
|
|
|
|
|
|
|
|
<p>It's important to know this when editing a project. Consider the
|
|
|
|
following trivial chunk of code:</p>
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
1000: 4c0310 JMP $1003
|
|
|
|
1003: ea NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>When you load it into SourceGen, it will be formatted like this:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP L1003
|
|
|
|
L1003 NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>The analyzer found the JMP operand, and created an auto label for
|
|
|
|
address $1003. It then created a weak reference to "L1003" in the JMP
|
|
|
|
operand.</p>
|
|
|
|
|
|
|
|
<p>If you edit the JMP instruction's operand to use the symbol "FOO", the
|
|
|
|
results are probably not what you want:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP $1003
|
|
|
|
NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>This happened because you added a weak reference to "FOO" in the operand,
|
|
|
|
but the label doesn't exist. The operand is formatted as hex. Because
|
|
|
|
there's no longer a reference to L1003, SourceGen removed the auto label
|
|
|
|
as well.</p>
|
|
|
|
|
|
|
|
<p>If you set the label "FOO" on the NOP instruction, you'll see what you
|
|
|
|
probably wanted:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP FOO
|
|
|
|
FOO NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>You don't actually need the explicit reference in the JMP instruction.
|
|
|
|
If you edit the JMP operand and set it back to "Default", the code will
|
|
|
|
still look the same. This is because SourceGen identified the numeric
|
|
|
|
reference, and automatically added a symbolic reference to the label on
|
|
|
|
the NOP instruction.</p>
|
|
|
|
|
|
|
|
<p>However, suppose you didn't actually want FOO as the operand label.
|
|
|
|
You can create a project symbol, BAR, with the value $1003, and then edit
|
|
|
|
the operand to reference BAR instead. Your code would then look like:</p>
|
|
|
|
<pre>
|
|
|
|
BAR .EQ $1003
|
|
|
|
.ORG $1000
|
|
|
|
JMP BAR
|
|
|
|
FOO NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>If you change the value of BAR in the project symbol file, the operand
|
|
|
|
will continue to refer to it, but with an adjustment. For example, if
|
|
|
|
you changed BAR from $1003 to $1007, the code would become:</p>
|
|
|
|
<pre>
|
|
|
|
BAR .EQ $1007
|
|
|
|
.ORG $1000
|
|
|
|
JMP BAR-4
|
|
|
|
FOO NOP
|
|
|
|
</pre>
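<p>The adjustment is just the difference between the value the operand needs
and the symbol's current value. A Python sketch, using a made-up
<code>format_operand</code> helper and plain decimal adjustments:</p>

```python
def format_operand(label, symbol_value, operand_value):
    """Express operand_value in terms of a symbol, adding an adjustment if needed."""
    adjustment = operand_value - symbol_value
    if adjustment == 0:
        return label
    return f"{label}{adjustment:+d}"

print(format_operand("BAR", 0x1003, 0x1003))  # symbol matches the operand exactly
print(format_operand("BAR", 0x1007, 0x1003))  # symbol moved; operand is unchanged
```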
|
|
|
|
|
|
|
|
<p>If you rename a label, all references to that label are updated. For
|
|
|
|
numeric references that happens implicitly. For explicit operand
|
|
|
|
references, the weak references are updated individually. (Modern IDEs
|
|
|
|
call this "refactoring".)</p>
|
|
|
|
<p>If you remove a label, all of the numeric references to it will
|
|
|
|
reference something else, probably a new auto label. Weak references
|
|
|
|
to the symbol will break and be formatted as hex, but will not be
|
|
|
|
removed. Similarly, removing symbols from a platform or project file
|
|
|
|
will break the reference but won't modify the operands.</p>
|
|
|
|
|
|
|
|
<h3><a name="symbol-parts">Parts and Adjustments</a></h3>
|
|
|
|
|
|
|
|
<p>Sometimes you want to use part of a label, or adjust the value slightly.
|
|
|
|
(I use "adjustment" rather than "offset" to avoid confusing it with file
|
|
|
|
offsets.) Consider the following example:</p>
|
|
|
|
<pre>
|
|
|
|
1000: a910 LDA #$10
|
|
|
|
1002: 48 PHA
|
|
|
|
1003: a906 LDA #$06
|
|
|
|
1005: 48 PHA
|
|
|
|
1006: 60 RTS
|
|
|
|
1007: 4c3aff JMP $ff3a
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>This pushes the address of the JMP instruction ($1007) onto the stack,
|
|
|
|
and jumps to it with the RTS instruction. However, RTS requires the
|
|
|
|
address of the byte before the target instruction, so we actually push
|
|
|
|
$1006.</p>
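<p>The arithmetic can be checked with a couple of lines of Python (the
function name is invented for this example):</p>

```python
def rts_push_bytes(target):
    """Bytes to push (high first, then low) so that RTS transfers control to target."""
    value = (target - 1) & 0xFFFF   # RTS adds 1 to the address it pulls
    return value >> 8, value & 0xFF

high, low = rts_push_bytes(0x1007)
print(f"LDA #${high:02x} / PHA / LDA #${low:02x} / PHA / RTS")
```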
|
|
|
|
|
|
|
|
<p>The disassembler won't know that address $1007 is code because nothing
|
|
|
|
appears to reference it. After adding a code hint at $1007, the project
|
|
|
|
looks like this:</p>
|
|
|
|
<pre>
|
|
|
|
LDA #$10
|
|
|
|
PHA
|
|
|
|
LDA #$06
|
|
|
|
PHA
|
|
|
|
RTS
|
|
|
|
|
|
|
|
JMP $ff3a
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>We set a label called "NEXT" on the JMP instruction, and then edit
|
|
|
|
the two LDA instructions to reference the high and low parts, yielding:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
LDA #>NEXT
|
|
|
|
PHA
|
|
|
|
LDA #<NEXT-1
|
|
|
|
PHA
|
|
|
|
RTS
|
|
|
|
|
|
|
|
NEXT JMP $ff3a
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>SourceGen will adjust label values by whatever amount is required to
|
|
|
|
generate the original value. If the adjustment seems wrong, make sure
|
|
|
|
you're selecting the right part of the symbol.</p>
|
|
|
|
|
|
|
|
<p>Different assemblers use different syntaxes to form expressions. This
|
|
|
|
is particularly noticeable in 65816 code. You can adjust how it appears
|
|
|
|
on-screen from the app settings.</p>
|
|
|
|
|
|
|
|
<h3><a name="nearby-targets">Automatic Use of Nearby Targets</a></h3>
|
|
|
|
|
|
|
|
<p>Sometimes you want to use a symbol that doesn't match up with the
|
|
|
|
operand. SourceGen tries to anticipate situations where that might be
|
|
|
|
the case, and apply adjustments for you.</p>
|
|
|
|
|
|
|
|
<p>Suppose you have the following:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
LDA #$00
|
|
|
|
STA L1010
|
|
|
|
LDA #$20
|
|
|
|
STA L1011
|
|
|
|
LDA #$e1
|
|
|
|
STA L1012
|
|
|
|
RTS
|
|
|
|
|
|
|
|
L1010 .DD1 $00
|
|
|
|
L1011 .DD1 $00
|
|
|
|
L1012 .DD1 $00
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>Showing stores to three different labeled addresses is fine, but
|
|
|
|
the code is actually setting up a single 24-bit address. For clarity,
|
|
|
|
you'd like the output to reflect the fact that it's a single, multi-byte
|
|
|
|
variable. So, if you set a label named DATA at $1010, SourceGen removes the
nearby auto labels, and sets the numeric references to use your label:</p>
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
LDA #$00
|
|
|
|
STA DATA
|
|
|
|
LDA #$20
|
|
|
|
STA DATA+1
|
|
|
|
LDA #$e1
|
|
|
|
STA DATA+2
|
|
|
|
RTS
|
|
|
|
|
|
|
|
DATA .DD1 $00
|
|
|
|
.DD1 $00
|
|
|
|
.DD1 $00
|
|
|
|
</pre>
<p>If you decide that you really wanted each store to have its own
label, you can set labels on the other two addresses.  SourceGen won't
search for alternate labels if the numeric reference target has a
user-defined label.</p>
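<p>The label substitution described above can be sketched in Python.
This is a simplified model, not SourceGen's implementation: the function
names, the backward-only search, the two-byte search limit, and the
uppercase auto-label format are all assumptions made for the example.</p>

```python
def find_nearby_label(target, user_labels, max_distance=2):
    """Return (name, offset) for the closest user label at or below target.

    user_labels maps an address to a label name.  max_distance is an
    assumed search limit, not a documented SourceGen value.
    """
    for offset in range(max_distance + 1):
        name = user_labels.get(target - offset)
        if name is not None:
            return name, offset
    return None


def format_reference(target, user_labels):
    """Format an operand using a user label if one is close enough."""
    found = find_nearby_label(target, user_labels)
    if found is None:
        return "L{:04X}".format(target)  # fall back to an auto label
    name, offset = found
    return name if offset == 0 else "{}+{}".format(name, offset)


labels = {0x1010: "DATA"}
for address in (0x1010, 0x1011, 0x1012):
    print(format_reference(address, labels))
```

<p>Because an offset of zero is tried first, a target that has its own
user-defined label always uses that label, matching the behavior
described in the text.</p>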
<p>This is also used for self-modifying code.  For example:</p>
<pre>
1000: a9ff      LDA     #$ff
1002: 8d0610    STA     $1006
1005: 4900      EOR     #$00
</pre>
<p>The above changes the <code>EOR #$00</code> instruction to
<code>EOR #$ff</code>.  The operand target is $1006, but we can't
put a label there because it's in the middle of the instruction.  So
SourceGen puts a label at $1005 and adjusts it:</p>
<pre>
        LDA     #$ff
        STA     L1005+1
L1005   EOR     #$00
</pre>
<p>If you really don't like the way this works, you can disable the
search for alternate targets entirely from the
<a href="settings.html#project-properties">project properties</a>.
Self-modifying code will always be adjusted because of the limitation
on mid-instruction labels.</p>
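<p>The mid-instruction case can be modeled with a short Python sketch.
It assumes the disassembler already knows where each instruction starts
and how long it is; the function names and the auto-label format are
invented for illustration and are not taken from SourceGen.</p>

```python
def instruction_start(target, instructions):
    """Return the start address of the instruction that contains target.

    instructions is a list of (address, length) pairs.
    """
    for address, length in instructions:
        if address + length > target >= address:
            return address
    return None


def label_for_target(target, instructions):
    """Label the containing instruction and add an offset if needed."""
    start = instruction_start(target, instructions)
    if start is None:
        return None
    offset = target - start
    label = "L{:04X}".format(start)
    return label if offset == 0 else "{}+{}".format(label, offset)


# The three instructions from the listing above: LDA, STA, EOR.
listing = [(0x1000, 2), (0x1002, 3), (0x1005, 2)]
print(label_for_target(0x1006, listing))
```

<p>The store to $1006 lands one byte into the EOR instruction at $1005,
so the operand is shown as the label on $1005 plus one.</p>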
<h2><a name="width-disambiguation">Width Disambiguation</a></h2>

<p>It's possible to interpret certain instructions in multiple ways.
For example, "LDA $0000" might be an absolute load from a 16-bit
address, or it might be a direct page load from an 8-bit address.
Humans can infer from the fact that it was written with a 4-digit address
that it's meant to be absolute, but assemblers often treat operands
purely as numbers, and would just see "LDA 0".  Common practice is to
use the shortest instruction possible.</p>
<p>Every assembler seems to address the problem in a slightly different
way.  Some use opcode suffixes, others use operand prefixes, and some
allow both.  You can configure how the disambiguators appear in the
<a href="settings.html#app-settings">application settings</a>.</p>
<p>SourceGen will only add width disambiguators to opcodes or operands when
they are needed, with one exception: the opcode suffix for long
(24-bit address) operations is always applied.  This is done because some
assemblers require it, insisting on "LDAL" rather than "LDA" for an
absolute long load, and because it can make 65816 code easier to read.</p>
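<p>To make the rule concrete, here is a small Python sketch of when a
width hint would be needed.  It covers only the three LDA encodings
mentioned in this section and assumes an assembler that picks the
shortest encoding that fits the operand value.  The function names are
hypothetical and this is not how SourceGen is actually written.</p>

```python
# LDA opcodes on the 65816: addressing mode and operand size in bytes.
LDA_ENCODINGS = {
    0xA5: ("direct page", 1),
    0xAD: ("absolute", 2),
    0xAF: ("absolute long", 3),
}


def min_operand_bytes(value):
    """Smallest operand size that can hold the value."""
    if value >= 0x10000:
        return 3
    if value >= 0x100:
        return 2
    return 1


def needs_width_hint(opcode, operand):
    """Decide whether the opcode or operand needs a width disambiguator."""
    mode, width = LDA_ENCODINGS[opcode]
    if width == 3:
        return True  # per the text, the long suffix is always applied
    # A hint is needed when the assembler would otherwise pick a
    # shorter encoding than the one in the original binary.
    return width > min_operand_bytes(operand)


print(needs_width_hint(0xAD, 0x0000))  # absolute load from $0000
print(needs_width_hint(0xA5, 0x00))    # direct page load from $00
```

<p>The absolute load from $0000 needs a hint because the value also fits
in a direct page operand; the direct page load does not.</p>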
<h2><a name="pseudo-ops">Data and Directive Pseudo-Opcodes</a></h2>

<p>There are only three assembler directives that appear in the code list:</p>
<ul>
  <li>.EQ - defines a symbol's value.  These are generated automatically
    when an operand that matches a platform or project symbol is found.</li>
  <li>.ORG - changes the target address.</li>
  <li>.RWID - specifies the width of the accumulator and index registers
    (65816 only).  Note this doesn't change the actual width, just tells
    the assembler that the width has changed.</li>
</ul>
<p>Every data item is represented by a pseudo-op.  Some of them may
represent hundreds of bytes and span multiple lines.</p>
<ul>
  <li>.DD1, .DD2, .DD3, .DD4 - basic "define data" op.  A 1-4 byte
    little-endian value.</li>
  <li>.DBD2, .DBD3, .DBD4 - "define big-endian data".  2-4 bytes of
    big-endian data.</li>
  <li>.BULK - data packed in as compact a form as the assembler allows.
    Useful for chunks of graphics data.</li>
  <li>.FILL - a series of identical bytes.  The operand
    is the byte count, followed by the byte value.</li>
</ul>
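<p>For readers who want to see the byte order spelled out, the Python
sketch below shows the bytes that the data pseudo-ops in the list above
would stand for.  The helper names are invented for this example and are
not part of SourceGen.</p>

```python
def dd(value, size):
    """Bytes for a .DD1 to .DD4 item: least significant byte first."""
    return value.to_bytes(size, "little")


def dbd(value, size):
    """Bytes for a .DBD2 to .DBD4 item: most significant byte first."""
    return value.to_bytes(size, "big")


def fill(count, value):
    """Bytes for a .FILL item: the same byte repeated count times."""
    return bytes([value]) * count


print(dd(0x1234, 2).hex())   # .DD2 $1234
print(dbd(0x1234, 2).hex())  # .DBD2 $1234
print(fill(4, 0xFF).hex())   # .FILL 4,$ff
```

<p>The same 16-bit value $1234 is stored as 34 12 by .DD2 and as
12 34 by .DBD2.</p>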
<p>In addition, several pseudo-ops are defined for string constants:</p>
<ul>
  <li>.STR - basic ASCII string.</li>
  <li>.RSTR - string in reverse order.</li>
  <li>.ZSTR - null-terminated string.</li>
  <li>.DSTR - Dextral Character Inverted string.  The high bit of the
    last byte is flipped.</li>
  <li>.L1STR - string prefixed with a length byte.</li>
  <li>.L2STR - string prefixed with a length word.</li>
</ul>
<p>If the characters have their high bits set -- commonly referred to
as "high ASCII" -- an upward arrow will be added to the pseudo-op.  How
these strings are written out in the generated assembly source varies
from one assembler to another.</p>
<p>You can configure these to look more like what your favorite assembler
uses in the
<a href="settings.html#app-settings">application settings</a>.</p>
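<p>The string formats listed above differ only in how the bytes are laid
out in memory, which the following Python sketch tries to show.  The
helper names are invented for this example, and the little-endian byte
order of the .L2STR length word is an assumption rather than something
stated on this page.</p>

```python
def rstr(text):
    """.RSTR: the characters stored in reverse order."""
    return text.encode("ascii")[::-1]


def zstr(text):
    """.ZSTR: the characters followed by a zero byte."""
    return text.encode("ascii") + b"\x00"


def dstr(text):
    """.DSTR: the high bit of the last byte is flipped."""
    data = bytearray(text.encode("ascii"))
    if data:
        data[-1] ^= 0x80
    return bytes(data)


def l1str(text):
    """.L1STR: a one-byte length followed by the characters."""
    data = text.encode("ascii")
    return bytes([len(data)]) + data


def l2str(text):
    """.L2STR: a two-byte length (assumed little-endian), then the text."""
    data = text.encode("ascii")
    return len(data).to_bytes(2, "little") + data


def high_ascii(data):
    """Set the high bit on every byte, giving "high ASCII"."""
    return bytes(b | 0x80 for b in data)


for encoded in (rstr("HI"), zstr("HI"), dstr("HI"), l1str("HI"), l2str("HI")):
    print(encoded.hex())
print(high_ascii(b"HI").hex())
```

<p>For the two-character string "HI", only the .DSTR form changes a
character byte: the final "I" ($49) becomes $c9.</p>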
</div>

<div id=footer>
<p><a href="index.html">Back to index</a></p>
</div>
</body>
<!-- Copyright 2018 faddenSoft -->
</html>