
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="main.css" rel="stylesheet" type="text/css" />
<title>Intro - 6502bench SourceGen</title>
</head>

<body>
<div id="content">
<h1>6502bench SourceGen: Intro</h1>
<p><a href="index.html">Back to index</a></p>

<h2><a name="overview">Overview</a></h2>

<p>SourceGen converts 6502/65C02/65816 machine-language programs to
assembly-language source.</p>

<p>SourceGen has two purposes.  The first is to be a really nice
disassembler for the 6502 and related CPUs.  Code tracing with status
flag tracking makes it easier to separate the code from the data,
automatic formatting of character strings and filled-data areas helps
get the data regions sorted out, and modern IDE-style features like
cross-reference generation and color-highlighted bookmarks help
navigate the code while trying to figure out what it does.  A
disassembler should help you understand the code, not just dump the
instructions to a text file.</p>
<p>The computer I built back in 2014 has a 4GHz CPU and 8GB of RAM.  I
figured we should put the power of modern computing hardware to good use.</p>

<p>The second purpose is to facilitate sharing and collaboration.  Most
disassemblers generate output for a specific assembler, or in a way that's
generic enough to match most any assembler; either way, you're left with
a text file in somebody's idea of the "correct" format.  SourceGen keeps
everything in an assembler-neutral format, and provides numerous options
for customizing the display, so that multiple people viewing the same
project can each do so with the conventions they are accustomed to.
Code and data operands can be formatted in various numeric formats or
as symbols.
The project file uses a text format that is fairly diff-friendly, so
sharing projects through git works reasonably well.  If you want source
code you can assemble, SourceGen will generate code optimized for the
assembler of your choice.</p>

<p>The sharing and collaboration ideas only work if the formatting
capabilities within SourceGen are sufficiently flexible.  If you need to
generate assembly source and tweak it a bunch to express the intent of
the original code, then passing a SourceGen project around won't work.
This sort of thing is a bit outside the bounds of what a typical
disassembler does, so it remains to be seen whether SourceGen succeeds at
what it's trying to do, and also whether what it's trying to do is
something that people actually want.</p>

<p>You can get started by watching the
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and playing with the
<a href="tutorials.html">tutorials</a>.</p>


<h2><a name="fundamental-concepts">Fundamental Concepts</a></h2>

<p>The next few sections present some general concepts and terminology.  The
rest of the documentation assumes you've read and understood this.</p>
<p>It will be helpful if you already understand something about the 6502
instruction set and assembly-language programming, but disassembling
other programs is actually a pretty good way to learn how to code in
assembly.  You will need to be familiar with hexadecimal numbers and
general programming concepts to make sense of this, however.</p>

<h2><a name="begin">About 6502 Code</a></h2>

<p>For brevity's sake, "6502 code" should be taken to mean "code for
the 6502 CPU or any of its derivatives, including but not limited to
the 65C02 and 65816".  So let's talk about 6502 code.</p>

<p>Code usually arrives in a big binary blob.  Some of it will be
instructions, some of it will be data, some will be empty space used
for variable storage.  Part of the challenge of disassembly is
identifying which parts of the file contain which.</p>

<p>Much of the code you'll find for the 6502 was written by humans,
rather than generated by a compiler, which means it won't conform to a
standard set of conventions.  However, most programmers will use
subroutines, which can be identified and analyzed in isolation.  Subroutines
are often interspersed with variable storage, referred to as a "stash".
Variables and constants may be single-byte or multi-byte, the latter
typically in little-endian byte order.</p>

<p>Much of the data in a typical program is read-only, often in the
form of graphics or character string data.  Graphics can be difficult
to recognize automatically, but strings can be identified with a
reasonable degree of confidence.  Address tables, which are a collection
of addresses to other things, are also fairly common.</p>

<p>A simple disassembler would begin at the top of the file and just
start converting bytes to instructions.  Unfortunately there's no reliable
way to tell the difference between instructions, data, and variable
stashes.  When the converter hits data bytes it'll start generating
instructions that won't make sense.  You'll have another problem when the
data ends and code resumes: 6502 instructions are variable-length, so if
the last byte of the data area appears to be a three-byte instruction,
the first two bytes of the next instruction area will be gobbled up.</p>

<p>To make things even more difficult (sometimes deliberately), programmers
use a trick where they "embed" an instruction
inside another instruction.  This allows code to branch to two different
entry points, one of which will set a flag or load a register, and then
continue on to common code.</p>
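<p>A minimal Python sketch of the embedded-instruction trick.  The byte
sequence and the three-entry opcode table are made up for illustration and
are not taken from any real program.  Decoding the same bytes from two
entry points yields two different instruction streams that rejoin at the
final RTS.</p>

```python
# Tiny opcode table: opcode byte -> (mnemonic, total length in bytes).
# Only the three opcodes used below are listed; this is not a full 6502 table.
OPCODES = {
    0x2C: ("BIT abs", 3),
    0xA9: ("LDA #imm", 2),
    0x60: ("RTS", 1),
}

def decode(data, offset):
    """Decode instructions linearly from offset until the data runs out."""
    result = []
    while offset in range(len(data)):
        mnemonic, length = OPCODES[data[offset]]
        operand = data[offset + 1:offset + length]
        result.append((offset, mnemonic, operand.hex()))
        offset += length
    return result

# LDA #$00 / BIT $01A9 / RTS, where the BIT operand hides an "LDA #$01".
code = bytes([0xA9, 0x00, 0x2C, 0xA9, 0x01, 0x60])

entry_a = decode(code, 0)   # executes the BIT, so A stays $00
entry_b = decode(code, 3)   # enters in the middle of the BIT: LDA #$01
```

<p>Entering at offset 0 loads $00 and passes harmlessly through the BIT;
entering at offset 3 loads $01 instead.</p>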

<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
The called subroutine pulls the caller's address off the stack, uses it to
access the parameters, then pushes the address back on after modifying it to
point to an address past the inline data.  This can be very confusing
for the disassembler, which will try to interpret the inline data as
instructions.</p>
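<p>The address arithmetic behind the inline-data trick can be sketched in
Python.  This models only JSR (not JSL), and the function name is
hypothetical.  It relies on the fact that JSR pushes the address of its own
last byte, and RTS adds one to whatever it pulls.</p>

```python
def call_with_inline_data(jsr_address, param_count):
    """Model the return-address fix-up done by a subroutine with inline data.

    JSR is three bytes long and pushes the address of its own last byte
    (jsr_address + 2); RTS pulls that value and adds one.
    """
    pushed = jsr_address + 2
    first_param = pushed + 1            # inline data starts right after the JSR
    adjusted = pushed + param_count     # what the subroutine pushes back
    resume = adjusted + 1               # where RTS continues execution
    return first_param, resume
```

<p>For a JSR at $2000 followed by three bytes of inline data, the parameters
start at $2003 and execution resumes at $2006.</p>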

<p>Sometimes code is loaded at one location, then moved to another and
executed there.  If you're disassembling an executing program you don't
have to worry about this, but if you're disassembling the binary from the
loadable file on disk then you need to track the address changes.  The
address is communicated to the assembler with a "pseudo-opcode", usually
something like "ORG".  Other pseudo-op directives are used to define external
symbols and (for 65816 code) register widths.</p>

<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
range from $0000 to $ffff.  (I'm going to write hex values with a
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
that's what 6502 systems commonly used.)  The 65816 has a 24-bit address
space, but it's not contiguous -- a branch that extends past the end will
wrap around to the start of the 64KiB "bank".  For 16-bit instruction
operands, the bank is identified for instruction and data addresses
by the program bank register and the data bank register, respectively.
The disassembler can't always discern the value of the data bank
register through static analysis, so some user input may be required.</p>
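<p>The bank wrap-around can be expressed as a short Python sketch.  The
function name is hypothetical; it only illustrates the arithmetic and is not
SourceGen's implementation.</p>

```python
BANK_SIZE = 0x10000  # 64KiB

def advance_in_bank(address, delta):
    """Move a 24-bit address by delta bytes, wrapping within its 64KiB bank."""
    bank_base = (address // BANK_SIZE) * BANK_SIZE
    return bank_base + (address + delta) % BANK_SIZE
```

<p>Moving four bytes forward from $01fffe lands on $010002, not $020002.</p>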

<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
in it.  Some of the flags determine whether a conditional branch is taken
or not, which is important because some branches appear to be conditional
but actually are always or never taken in practice.  The disassembler needs
to be able to figure this out so that it doesn't try to disassemble the
bytes that follow an always-taken branch.
A more significant concern is the M and X flags found on the 65802/65816,
which determine the width of the registers and of immediate load
instructions.  If you don't know what state the flags are in, you can't
know whether <code>LDA #value</code> is two bytes or three, and the
disassembly of the instruction stream will come out wrong.</p>
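<p>A sketch of how the flags feed into instruction length, in Python.  It
handles only two immediate-load opcodes, and the use of <code>None</code>
for "unknown" is an assumption made for this example.</p>

```python
LDA_IMM = 0xA9  # width follows the M flag
LDX_IMM = 0xA2  # width follows the X flag

def immediate_length(opcode, m_flag, x_flag):
    """Return the instruction length in bytes, or None if the flag is unknown.

    A flag value of 1 selects 8-bit registers, 0 selects 16-bit registers,
    and None means static analysis could not determine the state.
    Only LDA_IMM and LDX_IMM are supported by this sketch.
    """
    flag = m_flag if opcode == LDA_IMM else x_flag
    if flag is None:
        return None
    return 2 if flag == 1 else 3
```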

<p>Some addresses correspond to memory-mapped I/O, rather than RAM or ROM.
Accessing the address can have side effects, like changing between text
and graphics modes.  Sometimes reading and writing have different effects.
For example, on later models of the Apple II, reading from
$C000 returns the most recently hit key, while writing to $C000 changes
how 80-column display memory is mapped.</p>
<p>On a few systems, such as the Atari 2600, RAM, ROM, and registers can
appear at multiple locations, "mirrored" across the address space.</p>

<h3><a name="charenc">Character Encoding</a></h3>

<p>The American Standard Code for Information Interchange (ASCII) was
developed in the 1960s, and became widely used as the means for representing
text data on a computer.  It's compatible with Unicode, in that the
binary representation of an ASCII string is exactly the same when
expressed as a Unicode string with UTF-8 encoding.</p>
<p>Not all 6502-based computers used ASCII, notably those from Commodore
International (e.g. PET, VIC-20, 64, 128), which used variants
collectively known as "PETSCII".  PETSCII had most of the same symbols,
but rearranged them, and added a number of graphical symbols.  This was
further complicated by the use of two different character sets, one of
which dropped lower-case letters in favor of additional symbols, and
the use of a separate encoding for characters stored in the text frame
buffer ("screen codes").</p>
<p>Apple II computers were based on ASCII, but tended to store bytes
with the high bit set rather than clear.  This is known as "high ASCII".</p>
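<p>Converting high ASCII to plain ASCII is just a matter of clearing the
high bit.  A minimal Python sketch, with a hypothetical function name:</p>

```python
def decode_high_ascii(data):
    """Convert high-ASCII bytes (high bit set) to a Python string.

    Returns None if any byte has its high bit clear.
    """
    if any(b not in range(0x80, 0x100) for b in data):
        return None
    return bytes(b - 0x80 for b in data).decode("ascii")
```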

<p>SourceGen allows you to specify that a string is encoded with ASCII,
High ASCII, C64 PETSCII, or C64 Screen Codes.  Because the goal is to
generate assembly sources for cross-assemblers, the C64 character
support is limited to the set that overlaps with ASCII.</p>
<p>For the most part only printable characters are accepted in strings,
but certain control characters are also allowed.  The characters for
bell ($07), linefeed ($0a), and carriage return ($0d) are recognized as
string data, and in C64 PETSCII a number of text color and formatting
control codes are also allowed.</p>


<h2><a name="sgintro">How SourceGen Works</a></h2>

<p>SourceGen employs a partial emulation technique that traces the flow
of execution.  Most of what a given instruction does isn't important;
only its effect on the flow of execution matters.</p>

<p>The code tracing has to start somewhere, so SourceGen uses "code entry
point hints" to identify places where execution may begin.  By default,
a hint is placed at the start of the file.  From there, the tracing process
walks through the code, pursuing all branches.  In many cases, if you
mark all external entry points, SourceGen will automatically find all
executable code and separate it from variable storage and data areas.</p>
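<p>The general idea of tracing from entry points can be sketched in Python.
This is a simplified illustration, not SourceGen's algorithm: the opcode
table covers only three instructions, conditional branches and status flags
are ignored, and the sample program is invented for the example.</p>

```python
# Minimal opcode table: opcode -> (length, kind).  Only the opcodes used in
# the sample are listed; a real tracer needs the full instruction set.
OPCODES = {
    0x4C: (3, "jump"),   # JMP abs
    0x18: (1, "next"),   # CLC
    0x60: (1, "stop"),   # RTS
}

def trace(data, load_address, entry_offsets):
    """Return the sorted offsets of every instruction reachable from the hints."""
    visited = set()
    pending = list(entry_offsets)
    while pending:
        offset = pending.pop()
        if offset in visited or offset not in range(len(data)):
            continue                      # already seen, or outside the file
        length, kind = OPCODES[data[offset]]
        visited.add(offset)
        if kind == "jump":
            target = data[offset + 1] + data[offset + 2] * 256
            pending.append(target - load_address)
        elif kind == "next":
            pending.append(offset + length)
    return sorted(visited)

# Three jump vectors followed by CLC / RTS, loaded at $1000.
program = bytes.fromhex("4c0910" "4cef10" "4c3012" "18" "60")
```

<p>With only the default hint at offset 0, the second and third JMPs are
never reached, so their bytes would be left as data.  Adding hints at offsets
3 and 6 brings them into the instruction set.</p>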

<p>As noted earlier, tracking the processor status flags can make the
analysis more accurate.  Identifying situations where a branch instruction
is always or never taken avoids mis-categorizing a data region as code.
On the 65816, it's absolutely crucial to track the M/X flags, since those
affect the width of instructions.  SourceGen tracks the value of the
processor flags at every instruction, blending sets of flags together when
multiple paths of execution converge.</p>
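<p>One plausible way to blend flag sets is a three-state merge, sketched
here in Python.  The dictionary representation and the use of
<code>None</code> for "unknown" are assumptions for this example; the source
does not describe SourceGen's internal representation.</p>

```python
def merge_flags(path_a, path_b):
    """Blend two flag sets where execution paths converge.

    Each set maps a flag name to 0, 1, or None (unknown).  A flag keeps its
    value only if both paths agree on it; otherwise it becomes unknown.
    """
    return {
        name: path_a[name] if path_a[name] == path_b[name] else None
        for name in path_a
    }
```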

<p>Once instructions and data have been separated, the instruction operands
can be examined.  Branches, loads, and stores that reference an address
that falls inside the address space covered by the file can be replaced
with a symbol.  Operands that refer to addresses outside the file, such
as ROM or operating system routines, can be replaced with a symbol defined
by an equate directive.</p>

<p>(For more details on how this works, see the
<a href="analysis.html">analysis appendix</a>.)</p>


<h3><a name="scripts">Extension Scripts</a></h3>

<p>Extension scripts are C# source files that are compiled and
executed by SourceGen.  They can be added to a project from SourceGen's
runtime data directory, or can live in the directory next to the project
file.</p>
<p>In the current implementation, scripts are only called to examine
JSR, JSL, and BRK instructions.  They can format nearby bytes as inline
data, or apply symbols to operands.</p>

<p>To reduce the chances of a script causing problems, all scripts are
executed in a sandbox with severely restricted access.  Notably, nothing
in the sandbox can access files, except to read files from the PluginDll
directory.</p>
<p>The PluginDll directory lives next to the SourceGen executable, and
contains all of the compiled script DLLs, as well as two pre-built
application DLLs that plugins are allowed access to.  The contents
are persistent, to avoid recompiling the scripts every time SourceGen
is launched, but may be manually deleted without harm.</p>
<p>More details can be found in the
<a href="advanced.html#extension-scripts">advanced topics</a> section.</p>


<h3><a name="hints">Analyzer Hints</a></h3>

<p>Sometimes SourceGen can't automatically find the start or end of an
instruction stream, or gets confused by inline data.  These situations
can be resolved by adding an appropriate hint.</p>

<p><b>Code entry point hints</b> tell the analyzer to add the offset
to the list of instruction start points.  Suppose you've got a code
library that begins with jump vectors, like this:</p>
<pre>
1000: 4c0910    JMP     $1009
1003: 4cef10    JMP     $10ef
1006: 4c3012    JMP     $1230
1009: 18        CLC
</pre>

<p>When opened with SourceGen, it will look like this:</p>
<pre>
         .ORG    $1000
         JMP     L1009

         .DD1    $4c
         .DD1    $ef
         .DD1    $10
         .DD1    $4c
         .DD1    $30
         .DD1    $12
L1009    CLC
</pre>

<p>SourceGen doesn't see any code that jumps to $1003 or $1006, so it
assumes those are data.  Further, the functions at those addresses may
also be considered data unless some bit of code reachable from L1009
calls into them.  If you add a code hint to $1003 and $1006,
you'll get better results:</p>
<pre>
         .ORG    $1000
         JMP     L1009
         JMP     L10ef
         JMP     L1230
L1009    CLC
</pre>

<p>Be careful that you only add hints to the instruction opcode.  If
you applied hints to the full range of bytes from $1003 to $1008, you would
end up with this:</p>
<pre>
         .ORG    $1000
         JMP     L1009
         JMP &#x25bc;   L10ef
         BPL &#x25bc;   L1053
         JMP &#x25bc;   L1230
         BMI     L101b
L1009    CLC
</pre>

<p>The exact set of instructions shown depends on your CPU configuration.
The problem is that the bytes in the middle of the instruction have
been marked as entry points, and SourceGen is treating them as
embedded instructions.  $EF and $12 aren't valid 6502 opcodes, so
they're being ignored, but $10 is BPL and $30 is BMI.  Because hinting
multiple consecutive bytes is rarely useful, SourceGen only applies code
hints to the first byte in a selected line.</p>

<p><b>Data hints</b> tell the analyzer when it should stop.  For example,
suppose address $ff00 is known to always be nonzero, and the code uses
that fact to get a branch-always on the 6502:</p>
<pre>
         .ORG    $1000
         LDA     $ff00
         BNE     L1010
         BRK     $11
</pre>

<p>By placing a data hint on the BRK, you're telling the analyzer that
it should stop the current path of execution.  (Note that this example
would actually be better solved by setting a status flag override on
the BNE that sets Z=0, so the code tracer will know it's a branch-always
and do the right thing.)  It's only necessary to place a hint on the
very first (opcode) byte.  Placing a data hint in the middle of what
SourceGen believes to be an instruction will have no effect.</p>
<p>As with code hints, only the first byte in each selected line will
be hinted.</p>

<p><b>Inline data hints</b> identify bytes as being part of the
instruction stream, but not instructions.  A simple example of this
is the ProDOS 8 call interface on the Apple II, which looks like this:</p>
<pre>
         JSR     $bf00
         .DD1    $function
         .DD2    $address
         BCS     BAD
</pre>

<p>The three bytes following the <code>JSR $bf00</code> should be hinted
as inline data, so that the code analyzer skips them and continues the
analysis at the <code>BCS</code>.  Because you need to hint <i>every</i> byte
of inline data, all bytes in a selected line will receive hints.</p>
<p>If code branches into a region that is marked as inline data, the
branch will be ignored.</p>


<h2><a name="sgconcepts">SourceGen Concepts</a></h2>

<p>As you work on a disassembled file, formatting operands and adding
comments, everything you do is saved in the project file as "metadata".
None of the data from the file being disassembled is included.  This
should allow project files to be shared without violating the copyright
of the work being disassembled.  (This will vary by region.  Also, note
that the mere act of disassembling a piece of software may be illegal in
some cases.)</p>

<p>To avoid mix-ups where the wrong data file is used, the file's length
and CRC are stored in the project file.  SourceGen will refuse to open a
project if the data file's length and CRC don't match.</p>
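<p>A check of this kind might look like the following Python sketch.  The
use of CRC-32 via <code>zlib.crc32</code> is an assumption; the text above
does not say which CRC variant SourceGen uses, and the function name is
hypothetical.</p>

```python
import zlib

def data_file_matches(data, expected_length, expected_crc):
    """Check a data file against the length and CRC recorded for a project."""
    return len(data) == expected_length and zlib.crc32(data) == expected_crc
```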

<p>Most of the data in the project file is associated with a file offset.
When you create a comment, you aren't associating it with line 53, you're
associating it with the 127th byte in the file.  This ensures that, as the
project evolves, the comment you wrote is always connected to the
same instruction or data item.  This also means you can't have two
comments on the same line -- each offset only has room for one.  By
convention, file offsets are always shown as a six-digit hexadecimal value
with a leading '+', e.g. "+0012ab".  This makes it easy to distinguish
between an address and an offset.</p>
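<p>The two notations are easy to produce with format strings.  A small
Python sketch with hypothetical helper names:</p>

```python
def format_offset(offset):
    """Format a file offset as '+' followed by six hex digits."""
    return "+{:06x}".format(offset)

def format_address(address):
    """Format a 16-bit address with the '$' prefix used in this document."""
    return "${:04x}".format(address)
```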

<p>Instruction and data operands can be formatted in various ways.  The
formatting choice is associated with the first offset of the item.  For
instructions the number of bytes in the operand is determined by the opcode
(and, on the 65816, the M/X status flags).  For data items the length
can be a single byte or an entire file.  Operand formats are not allowed
to overlap.</p>

<p>When an instruction or data operand references an address, we call
it a <b>numeric reference</b>.  When the target address has a label, and
the operand uses that symbol, we call that a <b>symbolic reference</b>.
SourceGen tries to establish symbolic references whenever possible,
so that the generated assembly source doesn't refer to hard-coded
locations within the program.  Labels are generated automatically for
the targets of numeric references.</p>

<p>As your understanding of the disassembled code develops, you will want
to add comments explaining it.  SourceGen projects have three kinds of
comments:</p>
<ol>
  <li>End-of-line comments.  As the name implies, these appear at the
    end of a line, to the right of the opcode or operand.</li>
  <li>Long comments, also known as multi-line comments.  These get a
    line all to themselves, and may span multiple lines.</li>
  <li>Notes.  Like long comments, these get a line to themselves.  Unlike
    long comments, these do not appear in generated assembly code.  They
    are a way for you to leave notes to yourself, perhaps "don't forget
    to figure this out" or "this is the cool part".</li>
</ol>
<p>Every file offset can have one of each.</p>

<p>Labels and comments may disappear if you associate them with a file
offset that is in the middle of a multi-byte instruction or data item.
For example, suppose you put a long comment at offset +000010, and then
mark a 50-byte region starting at offset +000008 as an ASCII string.  The
comment won't be deleted, but won't be displayed either.  The same thing
can happen to labels.  SourceGen will try to prevent this from happening
by splitting formatted data into sub-regions at label boundaries.</p>


<h2><a name="about-symbols">All About Symbols</a></h2>

<p>A symbol has two basic parts, a label and a value.  The label is a short
ASCII string; the value may be an 8-to-24-bit address or a 32-bit numeric
constant.  Symbols can be defined in different ways, and applied in
different ways.</p>

<p>The label syntax is restricted to a format that should be compatible
with most assemblers:</p>
<ul>
  <li>2-32 characters long.</li>
  <li>Starts with a letter or underscore.</li>
  <li>Composed of ASCII letters, numbers, and the underscore.</li>
</ul>
<p>Label comparisons are case-sensitive, as is customary for programming
languages.</p>
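<p>The three syntax rules map directly onto a regular expression.  A Python
sketch, with hypothetical names, that checks only those rules:</p>

```python
import re

# 2-32 characters, first is a letter or underscore, the rest are ASCII
# letters, digits, or underscores.
LABEL_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{1,31}")

def is_valid_label(label):
    """Return True if the label satisfies the three syntax rules."""
    return LABEL_PATTERN.fullmatch(label) is not None
```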
<p>Sometimes the purpose of a subroutine or variable isn't immediately
clear, but you can take a reasonable guess.  You can document your
uncertainty by adding a question mark ('?') to the end of the label.
This isn't really part of the label, so it won't appear in the assembled
output, and you don't have to include it when searching for a symbol.</p>
<p>Some assemblers restrict the set of valid labels further.  For example,
64tass uses a leading underscore to indicate a local label, and reserves
a double leading underscore (e.g. <code>__label</code>) for its own
purposes.  In such cases, the label will be modified to comply with the
target assembler syntax.</p>

<p>Operands may use parts of symbols.  For example, if you have a label
<code>MYSTRING</code>, you can write:</p>
<pre>
MYSTRING .STR    "hello"
         LDA     #&lt;MYSTRING
         STA     $00
         LDA     #&gt;MYSTRING
         STA     $01
</pre>
<p>See <a href="#symbol-parts">Parts and Adjustments</a> for more details.</p>
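<p>Assuming the common convention in which <code>#&lt;MYSTRING</code> selects
the low byte of a 16-bit address and <code>#&gt;MYSTRING</code> selects the
high byte, the values stored in $00 and $01 could be computed as in this
Python sketch (the function names are hypothetical):</p>

```python
def low_byte(address):
    """Low-byte part of an address (bits 0-7)."""
    return address % 256

def high_byte(address):
    """High-byte part of a 16-bit address (bits 8-15)."""
    return (address // 256) % 256
```

<p>If <code>MYSTRING</code> were at $12ab, $00 would receive $ab and $01
would receive $12, forming a little-endian pointer.</p>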

<p>Symbols that represent a memory address within a project are treated
differently from those outside a project.  We refer to these as internal
and external addresses, respectively.</p>


<h3><a name="connecting-operands">Connecting Operands with Labels</a></h3>

<p>Suppose you have the following code:</p>
<pre>
         LDA     $1234
         JSR     $2345
</pre>
<p>If we put that in a source file, it will assemble correctly.
However, if those addresses are part of the file, the code may break if
changes are made and things assemble to different addresses.  It would
be better to generate code that references labels, e.g.:</p>
<pre>
         LDA     my_data
         JSR     nifty_func
</pre>
<p>SourceGen tries to establish labels for address operands automatically.
How this works depends on whether the operand's address is inside the file or
external, and whether there are existing labels at or near the target
address.  The details are explored in the next few sections.</p>
<p>On the 65816 this process is trickier, because addresses are 24 bits
instead of 16.  For a control-transfer instruction like <code>JSR</code>,
the high 8 bits come from the Program Bank Register (K).  For a data-access
instruction like <code>LDA</code>, the high 8 bits come from the Data
Bank Register (B).  The PBR value is determined by the address at which
the code is executing, so it's easy to determine.  The DBR value can be
set arbitrarily.  Sometimes it's easy to figure out, sometimes it has
to be specified manually.</p>


<h3><a name="internal-address-symbols">Internal Address Symbols</a></h3>

<p>Symbols that represent an address inside the file being disassembled
are referred to as <i>internal</i>.  They come in two varieties.</p>

<p><b>User labels</b> are labels added to instructions or data by the user.
The editor will try to prevent you from creating a label that has the same
name as another symbol, but if you manage to do so, the user label takes
precedence over symbols from other sources.  User labels may be tagged
as non-unique local, unique local, global, or global and exported.  Local
vs. global is important for the label localizer, while exported symbols
can be pulled directly into other projects.</p>

<p><b>Auto labels</b> are automatically generated labels placed on
instructions or data offsets that are the target of operands.  They're
formed by appending the hexadecimal address to the letter "L", with
additional characters added if some other symbol has already defined
that label.  Options can be set that change the "L" to a character or
characters based on how the label is referenced, e.g. "B" for branch targets.
Auto labels are only added where they are needed, and are removed when
no longer necessary.  Because auto labels may be renamed or vanish, the
editor will try to prevent you from referring to them explicitly when
editing operands.</p>


<h3><a name="external-address-symbols">External Address Symbols</a></h3>

<p>Symbols that represent an address outside the file being disassembled
are referred to as <i>external</i>.  These may be ROM entry points,
data buffers, zero-page variables, or a number of other things.  Because
the memory addresses they appear at aren't within the bounds of the file,
we can't simply put an address label on them.  Three different mechanisms
exist for defining them.  If an instruction or data operand refers to
an address outside the file bounds, SourceGen looks for a symbol with
a matching address value.</p>

<p><b>Platform symbols</b> are defined in platform symbol files.  These
are named with a ".sym65" extension, and have a fairly straightforward
name/value syntax.  Several files for popular platforms come with SourceGen
and live in the <code>RuntimeData</code> directory.  You can also create your
own, but they have to live in the same directory as the project file.</p>

<p>Platform symbols can be addresses or constants.  Addresses are
limited to 24-bit values, and are matched automatically.  Constants may
be 32-bit values, but must be specified manually.</p>

<p>If two platform symbols have the same label, only the most recently read
one is kept.  If two platform symbols have different labels but the
same value, both symbols will be kept, but the one in the file loaded
last will take priority when doing a lookup by address.  If symbols with
the same value are defined in the same file, the one whose label appears
first alphabetically takes priority.</p>

<p>Platform address symbols have an optional width.  This can be used
to define multi-byte items, such as two-byte pointers or 256-byte stacks.
If no width is specified, a default value of 1 is used.  Widths are ignored
for constants.
Overlapping symbols are resolved as described earlier, with symbols loaded
later taking priority over previously-loaded symbols.  In addition,
symbols defined closer to the target address take priority, so if you put
a 4-byte symbol in the middle of a 256-byte symbol, the 4-byte symbol will
be visible because the start point is closer to the addresses it covers
than the start of the 256-byte range.</p>
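<p>The priority rules for overlapping multi-byte symbols can be sketched in
Python.  The tuple layout, the symbol names, and the function name are
hypothetical; this illustrates the rules described above rather than
SourceGen's actual lookup code.</p>

```python
def lookup_symbol(symbols, address):
    """Find the symbol covering an address, preferring the closest start.

    symbols is a list of (label, start, width) tuples in load order.
    Returns (label, offset_from_start), or None if nothing covers address.
    """
    candidates = [
        (start, order, label)
        for order, (label, start, width) in enumerate(symbols)
        if address in range(start, start + width)
    ]
    if not candidates:
        return None
    start, _, label = max(candidates)   # highest start wins, then latest load
    return label, address - start
```

<p>With a 256-byte <code>STACK</code> at $0100 and a 4-byte <code>FOO</code>
at $0140, address $0141 resolves to <code>FOO</code> plus one, while $0150
still resolves to <code>STACK</code>.</p>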

<p>Platform symbols can be designated for reading, writing, or both.
Normally you'd want both, but if an address is a memory-mapped I/O
location that has different behavior for reads and writes, you'd want
to define two different symbols, and have the correct one applied
based on the access type.</p>

<p><b>Project symbols</b> behave like platform symbols, but they are
defined in the project file itself, through the Project Properties editor.
The editor will try to prevent you from creating two symbols with the same
name.  If two symbols have the same value, the one whose label comes
first alphabetically is used.</p>

<p>Project symbols always have precedence over platform symbols, allowing
you to redefine symbols within a project.  (You can "hide" a platform
symbol by creating a project symbol constant with the same name.  Use a
value like $ffffffff or $deadbeef so you'll know why it's there.)</p>

<p><b>Local variables</b> are redefinable symbols that are organized
into tables.  They're used to specify labels for zero-page addresses
and 65816 stack-relative instructions.  These are explained in more
detail in the next section.</p>


<h4><a name="local-vars">How Local Variables Work</a></h4>

<p>Local variables are applied to instructions that have zero
page operands (<code>op ZP</code>, <code>op (ZP),Y</code>, etc.), or
65816 stack relative operands
(<code>op OFF,S</code> or <code>op (OFF,S),Y</code>).  While they must be
unique relative to other kinds of labels, they don't have to be unique
with respect to earlier variable definitions.  So you can define
<code>TMP .EQ $10</code>, and a few lines later define
<code>TMP .EQ $20</code>.  This is handy because zero-page addresses are
often used in different ways by different parts of the program.  For
example:</p>
<pre>
         LDA     ($00),Y
         INC     $02
         ... elsewhere ...
         DEC     $00
         STA     ($01),Y
</pre>
<p>If we had given <code>$00</code> the label <code>PTR</code> and
<code>$02</code> the label <code>COUNT</code> globally,
the second pair of instructions would look all wrong.  With local
variable tables you can set <code>PTR=$00 COUNT=$02</code> for the first chunk,
and <code>COUNT=$00 PTR=$01</code> for the second chunk.</p>

<p>Local variables have a value and a width.  If we create a pair of
variable definitions like this:</p>
<pre>
PTR      .eq     $00        ;2 bytes
COUNT    .eq     $02        ;1 byte
</pre>
<p>Then this:</p>
<pre>
         STA     $00
         STX     $01
         LDY     $02
</pre>
<p>Would become:</p>
<pre>
         STA     PTR
         STX     PTR+1
         LDY     COUNT
</pre>

<p>The scope of a variable definition starts at the point where it is
defined, and stops when its definition is erased.  There are three
ways for a table to erase an earlier definition:</p>
<ol>
  <li>Create a new definition with the same name.</li>
  <li>Create a new definition that has an overlapping value.  For
    example, if you have a two-byte variable <code>PTR = $00</code>,
    and define a one-byte variable <code>COUNT = $01</code>, the
    definition for <code>PTR</code> will be cleared because its second
    byte overlaps.</li>
  <li>Tables have a "clear previous" flag that erases all previous
    definitions.  This doesn't usually cause anything to be generated in the
    assembly sources; instead, it just causes SourceGen to stop using
    that label.</li>
</ol>
<p>As you might expect, you're not allowed to have duplicate labels or
overlapping values in an individual table.</p>
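<p>The three erasure rules can be modeled with a small Python function.
The dictionary representation and the function name are assumptions made
for this sketch.</p>

```python
def apply_table(active, table, clear_previous=False):
    """Return the variables in scope after a local variable table is applied.

    active and table map a label to a (value, width) pair.
    """
    # Rule 3: the "clear previous" flag discards everything defined earlier.
    result = {} if clear_previous else dict(active)
    for label, (value, width) in table.items():
        new_range = set(range(value, value + width))
        # Rules 1 and 2: drop earlier definitions with the same name or an
        # overlapping address range.
        result = {
            old_label: (old_value, old_width)
            for old_label, (old_value, old_width) in result.items()
            if old_label != label
            and new_range.isdisjoint(range(old_value, old_value + old_width))
        }
        result[label] = (value, width)
    return result
```

<p>Starting with a two-byte <code>PTR</code> at $00, applying a table that
defines a one-byte <code>COUNT</code> at $01 leaves only <code>COUNT</code>
in scope.</p>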
<p>If a platform/project symbol has the same value as a local variable,
the local variable is used.  If the local variable definition is cleared,
use of the platform/project symbol will resume.</p>
<p>Not all assemblers support redefinable variables.  In those cases,
the symbol names will be modified to be unique (e.g. the second definition
of <code>PTR</code> becomes <code>PTR_1</code>), and variables will have
global scope.</p>


<h3><a name="unique-local-global">Unique vs. Non-Unique and Local vs. Global</a></h3>

<p>Most assemblers have a notion of "local" labels, which have a scope
that is book-ended by global labels.  These are handy for generic branch
target names like "loop" or "notzero" that you might want to use in
multiple places.  The exact definition of local label scope varies
between assemblers, so labels that you want to be local might have to
be promoted to global (and probably renamed).</p>
<p>SourceGen has a similar concept with a slight twist: they're called
non-unique labels, because the goal is to be able to use the same
label in more than one place.  Whether or not they actually turn out
to be local is a decision deferred to assembly source generation time.
(You can also declare a label to be a unique local if you like; the
auto-generated labels like "L1234" do this.)</p>
<p>When you're writing code for an assembler, it has to be unambiguous,
because the assembler can't guess at what the output should be.  For a
disassembler, the output is known, so a greater degree of ambiguity is
tolerable.  Instead of throwing errors and refusing to continue, the
source generator can modify the output until it works.  For example:</p>
<pre>
@LOOP    LDX     #$02
@LOOP    DEX
         BNE     @LOOP
         DEY
         BNE     @LOOP
</pre>
<p>This would confuse an assembler.  SourceGen already knows which @LOOP
is being branched to, so it can just rename one of them to "@LOOP1".</p>
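<p>The renaming step can be pictured with a short Python sketch.  This is
an illustration, not SourceGen's actual code: the function name, the
dictionary layout, and the rule of appending the lowest unused number are
assumptions.  So are the addresses, and the guess that the first branch
targets the DEX and the second targets the LDX, since the listing doesn't
show either.</p>
<pre>
```python
def uniquify_labels(labels_by_addr):
    """Return a copy of labels_by_addr in which every label name is unique.

    labels_by_addr maps an address to the label set there; names may repeat.
    A repeated name gets the lowest unused number appended (an assumption).
    """
    used = set()
    unique = {}
    for addr in sorted(labels_by_addr):
        base = labels_by_addr[addr]
        name = base
        suffix = 0
        while name in used:
            suffix += 1
            name = base + str(suffix)
        used.add(name)
        unique[addr] = name
    return unique


# Hypothetical addresses for the two @LOOP labels in the listing.
labels = {0x1000: "@LOOP", 0x1002: "@LOOP"}
unique = uniquify_labels(labels)

# Branches are tracked by target address, not by name, so each operand
# simply picks up whatever name its target ended up with.
branch_targets = [0x1002, 0x1000]
operands = [unique[target] for target in branch_targets]
print(operands)
```
</pre>
<p>Because the operands are looked up by address, renaming a label can
never leave a branch pointing at the wrong place.</p>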
<p>One situation where non-unique labels cause difficulty is with
weak symbolic references (see next section).  For example, suppose
the above code then did this:</p>
<pre>
         LDA     #&lt;@LOOP
</pre>
<p>While it's possible to make an educated guess at which @LOOP was
meant, it's easy to get wrong.  In situations like this, it's best to
give the labels different names.</p>


<h3><a name="weak-refs">Weak Symbolic References</a></h3>

<p>Symbolic references in operands are "weak references".  If the named
symbol exists, the reference is used.  If the symbol can't be found, the
operand is formatted in hex instead.  They're called "weak" because
failing to resolve the reference isn't considered an error.</p>

<p>It's important to know this when editing a project.  Consider the
following trivial chunk of code:</p>

<pre>
1000: 4c0310     JMP     $1003
1003: ea         NOP
</pre>

<p>When you load it into SourceGen, it will be formatted like this:</p>
<pre>
         .ORG    $1000
         JMP     L1003
L1003    NOP
</pre>

<p>The analyzer found the JMP operand, and created an auto label for
address $1003.  It then created a weak reference to "L1003" in the JMP
operand.</p>

<p>If you edit the JMP instruction's operand to use the symbol "FOO", the
results are probably not what you want:</p>
<pre>
         .ORG    $1000
         JMP     $1003
         NOP
</pre>

<p>This happened because you added a weak reference to "FOO" in the operand,
but the label doesn't exist.  The operand is formatted as hex.  Because
there's no longer a reference to L1003, SourceGen removed the auto-label
as well.</p>

<p>If you set the label "FOO" on the NOP instruction, you'll see what you
probably wanted:</p>
<pre>
         .ORG    $1000
         JMP     FOO
FOO      NOP
</pre>

<p>You don't actually need the explicit reference in the JMP instruction.
If you edit the JMP operand and set it back to "Default", the code will
still look the same.  This is because SourceGen identified the numeric
reference, and automatically added a symbolic reference to the label on
the NOP instruction.</p>

<p>However, suppose you didn't actually want FOO as the operand label.
You can create a project symbol, BAR, with the value $1003, and then edit
the operand to reference BAR instead.  Your code would then look like:</p>
<pre>
BAR      .EQ     $1003
         .ORG    $1000
         JMP     BAR
FOO      NOP
</pre>

<p>If you change the value of BAR in the project symbol file, the operand
will continue to refer to it, but with an adjustment.  For example, if
you changed BAR from $1003 to $1007, the code would become:</p>
<pre>
BAR      .EQ     $1007
         .ORG    $1000
         JMP     BAR-4
FOO      NOP
</pre>
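
<p>The cases above can be summarized in a small Python sketch.  It only
models operands that carry an explicit weak reference, and it is not how
SourceGen is implemented; <code>format_operand</code> and the
<code>symbols</code> dictionary are hypothetical names used for
illustration.</p>
<pre>
```python
def format_operand(value, weak_ref, symbols):
    """Format a 16-bit operand, honoring a weak symbolic reference if possible.

    value    -- numeric operand taken from the instruction bytes
    weak_ref -- symbol name stored on the operand, or None
    symbols  -- dict mapping symbol name to value (hypothetical symbol table)
    """
    if weak_ref is None or weak_ref not in symbols:
        # An unresolved weak reference is not an error; fall back to hex.
        return "${:04x}".format(value)
    adjustment = value - symbols[weak_ref]
    if adjustment == 0:
        return weak_ref
    # The symbol exists but has a different value, so show an adjustment.
    return "{}{:+d}".format(weak_ref, adjustment)


print(format_operand(0x1003, "FOO", {}))
print(format_operand(0x1003, "FOO", {"FOO": 0x1003}))
print(format_operand(0x1003, "BAR", {"FOO": 0x1003, "BAR": 0x1007}))
```
</pre>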

<p>If you rename a label, all references to that label are updated.  For
numeric references that happens implicitly.  For explicit operand
references, the weak references are updated individually.  (Modern IDEs
call this "refactoring".)</p>
<p>If you remove a label, all of the numeric references to it will
reference something else, probably a new auto label.  Weak references
to the symbol will break and be formatted as hex, but will not be
removed.  Similarly, removing symbols from a platform or project file
will break the reference but won't modify the operands.</p>

<h3><a name="symbol-parts">Parts and Adjustments</a></h3>

<p>Sometimes you want to use part of a label, or adjust the value slightly.
(I use "adjustment" rather than "offset" to avoid confusing it with file
offsets.) Consider the following example:</p>
<pre>
1000: a910      LDA     #$10
1002: 48        PHA
1003: a906      LDA     #$06
1005: 48        PHA
1006: 60        RTS
1007: 4c3aff    JMP     $ff3a
</pre>

<p>This pushes the address of the JMP instruction ($1007) onto the stack,
and jumps to it with the RTS instruction.  However, RTS requires the
address of the byte before the target instruction, so we actually push
$1006.</p>
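
<p>The arithmetic is easy to check with a few lines of Python.  This is
only a worked example; <code>rts_push_bytes</code> is a name made up for
the illustration.</p>
<pre>
```python
def rts_push_bytes(target):
    """Return the (high, low) bytes to push so that RTS lands on target."""
    # RTS pulls an address from the stack and adds one to it, so the
    # pushed value must be one less than the real destination.
    return_addr = (target - 1) % 0x10000
    return return_addr // 256, return_addr % 256


high, low = rts_push_bytes(0x1007)
print("LDA #${:02x} / PHA / LDA #${:02x} / PHA / RTS".format(high, low))
```
</pre>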

<p>The disassembler won't know that address $1007 is code because nothing
appears to reference it.  After adding a code hint at $1007, the project
looks like this:</p>
<pre>
         LDA     #$10
         PHA
         LDA     #$06
         PHA
         RTS

         JMP     $ff3a
</pre>

<p>We set a label called "NEXT" on the JMP instruction, and then edit
the two LDA instructions to reference the high and low parts, yielding:</p>
<pre>
         .ORG    $1000
         LDA     #&gt;NEXT
         PHA
         LDA     #&lt;NEXT-1
         PHA
         RTS

NEXT     JMP     $ff3a
</pre>

<p>SourceGen will adjust label values by whatever amount is required to
generate the original value.  If the adjustment seems wrong, make sure
you're selecting the right part of the symbol.</p>
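
<p>One way to think about the adjustment is as the difference between the
operand byte and the selected part of the symbol.  The Python sketch below
works under that assumption; the function name and the "low"/"high" part
names are hypothetical, and real assemblers differ in how they evaluate
such expressions.</p>
<pre>
```python
def part_adjustment(operand, symbol_value, part):
    """Return the adjustment that makes the chosen part match the operand."""
    if part == "low":
        selected = symbol_value % 256
    elif part == "high":
        selected = (symbol_value // 256) % 256
    else:
        raise ValueError("part must be 'low' or 'high'")
    return operand - selected


NEXT = 0x1007
high_adj = part_adjustment(0x10, NEXT, "high")   # first LDA in the example
low_adj = part_adjustment(0x06, NEXT, "low")     # second LDA in the example
wrong_adj = part_adjustment(0x06, NEXT, "high")  # wrong part selected
print(high_adj, low_adj, wrong_adj)
```
</pre>
<p>Selecting the wrong part for the second LDA gives an adjustment of -10
instead of -1, which is the sort of odd-looking value that suggests the
part selection should be checked.</p>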

<p>Different assemblers use different syntaxes to form expressions.  This
is particularly noticeable in 65816 code.  You can adjust how it appears
on-screen from the app settings.</p>

<h3><a name="nearby-targets">Automatic Use of Nearby Targets</a></h3>

<p>Sometimes you want to use a symbol that doesn't match up with the
operand.  SourceGen tries to anticipate situations where that might be
the case, and apply adjustments for you.</p>

<p>Suppose you have the following:</p>
<pre>
         .ORG    $1000
         LDA     #$00
         STA     L1010
         LDA     #$20
         STA     L1011
         LDA     #$e1
         STA     L1012
         RTS

L1010    .DD1    $00
L1011    .DD1    $00
L1012    .DD1    $00
</pre>

<p>Showing stores to three different labeled addresses is fine, but
the code is actually setting up a single 24-bit address.  For clarity,
you'd like the output to reflect the fact that it's a single, multi-byte
variable.  So, if you set the label "DATA" at $1010, SourceGen removes the
nearby auto labels, and sets the numeric references to use your label:</p>

<pre>
         .ORG    $1000
         LDA     #$00
         STA     DATA
         LDA     #$20
         STA     DATA+1
         LDA     #$e1
         STA     DATA+2
         RTS

DATA     .DD1    $00
         .DD1    $00
         .DD1    $00
</pre>

<p>If you decide that you really wanted each store to have its own
label, you can set labels on the other two addresses.  SourceGen won't
search for alternate labels if the numeric reference target has a
user-defined label.</p>
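
<p>A rough Python sketch of that search follows.  It is not SourceGen's
algorithm: the function name, the backward-only search, the two-byte
limit, and the auto-label format are all assumptions chosen to reproduce
the example above.</p>
<pre>
```python
def resolve_target(addr, user_labels, max_back=2):
    """Pick an operand string for a numeric reference to addr.

    user_labels maps address to user-defined label name.  A label set
    exactly on addr always wins; otherwise nearby earlier addresses are
    tried (max_back is an assumed limit) before using an auto label.
    """
    for back in range(max_back + 1):
        name = user_labels.get(addr - back)
        if name is not None:
            if back == 0:
                return name
            return "{}+{}".format(name, back)
    return "L{:04X}".format(addr)


one_label = {0x1010: "DATA"}
print([resolve_target(a, one_label) for a in (0x1010, 0x1011, 0x1012)])

# Setting a label on $1011 stops the search for that address.
two_labels = {0x1010: "DATA", 0x1011: "MID"}
print(resolve_target(0x1011, two_labels))
```
</pre>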

<p>This is also used for self-modifying code.  For example:</p>
<pre>
1000: a9ff      LDA     #$ff
1002: 8d0610    STA     $1006
1005: 4900      EOR     #$00
</pre>

<p>The above changes the <code>EOR #$00</code> instruction to
<code>EOR #$ff</code>.  The operand target is $1006, but we can't
put a label there because it's in the middle of the instruction.  So
SourceGen puts a label at $1005 and adjusts it:</p>
<pre>
         LDA     #$ff
         STA     L1005+1
L1005    EOR     #$00
</pre>

<p>If you really don't like the way this works, you can disable the
search for nearby targets entirely from the
<a href="settings.html#project-properties">project properties</a>.
Self-modifying code will always be adjusted because of the limitation
on mid-instruction labels.</p>


<h2><a name="width-disambiguation">Width Disambiguation</a></h2>

<p>It's possible to interpret certain instructions in multiple ways.
For example, "LDA $0000" might be an absolute load from a 16-bit
address, or it might be a direct page load from an 8-bit address.
Humans can infer from the fact that it was written with a 4-digit address
that it's meant to be absolute, but assemblers often treat operands
purely as numbers, and would just see "LDA 0".  Common practice is to
use the shortest instruction possible.</p>
<p>Every assembler seems to address the problem in a slightly different
way.  Some use opcode suffixes, others use operand prefixes, some
allow both.  You can configure how they appear in the
<a href="settings.html#app-settings">application settings</a>.</p>
<p>SourceGen will only add width disambiguators to opcodes or operands when
they are needed, with one exception: the opcode suffix for long
(24-bit address) operations is always applied.  This is done because some
assemblers require it, insisting on "LDAL" rather than "LDA" for an
absolute long load, and because it can make 65816 code easier to read.</p>
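
<p>The rule can be sketched in Python as follows.  This is a
simplification under stated assumptions, not SourceGen's logic: the mode
names are invented, and the <code>has_dp_form</code> flag stands in for
knowing whether the instruction has a direct page encoding at all.</p>
<pre>
```python
def needs_width_hint(mode, operand, has_dp_form=True):
    """Decide whether an instruction needs a width disambiguator.

    mode is "dp", "abs", or "long" (hypothetical names for the sketch).
    """
    if mode == "long":
        # The long suffix is always applied, needed or not.
        return True
    if mode == "abs":
        # An absolute instruction whose operand fits in one byte would be
        # shortened to direct page by an assembler that picks the
        # shortest form, so it needs a hint.
        return has_dp_form and operand in range(0x100)
    return False


print(needs_width_hint("abs", 0x0000))
print(needs_width_hint("abs", 0x1234))
print(needs_width_hint("long", 0xe12000))
```
</pre>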


<h2><a name="pseudo-ops">Data and Directive Pseudo-Opcodes</a></h2>

<p>The on-screen code list shows assembler directives that are similar
to what the various cross-assemblers provide.  The actual directives
generated for a given assembler may match exactly or be totally different.
The idea is to represent the concept behind the directive, then let the
code generator figure out the implementation details.</p>

<p>There are seven assembler directives that appear in the code list:</p>
<ul>
  <li>.EQ - defines a symbol's value.  These are generated automatically
    when an operand that matches a platform or project symbol is found.</li>
  <li>.VAR - defines a local variable.  These are generated for
    local variable tables.</li>
  <li>.ORG - changes the target address.</li>
  <li>.RWID - specifies the width of the accumulator and index registers
    (65816 only).  Note this doesn't change the actual width, just tells
    the assembler that the width has changed.</li>
  <li>.DBANK - specifies what value the Data Bank Register holds
    (65816 only).  Used when matching operands to labels.</li>
  <li>.JUNK - indicates that the data in a range of bytes is irrelevant.
    (When generating sources, this will become .FILL or .BULK
    depending on the contents of the memory region and the assembler's
    capabilities.)</li>
  <li>.ALIGN - a special case of .JUNK that indicates the irrelevant
    bytes exist to force alignment to a memory boundary (usually a
    256-byte page).  Depending on the memory contents, it may be possible
    to output this as an assembler-specific alignment directive.</li>
</ul>

<p>Every data item is represented by a pseudo-op.  Some of them may
represent hundreds of bytes and span multiple lines.</p>
<ul>
  <li>.DD1, .DD2, .DD3, .DD4 - basic "define data" op.  A 1-4 byte
    little-endian value.</li>
  <li>.DBD2, .DBD3, .DBD4 - "define big-endian data".  2-4 bytes of
    big-endian data.  (The 3- and 4-byte versions are not currently
    available in the UI, since they're very unusual and few assemblers
    support them.)</li>
  <li>.BULK - data packed in as compact a form as the assembler allows.
    Useful for chunks of graphics data.</li>
  <li>.FILL - a series of identical bytes.  The operand
    has two parts, the byte count followed by the byte value.</li>
</ul>
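
<p>The difference between the little-endian and big-endian forms is easy
to see in a short Python sketch.  The function name and the exact output
spacing are assumptions made for the illustration.</p>
<pre>
```python
def define_data(data, big_endian=False):
    """Return pseudo-op text for a 1-4 byte value (illustrative only)."""
    width = len(data)
    if width not in (1, 2, 3, 4):
        raise ValueError("expected 1-4 bytes")
    if big_endian:
        if width == 1:
            raise ValueError("there is no 1-byte big-endian pseudo-op")
        op = ".DBD{}".format(width)
        value = int.from_bytes(data, "big")
    else:
        op = ".DD{}".format(width)
        value = int.from_bytes(data, "little")
    digits = format(value, "0{}x".format(width * 2))
    return "{} ${}".format(op, digits)


print(define_data(bytes([0x34, 0x12])))
print(define_data(bytes([0x12, 0x34]), big_endian=True))
```
</pre>
<p>Both calls describe the value $1234; only the byte order in memory
differs.</p>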

<p>In addition, several pseudo-ops are defined for string constants:</p>
<ul>
  <li>.STR - basic character string.</li>
  <li>.RSTR - string in reverse order.</li>
  <li>.ZSTR - null-terminated string.</li>
  <li>.DSTR - Dextral Character Inverted string.  The high bit of the
    last byte is flipped.</li>
  <li>.L1STR - string prefixed with a length byte.</li>
  <li>.L2STR - string prefixed with a length word.</li>
</ul>
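
<p>To make the byte layouts concrete, here is a Python sketch that
encodes a string in each form.  It assumes plain ASCII text and a
little-endian length word, neither of which is stated above, and
<code>encode_string</code> is a hypothetical name.</p>
<pre>
```python
def encode_string(kind, text):
    """Return the bytes for text in the layout named by the pseudo-op."""
    data = text.encode("ascii")
    if kind == ".STR":
        return data
    if kind == ".RSTR":
        return data[::-1]
    if kind == ".ZSTR":
        return data + b"\x00"
    if kind == ".DSTR":
        if not data:
            raise ValueError("a DCI string needs at least one character")
        # Flip the high bit of the last byte only.
        return data[:-1] + bytes([data[-1] ^ 0x80])
    if kind == ".L1STR":
        return bytes([len(data)]) + data
    if kind == ".L2STR":
        # Little-endian length word (an assumption for this sketch).
        return len(data).to_bytes(2, "little") + data
    raise ValueError("unknown pseudo-op: " + kind)


for kind in (".STR", ".RSTR", ".ZSTR", ".DSTR", ".L1STR", ".L2STR"):
    print(kind, encode_string(kind, "HI").hex())
```
</pre>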

<p>You can configure the pseudo-ops to look more like what your
favorite assembler uses in the
<a href="settings.html#appset-pseudoop">Pseudo-Op</a> tab in the
application settings.</p>

<p>String constants start and end with delimiter characters, typically
single or double quotes.  You can configure the delimiters differently
for each character encoding, so that it's obvious whether the text is
in ASCII or PETSCII.  See the
<a href="settings.html#appset-textdelim">Text Delimiters</a> tab in
the application settings.</p>


</div>

<div id="footer">
<p><a href="index.html">Back to index</a></p>
</div>
</body>
<!-- Copyright 2018 faddenSoft -->
</html>