<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="stylesheet" href="main.css"/>
<title>Intro - 6502bench SourceGen</title>
</head>

<body>
<div id="content">
<h1>SourceGen: Intro</h1>
<p><a href="index.html">Back to index</a></p>

<h2 id="overview">Overview</h2>

<p>SourceGen converts 6502/65C02/65816 machine-language programs to
assembly-language source.</p>

<p>SourceGen has two purposes. The first is to be a really nice
disassembler for the 6502 and related CPUs. Code tracing with status
flag tracking makes it easier to separate the code from the data,
automatic formatting of character strings and filled-data areas helps
get the data regions sorted out, and modern IDE-style features like
cross-reference generation and color-highlighted bookmarks help
navigate the code while trying to figure out what it does. A
disassembler should help you understand the code, not just dump the
instructions to a text file.</p>
<p>The computer I built back in 2014 has a 4GHz CPU and 8GB of RAM. I
figured we should put the power of modern computing hardware to good use.</p>

<p>The second purpose is to facilitate sharing and collaboration. Most
disassemblers generate output for a specific assembler, or in a way that's
generic enough to match most any assembler; either way, you're left with
a text file in somebody's idea of the "correct" format. SourceGen keeps
everything in an assembler-neutral format, and provides numerous options
for customizing the display, so that multiple people viewing the same
project can each do so with the conventions they are accustomed to.
Code and data operands can be formatted in various numeric formats or
as symbols.
The project file uses a text format that is fairly diff-friendly, so
sharing projects through git works reasonably well. If you want source
code you can assemble, SourceGen will generate code optimized for the
assembler of your choice.</p>

<p>The sharing and collaboration ideas only work if the formatting
capabilities within SourceGen are sufficiently flexible. If you need to
generate assembly source and tweak it a bunch to express the intent of
the original code, then passing a SourceGen project around won't work.
This sort of thing is a bit outside the bounds of what a typical
disassembler does, so it remains to be seen whether SourceGen succeeds at
what it's trying to do, and also whether what it's trying to do is
something that people actually want.</p>

<p>You can get started by watching a
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and working through
the <a href="https://6502bench.com/sgtutorial/">tutorials</a>.</p>


<h2 id="fundamental-concepts">Fundamentals</h2>

<p>The next few sections present some general concepts and terminology. The
rest of the documentation assumes you've read and understood this.</p>
<p>It will be helpful if you already understand something about the 6502
instruction set and assembly-language programming, but disassembling
other programs is actually a pretty good way to learn how to code in
assembly. You will need to be familiar with hexadecimal numbers and
general programming concepts to make sense of this, however.</p>

<h3 id="begin">About 6502 Code</h3>

<p>For brevity's sake, "6502 code" should be taken to mean "code for
the 6502 CPU or any of its derivatives, including but not limited to
the 65C02 and 65816". So let's talk about 6502 code.</p>

<p>Code usually arrives in a big binary blob. Some of it will be
instructions, some of it will be data, some will be empty space used
for variable storage. Part of the challenge of disassembly is
identifying which parts of the file contain which.</p>

<p>Much of the code you'll find for the 6502 was written by humans,
rather than generated by a compiler, which means it won't conform to a
standard set of conventions. However, most programmers will use
subroutines, which can be identified and analyzed in isolation. Subroutines
are often interspersed with variable storage, referred to as a "stash".
Variables and constants may be single-byte or multi-byte, the latter
typically in little-endian byte order.</p>

<p>Much of the data in a typical program is read-only, often in the
form of graphics or character string data. Graphics can be difficult
to recognize automatically, but strings can be identified with a
reasonable degree of confidence. Address tables, which are a collection
of addresses to other things, are also fairly common.</p>

<p>A simple disassembler would start at the top of the file and just
start converting bytes to instructions. Unfortunately, there's no reliable
way to tell the difference between instructions, data, and variable
stashes. When the converter hits data bytes it'll start generating
instructions that won't make sense. You'll have another problem when the
data ends and code resumes: 6502 instructions are variable-length, so if
the last byte of the data area looks like the opcode of a three-byte
instruction, the first two bytes of the next instruction area will be
gobbled up.</p>
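<p>The "gobbling" problem can be illustrated with a minimal Python sketch
of a naive top-to-bottom converter. This is only an illustration, not how
SourceGen works: the <code>LENGTHS</code> table is a hypothetical,
heavily truncated opcode-length table, and <code>linear_sweep</code> is a
made-up name.</p>

```python
# Hypothetical, heavily truncated table: opcode -> instruction length in bytes.
LENGTHS = {
    0xA9: 2,  # LDA #imm
    0xAD: 3,  # LDA abs
    0x8D: 3,  # STA abs
    0x20: 3,  # JSR abs
    0x60: 1,  # RTS
}

def linear_sweep(data, start=0):
    """Decode from `start` with no knowledge of code vs. data.

    Returns a list of (offset, length) pairs, one per decoded item.
    """
    out = []
    offset = start
    while offset < len(data):
        # Bytes missing from the table are treated as one-byte items.
        length = LENGTHS.get(data[offset], 1)
        out.append((offset, length))
        offset += length
    return out

# One data byte ($AD, which looks like the opcode for a three-byte
# "LDA abs") followed by real code: LDA #$01 at +000001, RTS at +000003.
blob = bytes([0xAD, 0xA9, 0x01, 0x60])

print(linear_sweep(blob))      # starts on the data byte
print(linear_sweep(blob, 1))   # starts where the code really begins
```

<p>Starting at the data byte, the sweep treats offsets 0 through 2 as a
single instruction, so the real <code>LDA #$01</code> never appears in the
output. Starting at offset 1 yields the intended two instructions.</p>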

<p>To make things even more difficult (sometimes deliberately), programmers
will sometimes use a trick where they "embed" an instruction
inside another instruction. This allows code to branch to two different
entry points, one of which will set a flag or load a register, and then
continue on to common code.</p>
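<p>A common form of this trick hides a two-byte instruction inside the
operand of a three-byte <code>BIT</code>. The Python sketch below decodes
the same eight bytes from two entry points to show the overlap. The
<code>OPS</code> table and <code>decode_from</code> are hypothetical names
used only for this illustration.</p>

```python
# Hypothetical, truncated table: opcode -> (mnemonic, length in bytes).
OPS = {
    0xA9: ("LDA #imm", 2),
    0x2C: ("BIT abs", 3),
    0x85: ("STA zp", 2),
    0x60: ("RTS", 1),
}

def decode_from(data, start):
    """Decode straight-line code from `start` until RTS.

    Returns a list of (offset, mnemonic) pairs.
    """
    out = []
    offset = start
    while offset < len(data):
        mnemonic, length = OPS[data[offset]]
        out.append((offset, mnemonic))
        if mnemonic == "RTS":
            break
        offset += length
    return out

# +000000  A9 00     LDA #$00
# +000002  2C A9 01  BIT $01A9   <- operand bytes are also "LDA #$01"
# +000005  85 10     STA $10
# +000007  60        RTS
code = bytes([0xA9, 0x00, 0x2C, 0xA9, 0x01, 0x85, 0x10, 0x60])

print(decode_from(code, 0))  # entry 1: stores $00
print(decode_from(code, 3))  # entry 2: stores $01
```

<p>Entering at offset 0 loads $00 and executes a <code>BIT</code> that leaves
the accumulator alone; entering at offset 3 loads $01 instead. Both paths
reach the shared <code>STA</code> at offset 5.</p>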

<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
The called subroutine pulls the caller's address off the stack, uses it to
access the parameters, then pushes the address back on after modifying it to
point to an address past the inline data. This can be very confusing
for the disassembler, which will try to interpret the inline data as
instructions.</p>
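<p>As a rough sketch of the address arithmetic, assume the inline data is
a null-terminated string (one common convention; others use a fixed length
or a length byte). On the 6502, <code>JSR</code> pushes the address of its
own last byte, and <code>RTS</code> resumes one byte past whatever address
it pulls. The function name and the sample bytes are made up for this
example.</p>

```python
def resume_after_inline_string(memory, jsr_addr):
    """Return the address where execution resumes after a JSR whose
    inline data is a null-terminated string (an assumed convention)."""
    pushed = jsr_addr + 2        # JSR pushes the address of its last byte
    ptr = pushed + 1             # first byte of the inline data
    while memory[ptr] != 0x00:   # the subroutine walks the string
        ptr += 1
    # The subroutine pushes the terminator's address; RTS adds one.
    return ptr + 1

# $0000  20 00 30  JSR $3000
# $0003  48 49 00  "HI", null terminator (inline data)
# $0006  60        RTS
memory = bytes([0x20, 0x00, 0x30, 0x48, 0x49, 0x00, 0x60])

print(hex(resume_after_inline_string(memory, 0x0000)))
```

<p>A disassembler that doesn't know about the convention would continue
decoding at $0003 and interpret the string bytes as instructions.</p>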

<p>Sometimes code is loaded at one location, then moved to another and
executed there. If you're disassembling an executing program you don't
have to worry about this, but if you're disassembling the binary from the
loadable file on disk then you need to track the address changes. The
address is communicated to the assembler with a "pseudo-opcode", usually
something like "ORG" (short for "origin"). Other pseudo-op directives
are used to define things like constants and (for 65816 code)
register widths.</p>

<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
range from $0000 to $ffff. (I'm going to write hex values with a
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
that's what 6502 systems commonly used.) The 65816 has a 24-bit address
space, but it's not contiguous -- a branch that extends past the end will
wrap around to the start of the 64KiB "bank". For 16-bit instruction
operands, the bank is identified for instruction and data addresses
by the program bank register and the data bank register, respectively.
The disassembler can't always discern the value of the data bank
register through static analysis, so some user input may be required.</p>
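<p>The bank wrap-around can be shown with a small Python sketch.
<code>branch_target</code> is a hypothetical helper that computes the
destination of a two-byte relative branch on the 65816, keeping the result
inside the current 64KiB bank.</p>

```python
def branch_target(bank, pc, rel):
    """Target of a two-byte relative branch located at bank:pc.

    `rel` is the signed 8-bit branch offset (-128..127).  Only the low
    16 bits change, so the result wraps within the bank.
    """
    return (bank << 16) | ((pc + 2 + rel) & 0xFFFF)

# A branch at $01/FFF0 with offset +$20 wraps to $01/0012, not $02/0012.
print(f"${branch_target(0x01, 0xFFF0, 0x20):06x}")
```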

<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
in it. Some of the flags determine whether a conditional branch is taken
or not, which is important because some branches appear to be conditional
but actually are always or never taken in practice. The disassembler needs
to be able to figure this out so that it doesn't try to disassemble the
bytes that follow an always-taken branch.
A more significant concern is the M and X flags found on the 65802/65816,
which determine the width of the registers and of immediate load
instructions. If you don't know what state the flags are in, you can't
know whether <code>LDA #value</code> is two bytes or three, and the
disassembly of the instruction stream will come out wrong.</p>
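<p>To make the M-flag dependency concrete, here is a short Python sketch.
The function name is invented for this example; the encoding of the flag
(1 for an 8-bit accumulator, 0 for 16-bit, <code>None</code> for unknown)
is likewise just a convenient assumption.</p>

```python
def lda_immediate_length(m_flag):
    """Length in bytes of LDA #value (opcode $A9) on the 65816.

    m_flag: 1 = 8-bit accumulator, 0 = 16-bit accumulator, None = unknown.
    """
    if m_flag is None:
        raise ValueError("M flag unknown; instruction length is ambiguous")
    return 2 if m_flag else 3

# The same bytes decode differently depending on M:
#   M=1: A9 34     LDA #$34    (next instruction starts at +000002)
#   M=0: A9 34 12  LDA #$1234  (next instruction starts at +000003)
stream = bytes([0xA9, 0x34, 0x12, 0x60])

for m in (1, 0):
    print(f"M={m}: next instruction at +{lda_immediate_length(m):06x}")
```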

<p>Some addresses correspond to memory-mapped I/O, rather than RAM or ROM.
Accessing the address can have side effects, like changing between text
and graphics modes. Sometimes reading and writing have different effects.
For example, on later models of the Apple II, reading from
$C000 returns the most recently hit key, while writing to $C000 changes
how 80-column display memory is mapped.</p>
<p>On a few systems, such as the Atari 2600, RAM, ROM, and registers can
appear at multiple locations, "mirrored" across the address space.</p>


<h3 id="charenc">Character Encoding</h3>

<p>The American Standard Code for Information Interchange (ASCII) was
developed in the 1960s, and became widely used as the means for representing
text data on a computer. It's compatible with Unicode, in that the
binary representation of an ASCII string is exactly the same when
expressed as a Unicode string with UTF-8 encoding.</p>
<p>Not all 6502-based computers used ASCII, notably those from Commodore
International (e.g. PET, VIC-20, 64, 128), which used variants
collectively known as "PETSCII". PETSCII had most of the same symbols,
but rearranged them, and added a number of graphical symbols. This was
further complicated by the use of two different character sets, one of
which dropped lower-case letters in favor of additional symbols, and
the use of a separate encoding for characters stored in the text frame
buffer ("screen codes").</p>
<p>Apple II computers were based on ASCII, but tended to store bytes
with the high bit set rather than clear. This is known as "high ASCII".</p>
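<p>Converting between the two is a matter of setting or clearing bit 7.
A minimal Python sketch, with function names chosen for this example:</p>

```python
def from_high_ascii(data):
    """Clear bit 7 of every byte and decode the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")

def to_high_ascii(text):
    """Encode text as ASCII and set bit 7 of every byte."""
    return bytes(b | 0x80 for b in text.encode("ascii"))

raw = bytes([0xC8, 0xC5, 0xCC, 0xCC, 0xCF])
print(from_high_ascii(raw))
```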

<p>SourceGen allows you to specify that a string is encoded with ASCII,
High ASCII, C64 PETSCII, or C64 Screen Codes. Because the goal is to
generate assembly sources for cross-assemblers, the C64 character
support is limited to the set that overlaps with ASCII.</p>
<p>For the most part only printable characters are accepted in strings,
but certain control characters are also allowed. The characters for
bell ($07), linefeed ($0a), and carriage return ($0d) are recognized as
string data, and in C64 PETSCII a number of text color and formatting
control codes are also allowed.</p>


<h3 id="sgconcepts">SourceGen Concepts</h3>

<p>As you work on a disassembled file, formatting operands and adding
comments, everything you do is saved in the project file as "metadata".
None of the data from the file being disassembled is included. This
should allow project files to be shared without violating the copyright
of the work being disassembled. (This will vary by region. Also, note
that the mere act of disassembling a piece of software may be illegal in
some cases.)</p>

<p>To avoid mix-ups where the wrong data file is used, the file's length
and CRC are stored in the project file. SourceGen will refuse to open a
project if the data file's length and CRC don't match.</p>
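<p>A check of this kind might look like the following Python sketch. It
assumes the standard CRC-32 as computed by <code>zlib.crc32</code>; the
text above doesn't say which CRC variant SourceGen uses, and the function
name is invented for the example.</p>

```python
import zlib

def data_file_matches(data, expected_length, expected_crc):
    """True if the data file has the recorded length and CRC-32.

    Assumes the standard CRC-32 from zlib; the CRC variant actually
    stored in a project file may differ.
    """
    return len(data) == expected_length and zlib.crc32(data) == expected_crc

data = bytes([0xA9, 0x01, 0x60])
recorded_length = len(data)
recorded_crc = zlib.crc32(data)

print(data_file_matches(data, recorded_length, recorded_crc))
print(data_file_matches(data + b"\x00", recorded_length, recorded_crc))
```

<p>Checking the length first is cheap and catches most mix-ups; the CRC
catches files of the same size with different contents.</p>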

<p>Most of the data in the project file is associated with a file offset.
When you create a comment, you aren't associating it with line 53, you're
associating it with the 127th byte in the file. This ensures that, as the
project evolves, the comment you wrote is always connected to the
same instruction or data item. This also means you can't have two
comments on the same line -- each offset only has room for one. By
convention, file offsets are always shown as a six-digit hexadecimal value
with a leading '+', e.g. "+0012ab". This makes it easy to distinguish
between an address and an offset.</p>
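<p>The two notations can be produced with a couple of Python format
strings. The helper names are made up for this example, and the four-digit
address width is an assumption that fits 16-bit addresses.</p>

```python
def format_offset(offset):
    """File offset: leading '+', six lower-case hex digits."""
    return f"+{offset:06x}"

def format_address(addr):
    """16-bit address: leading '$', four lower-case hex digits."""
    return f"${addr:04x}"

print(format_offset(0x12AB), format_address(0x12AB))
```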

<p>Instruction and data operands can be formatted in various ways. The
formatting choice is associated with the first offset of the item. For
instructions the number of bytes in the operand is determined by the opcode
(and, on the 65816, the M/X status flags). For data items the length
can be anything from a single byte to the entire file. Operand formats
are not allowed to overlap.</p>

<p>When an instruction or data operand references an address, we call
it a <b>numeric reference</b>. When the target address has a label, and
the operand uses that symbol, we call that a <b>symbolic reference</b>.
SourceGen tries to establish symbolic references whenever possible,
so that the generated assembly source doesn't refer to hard-coded
locations within the program. Labels are generated automatically for
the targets of numeric references.</p>

<p>As your understanding of the disassembled code develops, you will want
to add comments explaining it. SourceGen projects have three kinds of
comments:</p>
<ol>
<li>End-of-line comments. As the name implies, these appear at the
end of a line, to the right of the opcode or operand.</li>
<li>Long comments, also known as multi-line comments. These get a
line all to themselves, and may span multiple lines.</li>
<li>Notes. Like long comments, these get a line to themselves. Unlike
long comments, these do not appear in generated assembly code. They
are a way for you to leave notes to yourself, perhaps "don't forget
to figure this out" or "this is the cool part".</li>
</ol>
<p>Every file offset can have one of each.</p>

<p>Labels and comments may disappear if you associate them with a file
offset that is in the middle of a multi-byte instruction or data item.
For example, suppose you put a long comment at offset +000010, and then
mark a 50-byte region starting at offset +000008 as an ASCII string. The
comment won't be deleted, but won't be displayed either. The same thing
can happen to labels. SourceGen will try to prevent this from happening
by splitting formatted data into sub-regions at label boundaries.</p>


<h2 id="sgintro">How SourceGen Works</h2>

<p>SourceGen employs a partial emulation technique that traces the flow
of execution through the program. Most of what a given instruction does
isn't important; only its effect on the flow of execution matters. This
makes SourceGen different from most other disassemblers, because instead
of assuming everything is code and expecting the user to separate out the
data, it assumes everything is data and asks the user to identify where the
code starts executing.</p>

<p>SourceGen uses "code start points" to tag places where execution may
begin. By default, the first byte of the file is marked as a start point.
From there, the tracing process walks through the code, pursuing all
branches. In many cases, if you tag all external entry points, SourceGen
will automatically find all executable code and separate it from variable
storage and data areas.</p>
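<p>The general idea of tracing from start points can be sketched in
Python. This is a simplified illustration of the technique, not
SourceGen's actual implementation: the <code>OPCODES</code> table is a
hypothetical six-entry subset, status flags are ignored, and
<code>trace</code> and <code>org</code> (the load address of the file) are
names invented for the example.</p>

```python
# Hypothetical, truncated table: opcode -> (length, effect on control flow).
OPCODES = {
    0xA9: (2, "next"),    # LDA #imm
    0x8D: (3, "next"),    # STA abs
    0xD0: (2, "branch"),  # BNE rel
    0x4C: (3, "jump"),    # JMP abs
    0x20: (3, "call"),    # JSR abs
    0x60: (1, "stop"),    # RTS
}

def trace(data, org, starts=(0,)):
    """Return the set of file offsets reached as code from `starts`."""
    visited = set()   # offsets where an instruction starts
    code = set()      # every offset covered by an instruction
    work = list(starts)
    while work:
        off = work.pop()
        # Targets outside the file fail the range check and are skipped.
        while 0 <= off < len(data) and off not in visited:
            entry = OPCODES.get(data[off])
            if entry is None or off + entry[0] > len(data):
                break                       # unknown or truncated: give up
            length, kind = entry
            visited.add(off)
            code.update(range(off, off + length))
            if kind == "branch":
                rel = data[off + 1]
                if rel >= 0x80:
                    rel -= 0x100            # sign-extend the 8-bit offset
                work.append(off + 2 + rel)  # taken path; fall through below
            elif kind in ("jump", "call"):
                target = data[off + 1] | (data[off + 2] << 8)
                work.append(target - org)   # convert address to file offset
                if kind == "jump":
                    break                   # JMP never falls through
            elif kind == "stop":
                break
            off += length
    return code

# Loaded at $1000:
# +000000  20 07 10  JSR $1007
# +000003  60        RTS
# +000004  48 49 00  data
# +000007  A9 01     LDA #$01
# +000009  60        RTS
data = bytes([0x20, 0x07, 0x10, 0x60, 0x48, 0x49, 0x00, 0xA9, 0x01, 0x60])
code = trace(data, org=0x1000)
print(sorted(set(range(len(data))) - code))  # offsets left as data
```

<p>Everything starts out as data; only offsets reachable from a start
point are reclassified as code, so the three bytes at +000004 are never
decoded as instructions.</p>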

<p>As noted earlier, tracking the processor status flags can make the
analysis more accurate. Identifying situations where a branch instruction
is always or never taken avoids mis-categorizing a data region as code.
On the 65816, it's absolutely crucial to track the M/X flags, since those
affect the width of instructions. SourceGen tracks the value of the
processor flags at every instruction, blending sets of flags together when
multiple paths of execution converge.</p>
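<p>One plausible way to blend flags, shown here as a Python sketch, is to
treat each flag as tri-state (0, 1, or unknown) and keep a value only when
both incoming paths agree. This is an assumption about the general
approach, not a description of SourceGen's internals; the dictionary
representation and the function name are invented for the example.</p>

```python
def merge_flags(a, b):
    """Blend two flag sets where execution paths converge.

    Each flag is 0, 1, or None (unknown).  A flag keeps its value only
    if both paths agree; otherwise it becomes unknown.
    """
    return {name: (a[name] if a[name] == b[name] else None) for name in a}

path1 = {"C": 1, "Z": 0, "M": 1}
path2 = {"C": 1, "Z": 1, "M": 1}
merged = merge_flags(path1, path2)
print(merged)
```

<p>With the merged state above, a branch-on-carry-clear is known to be
never taken, while a branch on Z must be treated as going either way.</p>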

<p>Once instructions and data have been separated, the instruction operands
can be examined. Branches, loads, and stores that reference an address
that falls inside the address space covered by the file can be replaced
with a symbol. Operands that refer to addresses outside the file, such
as ROM or operating system routines, can be replaced with a symbol defined
by an equate directive.</p>

<p>(For more details on how this works, see the
<a href="analysis.html">analysis appendix</a>.)</p>

</div>

<div id="footer">
<p><a href="index.html">Back to index</a></p>
</div>
</body>
<!-- Copyright 2018 faddenSoft -->
</html>