mirror of
https://github.com/fadden/6502bench.git
synced 2025-01-21 21:32:09 +00:00
ed4cc84782
Move the SourceGen manual to a subdirectory in "docs", so that it can be accessed directly from the 6502bench web site. The place where it's installed in the distribution doesn't change.
293 lines
15 KiB
HTML
293 lines
15 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<head>
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
<title>Intro - 6502bench SourceGen</title>
|
|
</head>
|
|
|
|
<body>
|
|
<div id="content">
|
|
<h1>6502bench SourceGen: Intro</h1>
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
<h2><a name="overview">Overview</a></h2>
|
|
|
|
<p>SourceGen converts 6502/65C02/65816 machine-language programs to
|
|
assembly-language source.</p>
|
|
|
|
<p>SourceGen has two purposes. The first is to be a really nice
|
|
disassembler for the 6502 and related CPUs. Code tracing with status
|
|
flag tracking makes it easier to separate the code from the data,
|
|
automatic formatting of character strings and filled-data areas helps
|
|
get the data regions sorted out, and modern IDE-style features like
|
|
cross-reference generation and color-highlighted bookmarks help
|
|
navigate the code while trying to figure out what it does. A
|
|
disassembler should help you understand the code, not just dump the
|
|
instructions to a text file.</p>
|
|
<p>The computer I built back in 2014 has a 4GHz CPU and 8GB of RAM. I
|
|
figured we should put the power of modern computing hardware to good use.</p>
|
|
|
|
<p>The second purpose is to facilitate sharing and collaboration. Most
|
|
disassemblers generate output for a specific assembler, or in a way that's
|
|
generic enough to match most any assembler; either way, you're left with
|
|
a text file in somebody's idea of the "correct" format. SourceGen keeps
|
|
everything in an assembler-neutral format, and provides numerous options
|
|
for customizing the display, so that multiple people viewing the same
|
|
project can each do so with the conventions they are accustomed to.
|
|
Code and data operands can be formatted in various numeric formats or
|
|
as symbols.
|
|
The project file uses a text format that is fairly diff-friendly, so
|
|
sharing projects through git works reasonably well. If you want source
|
|
code you can assemble, SourceGen will generate code optimized for the
|
|
assembler of your choice.</p>
|
|
|
|
<p>The sharing and collaboration ideas only work if the formatting
|
|
capabilities within SourceGen are sufficiently flexible. If you need to
|
|
generate assembly source and tweak it a bunch to express the intent of
|
|
the original code, then passing a SourceGen project around won't work.
|
|
This sort of thing is a bit outside the bounds of what a typical
|
|
disassembler does, so it remains to be seen whether SourceGen succeeds at
|
|
what it's trying to do, and also whether what it's trying to do is
|
|
something that people actually want.</p>
|
|
|
|
<p>You can get started by watching a
|
|
<a href="https://youtu.be/dalISyBPQq8">demo video</a> and working through
|
|
the <a href="https://6502bench.com/sgtutorial/">tutorials</a>.</p>
|
|
|
|
|
|
<h2><a name="fundamental-concepts">Fundamentals</a></h2>
|
|
|
|
<p>The next few sections present some general concepts and terminology. The
|
|
rest of the documentation assumes you've read and understood this.</p>
|
|
<p>It will be helpful if you already understand something about the 6502
|
|
instruction set and assembly-language programming, but disassembling
|
|
other programs is actually a pretty good way to learn how to code in
|
|
assembly. You will need to be familiar with hexadecimal numbers and
|
|
general programming concepts to make sense of this, however.</p>
|
|
|
|
<h3><a name="begin">About 6502 Code</a></h3>
|
|
|
|
<p>For brevity's sake, "6502 code" should be taken to mean "code for
|
|
the 6502 CPU or any of its derivatives, including but not limited to
|
|
the 65C02 and 65816". So let's talk about 6502 code.</p>
|
|
|
|
<p>Code usually arrives in a big binary blob. Some of it will be
|
|
instructions, some of it will be data, some will be empty space used
|
|
for variable storage. Part of the challenge of disassembly is
|
|
identifying which parts of the file contain which.</p>
|
|
|
|
<p>Much of the code you'll find for the 6502 was written by humans,
|
|
rather than generated by a compiler, which means it won't conform to a
|
|
standard set of conventions. However, most programmers will use
|
|
subroutines, which can be identified and analyzed in isolation. Subroutines
|
|
are often interspersed with variable storage, referred to as a "stash".
|
|
Variables and constants may be single-byte or multi-byte, the latter
|
|
typically in little-endian byte order.</p>
|
|
|
|
<p>Much of the data in a typical program is read-only, often in the
|
|
form of graphics or character string data. Graphics can be difficult
|
|
to recognize automatically, but strings can be identified with a
|
|
reasonable degree of confidence. Address tables, which are a collection
|
|
of addresses to other things, are also fairly common.</p>
|
|
|
|
<p>A simple disassembler would start at the top of the file and just
|
|
start converting bytes to instructions. Unfortunately there's no reliable
|
|
way to tell the difference between instructions, data, and variable
|
|
stashes. When the converter hits data bytes it'll start generating
|
|
instructions that won't make sense. You'll have another problem when the
|
|
data ends and code resumes: 6502 instructions are variable-length, so if
|
|
the last byte of the data area appears to be a three-byte instruction,
|
|
the first two bytes of the next instruction area will be gobbled up.</p>
|
|
|
|
<p>To make things even more difficult (sometimes deliberately), programmers
|
|
will sometimes use a trick where they "embed" an instruction
|
|
inside another instruction. This allows code to branch to two different
|
|
entry points, one of which will set a flag or load a register, and then
|
|
continue on to common code.</p>
|
|
|
|
<p>Another trick is to embed "inline data" after a JSR or JSL instruction.
|
|
The called subroutine pulls the caller's address off the stack, uses it to
|
|
access the parameters, then pushes the address back on after modifying it to
|
|
point to an address past the inline data. This can be very confusing
|
|
for the disassembler, which will try to interpret the inline data as
|
|
instructions.</p>
|
|
|
|
<p>Sometimes code is loaded at one location, then moved to another and
|
|
executed there. If you're disassembling an executing program you don't
|
|
have to worry about this, but if you're disassembling the binary from the
|
|
loadable file on disk then you need to track the address changes. The
|
|
address is communicated to the assembler with a "pseudo-opcode", usually
|
|
something like "ORG" (short for "origin"). Other pseudo-op directives
|
|
are used to define things like constants and (for 65816 code)
|
|
register widths.</p>
|
|
|
|
<p>The 8-bit CPUs have a 16-bit (64KiB) address space, so addresses can
|
|
range from $0000 to $ffff. (I'm going to write hex values with a
|
|
preceding '$', like "$12ab", rather than "0x12ab" or "12abh", because
|
|
that's what 6502 systems commonly used.) The 65816 has a 24-bit address
|
|
space, but it's not contiguous -- a branch that extends past the end will
|
|
wrap around to the start of the 64KiB "bank". For 16-bit instruction
|
|
operands, the bank is identified for instruction and data addresses
|
|
by the program bank register and the data bank register, respectively.
|
|
The disassembler can't always discern the value of the data bank
|
|
register through static analysis, so some user input may be required.</p>
|
|
|
|
<p>The 6502 has an 8-bit processor status register ("P") with a bunch of flags
|
|
in it. Some of the flags determine whether a conditional branch is taken
|
|
or not, which is important because some branches appear to be conditional
|
|
but actually are always or never taken in practice. The disassembler needs
|
|
to be able to figure this out so that it doesn't try to disassemble the
|
|
bytes that follow an always-taken branch.
|
|
A more significant concern is the M and X flags found on the 65802/65816,
|
|
which determine the width of the registers and of immediate load
|
|
instructions. If you don't know what state the flags are in, you can't
|
|
know whether <code>LDA #value</code> is two bytes or three, and the
|
|
disassembly of the instruction stream will come out wrong.</p>
|
|
|
|
<p>Some addresses correspond to memory-mapped I/O, rather than RAM or ROM.
|
|
Accessing the address can have side effects, like changing between text
|
|
and graphics modes. Sometimes reading and writing have different effects.
|
|
For example, on later models of the Apple II, reading from
|
|
$C000 returns the most recently hit key, while writing to $C000 changes
|
|
how 80-column display memory is mapped.</p>
|
|
<p>On a few systems, such as the Atari 2600, RAM, ROM, and registers can
|
|
appear at multiple locations, "mirrored" across the address space.</p>
|
|
|
|
<h3><a name="charenc">Character Encoding</a></h3>
|
|
|
|
<p>The American Standard Code for Information Interchange (ASCII) was
|
|
developed in the 1960s, and became widely used as the means for representing
|
|
text data on a computer. It's compatible with Unicode, in that the
|
|
binary representation of an ASCII string is exactly the same when
|
|
expressed as a Unicode string with UTF-8 encoding.</p>
|
|
<p>Not all 6502-based computers used ASCII, notably those from Commodore
|
|
International (e.g. PET, VIC-20, 64, 128), which used variants
|
|
collectively known as "PETSCII". PETSCII had most of the same symbols,
|
|
but rearranged them, and added a number of graphical symbols. This was
|
|
further complicated by the use of two different character sets, one of
|
|
which dropped lower-case letters in favor of additional symbols, and
|
|
the use of a separate encoding for characters stored in the text frame
|
|
buffer ("screen codes").</p>
|
|
<p>Apple II computers were based on ASCII, but tended to store bytes
|
|
with the high bit set rather than clear. This is known as "high ASCII".</p>
|
|
|
|
<p>SourceGen allows you to specify that a string is encoded with ASCII,
|
|
High ASCII, C64 PETSCII, or C64 Screen Codes. Because the goal is to
|
|
generate assembly sources for cross-assemblers, the C64 character
|
|
support is limited to the set that overlaps with ASCII.</p>
|
|
<p>For the most part only printable characters are accepted in strings,
|
|
but certain control characters are also allowed. The characters for
|
|
bell ($07), linefeed ($0a), and carriage return ($0d) are recognized as
|
|
string data, and in C64 PETSCII a number of text color and formatting
|
|
control codes are also allowed.</p>
|
|
|
|
<h3><a name="sgconcepts">SourceGen Concepts</a></h3>
|
|
|
|
<p>As you work on a disassembled file, formatting operands and adding
|
|
comments, everything you do is saved in the project file as "meta data".
|
|
None of the data from the file being disassembled is included. This
|
|
should allow project files to be shared without violating the copyright
|
|
of the work being disassembled. (This will vary by region. Also, note
|
|
that the mere act of disassembling a piece of software may be illegal in
|
|
some cases.)</p>
|
|
|
|
<p>To avoid mix-ups where the wrong data file is used, the file's length
|
|
and CRC are stored in the project file. SourceGen will refuse to open a
|
|
project if the data file's length and CRC don't match.</p>
|
|
|
|
<p>Most of the data in the project file is associated with a file offset.
|
|
When you create a comment, you aren't associating it with line 53, you're
|
|
associating it with the 127th byte in the file. This ensures that, as the
|
|
project evolves, the comment you wrote is always connected to the
|
|
same instruction or data item. This also means you can't have two
|
|
comments on the same line -- each offset only has room for one. By
|
|
convention, file offsets are always shown as a six-digit hexadecimal value
|
|
with a leading '+', e.g. "+0012ab". This makes it easy to distinguish
|
|
between an address and a offset.</p>
|
|
|
|
<p>Instruction and data operands can be formatted in various ways. The
|
|
formatting choice is associated with the first offset of the item. For
|
|
instructions the number of bytes in the operand is determined by the opcode
|
|
(and, on the 65816, the M/X status flags). For data items the length
|
|
can be a single byte or an entire file. Operand formats are not allowed
|
|
to overlap.</p>
|
|
|
|
<p>When an instruction or data operand references an address, we call
|
|
it a <b>numeric reference</b>. When the target address has a label, and
|
|
the operand uses that symbol, we call that a <b>symbolic reference</b>.
|
|
SourceGen tries to establish symbolic references whenever possible,
|
|
so that the generated assembly source doesn't refer to hard-coded
|
|
locations within the program. Labels are generated automatically for
|
|
the targets of numeric references.</p>
|
|
|
|
<p>As your understanding of the disassembled code develops, you will want
|
|
to add comments explaining it. SourceGen projects have three kinds of
|
|
comments:</p>
|
|
<ol>
|
|
<li>End-of-line comments. As the name implies, these appear at the
|
|
end of a line, to the right of the opcode or operand.</li>
|
|
<li>Long comments, also known as multi-line comments. These get a
|
|
line all to themselves, and may span multiple lines.</li>
|
|
<li>Notes. Like long comments, these get a line to themselves. Unlike
|
|
long comments, these do not appear in generated assembly code. They
|
|
are a way for you to leave notes to yourself, perhaps "don't forget
|
|
to figure this out" or "this is the cool part".</li>
|
|
</ol>
|
|
<p>Every file offset can have one of each.</p>
|
|
|
|
<p>Labels and comments may disappear if you associate them with a file
|
|
offset that is in the middle of a multi-byte instruction or data item.
|
|
For example, suppose you put a long comment at offset +000010, and then
|
|
mark a 50-byte region starting at offset +000008 as an ASCII string. The
|
|
comment won't be deleted, but won't be displayed either. The same thing
|
|
can happen to labels. SourceGen will try to prevent this from happening
|
|
by splitting formatted data into sub-regions at label boundaries.</p>
|
|
|
|
|
|
<h2><a name="sgintro">How SourceGen Works</a></h2>
|
|
|
|
<p>SourceGen employs a partial emulation technique that traces the flow
|
|
of execution through the program. Most of what a given instruction does
|
|
isn't important; only its effect on the flow of execution matters. This
|
|
makes SourceGen different from most other disassemblers, because instead
|
|
of assuming everything is code and expecting the user to separate out the
|
|
data, it assumes everything is data and asks the user to identify where the
|
|
code starts executing.</p>
|
|
|
|
<p>SourceGen uses "code start points" to tag places where execution may
|
|
begin. By default, the first byte of the file is marked as a start point.
|
|
From there, the tracing process walks through the code, pursuing all
|
|
branches. In many cases, if you tag all external entry points, SourceGen
|
|
will automatically find all executable code and separate it from variable
|
|
storage and data areas.</p>
|
|
|
|
<p>As noted earlier, tracking the processor status flags can make the
|
|
analysis more accurate. Identifying situations where a branch instruction
|
|
is always or never taken avoids mis-categorizing a data region as code.
|
|
On the 65816, it's absolutely crucial to track the M/X flags, since those
|
|
affect the width of instructions. SourceGen tracks the value of the
|
|
processor flags at every instruction, blending sets of flags together when
|
|
multiple paths of execution converge.</p>
|
|
|
|
<p>Once instructions and data have been separated, the instruction operands
|
|
can be examined. Branches, loads, and stores that reference an address
|
|
that falls inside the address space covered by the file can be replaced
|
|
with a symbol. Operands that refer to addresses outside the file, such
|
|
as ROM or operating system routines, can be replaced with a symbol defined
|
|
by an equate directive.</p>
|
|
|
|
(For more details on how this works, see the
|
|
<a href="analysis.html">analysis appendix</a>.)
|
|
|
|
</div>
|
|
|
|
<div id="footer">
|
|
<p><a href="index.html">Back to index</a></p>
|
|
</div>
|
|
</body>
|
|
<!-- Copyright 2018 faddenSoft -->
|
|
</html>
|