2018-09-28 17:05:11 +00:00
|
|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
|
|
|
|
<head>
|
|
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
|
|
<title>Instruction and Data Analysis - 6502bench SourceGen</title>
|
|
|
|
</head>
|
|
|
|
|
|
|
|
<body>
|
|
|
|
<div id=content>
|
|
|
|
<h1>6502bench SourceGen: Instruction and Data Analysis</h1>
|
|
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
|
|
|
|
<h2><a name="analysis-process">Analysis Process</a></h2>
|
|
|
|
<p>Analysis of the file data is a complex multi-step process. Some
|
|
|
|
changes to the project, such as adding a code entry point hint or
|
|
|
|
changing the CPU selection, require a full re-analysis of instructions
|
|
|
|
and data. Other changes, such as adding or removing a label, don't
|
|
|
|
affect the code tracing and only require a re-analysis of the data areas.
|
|
|
|
And some changes, such as editing a comment, only require a refresh
|
|
|
|
of the displayed lines.</p>
|
|
|
|
<p>It should be noted that none of the analysis results are stored in
|
|
|
|
the project file. Only user-supplied data, such as the locations of
|
|
|
|
code entry points and label definitions, is written to the file. This
|
|
|
|
does create the possibility that two different users might get different
|
|
|
|
results when opening the same project file with different versions of
|
|
|
|
SourceGen, but these effects are expected to be minor.</p>
|
|
|
|
|
|
|
|
<p>The analyzer has the following steps (see the <code>Analyze</code>
|
|
|
|
method in <code>DisasmProject.cs</code>):</p>
|
|
|
|
<ul>
|
|
|
|
<li>Reset the symbol table.</li>
|
|
|
|
<li>Merge platform symbols into the symbol table, loading the files
|
|
|
|
in order.</li>
|
|
|
|
<li>Merge project symbols into the symbol table, stomping on any
|
|
|
|
platform symbols that conflict.</li>
|
|
|
|
<li>Merge user label symbols into the table, stomping any previous
|
|
|
|
entries.</li>
|
|
|
|
<li>Run the code analyzer. The outcome of this is an array of analysis
|
|
|
|
attributes, or "anattribs", with one entry per byte in the file.
|
|
|
|
The Anattrib array tracks most of the state from here on. If we're
|
|
|
|
doing a partial re-analysis, this step will just clone a copy of the
|
|
|
|
Anattrib array that was made at this point in a previous run. (This
|
2018-10-04 01:03:04 +00:00
|
|
|
step is described in more detail below.)</li>
|
2018-09-28 17:05:11 +00:00
|
|
|
<li>Apply user-specified labels to Anattribs.</li>
|
|
|
|
<li>Apply user-specified format descriptors. These are the instruction
|
|
|
|
and data operand formats.</li>
|
|
|
|
<li>Run the data analyzer. This looks for patterns in uncategorized
|
|
|
|
data, and connects instruction and data operands to target offsets.
|
|
|
|
The "nearby label" stuff is handled here. All of the results are
|
|
|
|
stored in the Anattribs array. (This step is described in more
|
2018-10-04 01:03:04 +00:00
|
|
|
detail below.)</li>
|
2018-09-28 17:05:11 +00:00
|
|
|
<li>Remove hidden labels from the symbol table. These are user-specified
|
|
|
|
labels that have been placed on offsets that are in the middle of an
|
|
|
|
instruction or multi-byte data item. They can't be referenced, so we
|
|
|
|
want to pull them out of the symbol table. (Remember, symbolic
|
|
|
|
operands use "weak references", so a missing symbol just means the
|
|
|
|
operand is shown as a hex value.)</li>
|
2018-10-04 01:03:04 +00:00
|
|
|
<li>Resolve references to platform and project external symbols.
|
2018-09-28 17:05:11 +00:00
|
|
|
This sets the operand symbol in Anattrib, and adds the symbol to
|
|
|
|
the list that is displayed in .EQ directives.</li>
|
|
|
|
<li>Generate cross-reference lists. This is done for file data and
|
|
|
|
for any platform/project symbols that are referenced.</li>
|
|
|
|
<li>In a debug build, some validity checks are performed.</li>
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
<p>Once analysis is complete, a line-by-line display list is generated
|
|
|
|
by walking through the annotated file data. Most of the actual strings
|
|
|
|
aren't rendered until they're needed.</p>
|
|
|
|
|
|
|
|
|
2018-10-04 01:03:04 +00:00
|
|
|
<h3><a name="auto-format">Automatic Formatting</a></h3>
|
|
|
|
|
|
|
|
<p>Every offset in the file is marked as an instruction byte, data byte, or
|
|
|
|
inline data byte. Some offsets are also marked as the start of an instruction
|
|
|
|
or data area. The start offsets may have a format descriptor associated
|
|
|
|
with them.</p>
|
|
|
|
<p>Format descriptors have a format (like "numeric" or "string") a
|
|
|
|
sub-format (like "hexadecimal" or "null-terminated"), and a length. For
|
|
|
|
an instruction operand the length is redundant, but for a data operand it
|
|
|
|
determines the width of the numeric value or length of the string. For
|
|
|
|
this reason, instructions do not need a format descriptor, but all
|
|
|
|
data items do.</p>
|
|
|
|
<p>Symbolic references are format descriptors with a symbol attached.
|
|
|
|
The symbol reference also specifies low/high/bank.</p>
|
|
|
|
<p>Every offset marked as a start point gets its own line in the on-screen
|
|
|
|
display list. Embedded instructions are identified internally by
|
|
|
|
looking for instruction-start offsets inside instructions.</p>
|
|
|
|
|
|
|
|
<p>The Anattrib array holds the post-analysis state for every offset,
|
|
|
|
including comments and formatting, but any changes you make in the
|
|
|
|
editors are applied to the data structures that are saved in the project
|
|
|
|
file. After a change is made, a full or partial re-analysis is done to
|
|
|
|
fill out the Anattribs.</p>
|
|
|
|
<p>Consider a simple example:</p>
|
|
|
|
<pre>
|
|
|
|
.ORG $1000
|
|
|
|
JMP L1003
|
|
|
|
L1003 NOP
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
<p>We haven't formatted anything yet. The data analyzer sees that the
|
|
|
|
JMP operand is inside the file, and has no label, so it creates an
|
|
|
|
auto-label at offset +000003 and a format descriptor with a symbolic
|
|
|
|
operand reference to "L1003" at +000000.</p>
|
|
|
|
<p>Now we edit the label, changing L1003 to "FOO". This goes into the
|
|
|
|
project's "user label" list. The analyzer is
|
|
|
|
run, and applies the new "user label" to the Anattrib array. The
|
|
|
|
data analyzer finds the numeric reference in the JMP operand, and finds
|
|
|
|
a label at the target address, so it creates a symbolic operand reference
|
|
|
|
to "FOO". When the display list is generated, the symbol "FOO" appears
|
|
|
|
in both places.</p>
|
|
|
|
<p>Even though the JMP operand changed from "L1003" to "FOO", the only
|
|
|
|
change actually written to the project file is the label edit. The
|
|
|
|
contents of the Anattrib array are disposable, so it can be used to
|
|
|
|
add labels and "fix up" numeric references. Generated labels and
|
|
|
|
format descriptors are never added to the project file.</p>
|
|
|
|
|
|
|
|
<p>If the JMP operand were edited, a format descriptor would be added
|
|
|
|
to the user-specified descriptor list. During the analysis pass it would
|
|
|
|
be added to the Anattrib array at offset +000000.</p>
|
|
|
|
|
|
|
|
|
|
|
|
<h3><a name="undo-redo">Interaction With Undo/Redo</a></h3>
|
|
|
|
|
|
|
|
<p>The analysis pass always considers the current state of the user
|
|
|
|
data structures. Whether you're adding a label or removing one, the
|
|
|
|
code runs through the same set of steps. The advantage of this approach
|
|
|
|
is that the act of doing a thing, undoing a thing, and redoing a thing
|
|
|
|
are all handled the same way.</p>
|
|
|
|
<p>None of the editors modify the project data structures directly. All
|
|
|
|
changes are added to a change set, which is processed by a single function.
|
|
|
|
The change sets are kept in the undo/redo buffer indefinitely. After
|
|
|
|
the changes are made, the Anattrib array and other data structures are
|
|
|
|
regenerated.</p>
|
|
|
|
|
|
|
|
<p>Data format editing can create some tricky situations. For example,
|
|
|
|
suppose you have 8 bytes that have been formatted as two 32-bit words:
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
1000: 68690074 .dd4 $74006968
|
|
|
|
1004: 65737400 .dd4 $00747365
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
You realize these are null-terminated strings, select both words, and
|
|
|
|
reformat them:
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
1000: 686900 .zstr "hi"
|
|
|
|
1003: 74657374+ .zstr "test"
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
Seems simple enough. Under the hood, SourceGen created three changes:
|
|
|
|
<ol>
|
|
|
|
<li>At offset +000000, replace the current format descriptor (4-byte
|
|
|
|
numeric) with a 3-byte null-terminated string descriptor.</li>
|
|
|
|
<li>At offset +000003, add a new 5-byte null-terminated string
|
|
|
|
descriptor.</li>
|
|
|
|
<li>At offset +000004, remove the 4-byte numeric descriptor.</li>
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
<p>Each entry in the change set has "before" and "after" states for the
|
|
|
|
format descriptor at a specific offset. Only the state for the affected
|
|
|
|
offsets is included -- the program doesn't take a complete state snapshot
|
|
|
|
(even with the RAM on a modern system that would add up quickly). When
|
|
|
|
undoing a change, before and after are simply reversed.</p>
|
|
|
|
|
|
|
|
|
2018-09-28 17:05:11 +00:00
|
|
|
<h2><a name="code-analysis">Code Analysis</a></h2>
|
|
|
|
|
|
|
|
<p>The code tracer walks through the instructions, examining them to
|
|
|
|
determine where execution will proceed next. There are five possibilities
|
|
|
|
for every instruction:</p>
|
|
|
|
<ol>
|
|
|
|
<li>Continue. Execution always continues at the next instruction.
|
|
|
|
Examples: <code>LDA</code>, <code>STA</code>, <code>AND</code>,
|
|
|
|
<code>NOP</code>.
|
|
|
|
<li>Don't continue. The next instruction to be executed can't be
|
2018-10-04 01:03:04 +00:00
|
|
|
determined from the file data (unless you're disassembling the
|
|
|
|
system ROM around the BRK vector).
|
|
|
|
Examples: <code>RTS</code>, <code>BRK</code>.
|
2018-09-28 17:05:11 +00:00
|
|
|
<li>Branch always. The operand specifies the next instruction address.
|
|
|
|
Examples: <code>JMP</code>, <code>BRA</code>, <code>BRL</code>.
|
|
|
|
<li>Branch sometimes. Execution may continue at the operand address,
|
|
|
|
or may execute the following instruction. If we know the value of
|
|
|
|
the flags in the processor status register, we can eliminate one
|
|
|
|
possibility. Examples: <code>BCC</code>, <code>BEQ</code>,
|
|
|
|
<code>BVS</code>.
|
|
|
|
<li>Call subroutine. Execution will continue at the operand address,
|
|
|
|
and is expected to also continue at the following instruction.
|
|
|
|
Examples: <code>JSR</code>, <code>JSL</code>.
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
<p>Branch targets are added to a list. When the current run of instructions
|
2018-10-04 01:03:04 +00:00
|
|
|
is exhausted (i.e. a "don't continue" or "branch always" instruction is
|
|
|
|
reached), the next target is pulled off of the list.</p>
|
2018-09-28 17:05:11 +00:00
|
|
|
|
|
|
|
<p>The state of the processor status flags is recorded for every
|
|
|
|
instruction. When execution proceeds to the next instruction or branches
|
|
|
|
to a new address, the flags are merged with the flags at the new
|
|
|
|
location. If one execution path through a given address has the flags
|
|
|
|
in one state (say, the carry is clear), while another execution path
|
|
|
|
sees a different state (carry is set), the merged flag is
|
|
|
|
"indeterminate". Indeterminate values cannot become determinate through
|
|
|
|
a merge, but can be set by an instruction.</p>
|
|
|
|
|
|
|
|
<p>There can be multiple paths to a single address. If the analyzer
|
|
|
|
sees that an instruction has been visited before, with an identical set
|
|
|
|
of status flags, the analyzer stops pursuing that path.</p>
|
|
|
|
|
|
|
|
<p>The analyzer must always know the width of immediate load instructions
|
|
|
|
when examining 65816 code, but it's possible for the status flag values
|
|
|
|
to be indeterminate. In such a situation, short registers are assumed.
|
|
|
|
Similarly, if the carry flag is unknown when an <code>XCE</code> is
|
2018-10-04 01:03:04 +00:00
|
|
|
performed, we assume a transition to emulation mode (E=1).</p>
|
2018-09-28 17:05:11 +00:00
|
|
|
|
2018-10-04 01:03:04 +00:00
|
|
|
<p>There are three ways in which code can set a flag to a definite value:</p>
|
2018-09-28 17:05:11 +00:00
|
|
|
<ol>
|
2018-10-04 01:03:04 +00:00
|
|
|
<li>By explicit instructions, like <code>SEC</code> or
|
2018-09-28 17:05:11 +00:00
|
|
|
<code>CLD</code>.</li>
|
2018-10-04 01:03:04 +00:00
|
|
|
<li>By immediate-operand instructions. <code>LDA #$00</code> sets Z=1
|
|
|
|
and N=0. <code>ORA #$80</code> sets Z=0 and N=1.</li>
|
2018-09-28 17:05:11 +00:00
|
|
|
<li>By inference. For example, if we see a <code>BCC</code> instruction,
|
|
|
|
we know that the carry will be clear at the branch target address, and
|
|
|
|
set at the following instruction. The instruction doesn't affect the
|
2018-10-04 01:03:04 +00:00
|
|
|
value of the flag, but we know what the value will be at both
|
|
|
|
addresses.</li>
|
2018-09-28 17:05:11 +00:00
|
|
|
</ol>
|
|
|
|
<p>Self-modifying code can render spoil any of these, possibly requiring a
|
|
|
|
status flag override to get correct disassembly.</p>
|
|
|
|
|
|
|
|
<p>The instruction that is most likely to cause problems is <code>PLP</code>,
|
|
|
|
which pulls the processor status flags off of the stack. SourceGen
|
|
|
|
doesn't try to track stack contents, so it can't know what values may
|
|
|
|
be pulled. In many cases the PLP appears not long after a PHP, so
|
|
|
|
SourceGen will scan backward through the file to find the nearest PHP,
|
|
|
|
and use the status flags from that. If no PHP can be found, then all
|
|
|
|
flags are set to "indeterminate". (The boot loader in the Apple II 5.25"
|
|
|
|
floppy disk controller is an example where SourceGen gets it wrong. The
|
|
|
|
code does <code>CLC</code>/<code>PHP</code>, followed a bit later by the
|
|
|
|
<code>PLP</code>, but it's actually using the stack to pass the carry
|
|
|
|
flag around. Flagging the carry bit as indeterminate with a status flag
|
|
|
|
override on the instruction following the PLP fixes things.)</p>
|
|
|
|
|
2018-10-04 01:03:04 +00:00
|
|
|
<p>Some other things that the code analyzer can't recognize automatically:</p>
|
2018-09-28 17:05:11 +00:00
|
|
|
<ul>
|
|
|
|
<li>Jumping indirectly through an address outside the file, e.g.
|
|
|
|
storing an address in zero-page memory and jumping through it.
|
|
|
|
<li>Jumping to an address by pushing the location onto the stack,
|
|
|
|
then executing an <code>RTS</code>.
|
|
|
|
<li>Self-modifying code, e.g. overwriting a <code>JMP</code> instruction.
|
|
|
|
</ul>
|
|
|
|
<p>Sometimes the indirect jump targets are coming from a table of
|
|
|
|
addresses in the file. If so, these can be formatted as addresses,
|
|
|
|
and then the target locations hinted as code entry points.</p>
|
|
|
|
<p>The 65816 adds an additional twist: 16-bit data access instructions
|
|
|
|
use the data bank register ("B") to determine which bank to load from.
|
|
|
|
SourceGen can't determine what the value is, so it currently assumes
|
|
|
|
that it's equal to the program bank register ("K"). Handling this
|
|
|
|
correctly will require improvements to the user interface.</p>
|
|
|
|
|
|
|
|
|
2018-10-04 01:03:04 +00:00
|
|
|
<h3><a name="extension-scripts">Extension Scripts</a></h3>
|
|
|
|
|
|
|
|
<p>Extension scripts can mark data that follows a JSR or JSL as inline
|
|
|
|
data, or change the format of nearby data or instructions. The first
|
|
|
|
time a JSR/JSL instruction is encountered, all loaded extension scripts
|
|
|
|
are offered a chance to act.</p>
|
|
|
|
|
|
|
|
<p>The first script that applies a format wins. Attempts to re-format
|
|
|
|
instructions or data will fail. This rule ensure that anything explicitly
|
|
|
|
formatted by the user will not be overridden by a script.</p>
|
|
|
|
|
|
|
|
<p>If code jumps into a region that is marked as inline data, the
|
|
|
|
branch will be ignored. If an extension script tries to flag bytes
|
|
|
|
as inline data that have already been executed, the script will be
|
|
|
|
ignored. This can lead to a race condition in the analyzer if
|
|
|
|
an extension script is doing the wrong thing. (The race doesn't exist
|
|
|
|
with inline data hints specified by the user, because those are applied
|
|
|
|
before code analysis starts.)</p>
|
|
|
|
|
|
|
|
|
2018-09-28 17:05:11 +00:00
|
|
|
<h2><a name="data-analysis">Data Analysis</a></h2>
|
|
|
|
<p>The data analyzer performs two tasks. It matches operands with
|
|
|
|
offsets, and it analyzes uncategorized data. Either or both of
|
|
|
|
these can be disabled from the
|
|
|
|
<a href="settings.html#project-props">project properties</a></p>
|
|
|
|
|
|
|
|
<p>The data target analyzer examines every instruction and data operand
|
|
|
|
to see if it's referring to an offset within the data file. If the
|
2018-10-04 01:03:04 +00:00
|
|
|
target is within the file, and has a label, a format descriptor with a
|
|
|
|
weak symbolic reference to that label is added to the Anattrib array. If
|
|
|
|
the target doesn't have a label, the analyzer will either use a nearby
|
|
|
|
label, or generate a unique label and use that.</p>
|
2018-09-28 17:05:11 +00:00
|
|
|
<p>While most of the "nearby label" logic can be disabled, targets that
|
|
|
|
land in the middle of an instruction are always adjusted backward to
|
|
|
|
the instruction start. This is necessary because labels are only visible
|
|
|
|
if they're associated with the first (opcode) byte of an instruction.</p>
|
|
|
|
|
|
|
|
<p>The uncategorized data analyzer tries to find ASCII strings and
|
2018-10-04 01:03:04 +00:00
|
|
|
opportunities to use the ".FILL" operation. It breaks the file into
|
2018-09-28 17:05:11 +00:00
|
|
|
pieces, where contiguous regions hold nothing but data, are not split
|
|
|
|
across a ".ORG" directive, are not interrupted by data, and do not
|
|
|
|
contain anything that the user has chosen to format. Each region is
|
|
|
|
scanned for matching patterns. If a match is found, a format entry
|
|
|
|
is added to the Anattrib array. Otherwise, data is added as independent
|
|
|
|
byte values.</p>
|
|
|
|
|
|
|
|
|
|
|
|
</div>
|
|
|
|
|
|
|
|
<div id=footer>
|
|
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
</div>
|
|
|
|
</body>
|
|
|
|
<!-- Copyright 2018 faddenSoft -->
|
|
|
|
</html>
|