mirror of
https://github.com/fadden/6502bench.git
synced 2025-01-16 19:32:31 +00:00
28eafef27c
Extension scripts (a/k/a "plugins") can now apply any data format supported by FormatDescriptor to inline data. In particular, it can now handle variable-length inline strings. The code analyzer verifies the string structure (e.g. null-terminated strings have exactly one null byte, at the very end). Added PluginException to carry an exception back to the plugin code, for occasions when they're doing something so wrong that we just want to smack them. Added test 2022-extension-scripts to exercise the feature.
346 lines
17 KiB
HTML
346 lines
17 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|
|
|
<head>
|
|
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
<link href="main.css" rel="stylesheet" type="text/css" />
|
|
<title>Instruction and Data Analysis - 6502bench SourceGen</title>
|
|
</head>
|
|
|
|
<body>
|
|
<div id="content">
|
|
<h1>6502bench SourceGen: Instruction and Data Analysis</h1>
|
|
<p><a href="index.html">Back to index</a></p>
|
|
|
|
<p><i>This section discusses the internal workings of SourceGen. It is
|
|
not necessary to understand this to use the program.</i></p>
|
|
|
|
<h2><a name="analysis-process">Analysis Process</a></h2>
|
|
|
|
<p>Analysis of the file data is a complex multi-step process. Some
|
|
changes to the project, such as adding a code entry point hint or
|
|
changing the CPU selection, require a full re-analysis of instructions
|
|
and data. Other changes, such as adding or removing a label, don't
|
|
affect the code tracing and only require a re-analysis of the data areas.
|
|
And some changes, such as editing a comment, only require a refresh
|
|
of the displayed lines.</p>
|
|
<p>It should be noted that none of the analysis results are stored in
|
|
the project file. Only user-supplied data, such as the locations of
|
|
code entry points and label definitions, is written to the file. This
|
|
does create the possibility that two different users might get different
|
|
results when opening the same project file with different versions of
|
|
SourceGen, but these effects are expected to be minor.</p>
|
|
|
|
<p>The analyzer has the following steps (see the <code>Analyze</code>
|
|
method in <code>DisasmProject.cs</code>):</p>
|
|
<ul>
|
|
<li>Reset the symbol table.</li>
|
|
<li>Merge platform symbols into the symbol table, loading the files
|
|
in order.</li>
|
|
<li>Merge project symbols into the symbol table, stomping on any
|
|
platform symbols that conflict.</li>
|
|
<li>Merge user label symbols into the table, stomping any previous
|
|
entries.</li>
|
|
<li>Run the code analyzer. The outcome of this is an array of analysis
|
|
attributes, or "anattribs", with one entry per byte in the file.
|
|
The Anattrib array tracks most of the state from here on. If we're
|
|
doing a partial re-analysis, this step will just clone a copy of the
|
|
Anattrib array that was made at this point in a previous run. (The
|
|
code analysis pass is described in more detail below.)</li>
|
|
<li>Apply user-specified labels to Anattribs.</li>
|
|
<li>Apply user-specified format descriptors. These are the instruction
|
|
and data operand formats.</li>
|
|
<li>Run the data analyzer. This looks for patterns in uncategorized
|
|
data, and connects instruction and data operands to target offsets.
|
|
The "nearby label" stuff is handled here. Auto-labels are generated
|
|
for references to internal addresses. All of the results are
|
|
stored in the Anattribs array. (The data analysis pass is described in
|
|
more detail below.)</li>
|
|
<li>Remove hidden labels from the symbol table. These are user-specified
|
|
labels that have been placed on offsets that are in the middle of an
|
|
instruction or multi-byte data item. They can't be referenced, so we
|
|
want to pull them out of the symbol table. (Remember, symbolic
|
|
operands use "weak references", so a missing symbol just means the
|
|
operand is shown as a hex value.)</li>
|
|
<li>Resolve references to local variables. This sets the operand symbol
|
|
in Anattrib so we won't try to apply platform/project symbols to
|
|
zero-page addresses. If we somehow ended up with a variable that has
|
|
the same as a user label, we rename the variable.</li>
|
|
<li>Resolve references to platform and project external symbols.
|
|
This sets the operand symbol in Anattrib, and adds the symbol to
|
|
the list that is displayed in .EQ directives.</li>
|
|
<li>Generate cross-reference lists. This is done for internal references,
|
|
for local variables, and for any platform/project symbols that are
|
|
referenced.</li>
|
|
<li>If annotated auto-labels are enabled, the simple labels are
|
|
replaced with the annotated versions here. (This can't be done earlier
|
|
because the annotations are generated from the cross-reference data.)</li>
|
|
<li>In a debug build, some validity checks are performed.</li>
|
|
</ul>
|
|
|
|
<p>Once analysis is complete, a line-by-line display list is generated
|
|
by walking through the annotated file data. Most of the actual text
|
|
isn't rendered until they're needed. For complicated multi-line items
|
|
like string operands, the formatted text must be generated to know how
|
|
many lines it will occupy, so it's done immediately and cached for re-use
|
|
on subsequent runs.</p>
|
|
|
|
|
|
<h3><a name="auto-format">Automatic Formatting</a></h3>
|
|
|
|
<p>Every offset in the file is marked as an instruction byte, data byte, or
|
|
inline data byte. Some offsets are also marked as the start of an instruction
|
|
or data area. The start offsets may have a format descriptor associated
|
|
with them.</p>
|
|
<p>Format descriptors have a format (like "numeric" or
|
|
"null-terminated string") a sub-format (like "hexadecimal" or
|
|
"high ASCII"), and a length. For
|
|
an instruction operand the length is redundant, but for a data operand it
|
|
determines the width of the numeric value or length of the string. For
|
|
this reason, instructions do not need a format descriptor, but all
|
|
data items do.</p>
|
|
<p>Symbolic references are format descriptors with a symbol attached.
|
|
The symbol reference also specifies low/high/bank, for partial symbol
|
|
references like <code>LDA #>symbol</code>.</p>
|
|
<p>Every offset marked as a start point gets its own line in the on-screen
|
|
display list. Embedded instructions are identified internally by
|
|
looking for instruction-start offsets inside instructions.</p>
|
|
|
|
<p>The Anattrib array holds the post-analysis state for every offset,
|
|
including comments and formatting, but any changes you make in the
|
|
editors are applied to the data structures that are saved in the project
|
|
file. After a change is made, a full or partial re-analysis is done to
|
|
fill out the Anattribs.</p>
|
|
<p>Consider a simple example:</p>
|
|
<pre>
|
|
.ORG $1000
|
|
JMP L1003
|
|
L1003 NOP
|
|
</pre>
|
|
|
|
<p>We haven't explicitly formatted anything yet. The data analyzer sees
|
|
that the JMP operand is inside the file, and has no label, so it creates an
|
|
auto-label at offset +000003 and a format descriptor with a symbolic
|
|
operand reference to "L1003" at +000000.</p>
|
|
<p>Suppose we now edit the label, changing L1003 to "FOO". This goes into
|
|
the project's "user label" list. The analyzer is
|
|
run, and applies the new "user label" to the Anattrib array. The
|
|
data analyzer finds the numeric reference in the JMP operand, and finds
|
|
a label at the target address, so it creates a symbolic operand reference
|
|
to "FOO". When the display list is generated, the symbol "FOO" appears
|
|
in both places.</p>
|
|
<p>Even though the JMP operand changed from "L1003" to "FOO", the only
|
|
change actually written to the project file is the label edit. The
|
|
contents of the Anattrib array are disposable, so it can be used to
|
|
hold auto-generated labels and "fix up" numeric references. Labels and
|
|
format descriptors generated by SourceGen are never added to the
|
|
project file.</p>
|
|
|
|
<p>If the JMP operand were edited, a format descriptor would be added
|
|
to the user-specified descriptor list. During the analysis pass it would
|
|
be added to the Anattrib array at offset +000000.</p>
|
|
|
|
|
|
<h3><a name="undo-redo">Interaction With Undo/Redo</a></h3>
|
|
|
|
<p>The analysis pass always considers the current state of the user
|
|
data structures. Whether you're adding a label or removing one, the
|
|
code runs through the same set of steps. The advantage of this approach
|
|
is that the act of doing a thing, undoing a thing, and redoing a thing
|
|
are all handled the same way.</p>
|
|
<p>None of the editors modify the project data structures directly. All
|
|
changes are added to a change set, which is processed by a single
|
|
"apply changes" function. The change sets are kept in the undo/redo
|
|
buffer indefinitely. After
|
|
the changes are made, the Anattrib array and other data structures are
|
|
regenerated.</p>
|
|
|
|
<p>Data format editing can create some tricky situations. For example,
|
|
suppose you have 8 bytes that have been formatted as two 32-bit words:</p>
|
|
|
|
<pre>
|
|
1000: 68690074 .dd4 $74006968
|
|
1004: 65737400 .dd4 $00747365
|
|
</pre>
|
|
|
|
<p>You realize these are null-terminated strings, select both words, and
|
|
reformat them:</p>
|
|
|
|
<pre>
|
|
1000: 686900 .zstr "hi"
|
|
1003: 74657374+ .zstr "test"
|
|
</pre>
|
|
|
|
<p>Seems simple enough. Under the hood, SourceGen created three changes:</p>
|
|
<ol>
|
|
<li>At offset +000000, replace the current format descriptor (4-byte
|
|
numeric) with a 3-byte null-terminated string descriptor.</li>
|
|
<li>At offset +000003, add a new 5-byte null-terminated string
|
|
descriptor.</li>
|
|
<li>At offset +000004, remove the 4-byte numeric descriptor.</li>
|
|
</ol>
|
|
|
|
<p>Each entry in the change set has "before" and "after" states for the
|
|
format descriptor at a specific offset. Only the state for the affected
|
|
offsets is included -- the program doesn't record the state of the full
|
|
project after each change (even with the RAM on a modern system that would
|
|
add up quickly). When undoing a change, before and after are simply
|
|
reversed.</p>
|
|
|
|
|
|
<h2><a name="code-analysis">Code Analysis</a></h2>
|
|
|
|
<p>The code tracer walks through the instructions, examining them to
|
|
determine where execution will proceed next. There are five possibilities
|
|
for every instruction:</p>
|
|
<ol>
|
|
<li>Continue. Execution always continues at the next instruction.
|
|
Examples: <code>LDA</code>, <code>STA</code>, <code>AND</code>,
|
|
<code>NOP</code>.</li>
|
|
<li>Don't continue. The next instruction to be executed can't be
|
|
determined from the file data (unless you're disassembling the
|
|
system ROM around the BRK vector).
|
|
Examples: <code>RTS</code>, <code>BRK</code>.</li>
|
|
<li>Branch always. The operand specifies the next instruction address.
|
|
Examples: <code>JMP</code>, <code>BRA</code>, <code>BRL</code>.</li>
|
|
<li>Branch sometimes. Execution may continue at the operand address,
|
|
or may execute the following instruction. If we know the value of
|
|
the flags in the processor status register, we can eliminate one
|
|
possibility. Examples: <code>BCC</code>, <code>BEQ</code>,
|
|
<code>BVS</code>.</li>
|
|
<li>Call subroutine. Execution will continue at the operand address,
|
|
and is expected to also continue at the following instruction.
|
|
Examples: <code>JSR</code>, <code>JSL</code>.</li>
|
|
</ol>
|
|
|
|
<p>Branch targets are added to a list. When the current run of instructions
|
|
is exhausted (i.e. a "don't continue" or "branch always" instruction is
|
|
reached), the next target is pulled off of the list.</p>
|
|
|
|
<p>The state of the processor status flags is recorded for every
|
|
instruction. When execution proceeds to the next instruction or branches
|
|
to a new address, the flags are merged with the flags at the new
|
|
location. If one execution path through a given address has the flags
|
|
in one state (say, the carry is clear), while another execution path
|
|
sees a different state (carry is set), the merged flag is
|
|
"indeterminate". Indeterminate values cannot become determinate through
|
|
a merge, but can be set by an instruction.</p>
|
|
|
|
<p>There can be multiple paths to a single address. If the analyzer
|
|
sees that an instruction has been visited before, with an identical set
|
|
of status flags, the analyzer stops pursuing that path.</p>
|
|
|
|
<p>The analyzer must always know the width of immediate load instructions
|
|
when examining 65816 code, but it's possible for the status flag values
|
|
to be indeterminate. In such a situation, short registers are assumed.
|
|
Similarly, if the carry flag is unknown when an <code>XCE</code> is
|
|
performed, we assume a transition to emulation mode (E=1).</p>
|
|
|
|
<p>There are three ways in which code can set a flag to a definite value:</p>
|
|
<ol>
|
|
<li>With explicit instructions, like <code>SEC</code> or
|
|
<code>CLD</code>.</li>
|
|
<li>With immediate-operand instructions. <code>LDA #$00</code> sets Z=1
|
|
and N=0. <code>ORA #$80</code> sets Z=0 and N=1.</li>
|
|
<li>By inference. For example, if we see a <code>BCC</code> instruction,
|
|
we know that the carry will be clear at the branch target address, and
|
|
set at the following instruction. The instruction doesn't affect the
|
|
value of the flag, but we know what the value will be at both
|
|
addresses.</li>
|
|
</ol>
|
|
<p>Self-modifying code can render spoil any of these, possibly requiring a
|
|
status flag override to get correct disassembly.</p>
|
|
|
|
<p>The instruction that is most likely to cause problems is <code>PLP</code>,
|
|
which pulls the processor status flags off of the stack. SourceGen
|
|
doesn't try to track stack contents, so it can't know what values may
|
|
be pulled. In many cases the <code>PLP</code> appears not long after a
|
|
<code>PHP</code>, so SourceGen will scan backward through the file to
|
|
find the nearest <code>PHP</code>, and use the status flags from that. If
|
|
no <code>PHP</code> can be found, then all
|
|
flags are set to "indeterminate". (The boot loader in the Apple II 5.25"
|
|
floppy disk controller is an example where SourceGen gets it wrong. The
|
|
code does <code>CLC</code>/<code>PHP</code>, followed a bit later by the
|
|
<code>PLP</code>, but it's actually using the stack to pass the carry
|
|
flag around. Flagging the carry bit as indeterminate with a status flag
|
|
override on the instruction following the PLP fixes things.) The
|
|
"smart" behavior can be disabled in the project properties if it's coming
|
|
out wrong more often than right.</p>
|
|
|
|
<p>Some other things that the code analyzer can't recognize automatically:</p>
|
|
<ul>
|
|
<li>Jumping indirectly through an address outside the file, e.g.
|
|
storing an address in zero-page memory and jumping through it.</li>
|
|
<li>Jumping to an address by pushing the location onto the stack,
|
|
then executing an <code>RTS</code>.</li>
|
|
<li>Self-modifying code, e.g. overwriting a <code>JMP</code> instruction.</li>
|
|
<li>Addresses invoked by external code, e.g. interrupt handlers.</li>
|
|
</ul>
|
|
<p>Sometimes the indirect jump targets are coming from a table of
|
|
addresses in the file. If so, these can be formatted as addresses,
|
|
and then the target locations hinted as code entry points.</p>
|
|
<p>The 65816 adds an additional twist: 16-bit data access instructions
|
|
use the data bank register ("B") to determine which bank to load from.
|
|
SourceGen can't determine what the value is, so it currently assumes
|
|
that it's equal to the program bank register ("K"). Handling this
|
|
correctly will require improvements to the user interface.</p>
|
|
|
|
|
|
<h3><a name="extension-scripts">Extension Scripts</a></h3>
|
|
|
|
<p>Extension scripts can mark data that follows a JSR, JSL, or BRK as inline
|
|
data, or change the format of nearby data or instructions. The first
|
|
time a JSR/JSL/BRK instruction is encountered, all loaded extension scripts
|
|
that implement the appropriate interface are offered a chance to act.</p>
|
|
|
|
<p>The first script that applies a format wins. Attempts to re-format
|
|
instructions or data that have already been formatted will fail. This rule
|
|
ensures that anything explicitly formatted by the user will not be
|
|
overridden by a script.</p>
|
|
|
|
<p>If code jumps into a region that is marked as inline data, the
|
|
branch will be ignored. If an extension script tries to flag bytes
|
|
as inline data that have already been executed, the script will be
|
|
ignored. This can lead to a race condition in the analyzer if
|
|
an extension script is doing the wrong thing. (The race doesn't exist
|
|
with inline data hints specified by the user, because those are applied
|
|
before code analysis starts.)</p>
|
|
|
|
|
|
<h2><a name="data-analysis">Data Analysis</a></h2>
|
|
<p>The data analyzer performs two tasks. It matches operands with
|
|
offsets, and it analyzes uncategorized data. This behavior can be
|
|
modified in the
|
|
<a href="settings.html#project-properties">project properties</a>.</p>
|
|
|
|
<p>The data target analyzer examines every instruction and data operand
|
|
to see if it's referring to an offset within the data file. If the
|
|
target is within the file, and has a label, a format descriptor with a
|
|
weak symbolic reference to that label is added to the Anattrib array. If
|
|
the target doesn't have a label, the analyzer will either use a nearby
|
|
label, or generate a unique label and use that.</p>
|
|
<p>While most of the "nearby label" logic can be disabled, targets that
|
|
land in the middle of an instruction are always adjusted backward to
|
|
the instruction start. This is necessary because labels are only visible
|
|
if they're associated with the first (opcode) byte of an instruction.</p>
|
|
|
|
<p>The uncategorized data analyzer tries to find character strings and
|
|
opportunities to use the ".FILL" operation. It breaks the file into
|
|
pieces, where contiguous regions hold nothing but data, are not split
|
|
across a ".ORG" directive, are not interrupted by data, and do not
|
|
contain anything that the user has chosen to format. Each region is
|
|
scanned for matching patterns. If a match is found, a format entry
|
|
is added to the Anattrib array. Otherwise, data is added as single-byte
|
|
values.</p>
|
|
|
|
|
|
</div>
|
|
|
|
<div id="footer">
|
|
<p><a href="index.html">Back to index</a></p>
|
|
</div>
|
|
</body>
|
|
<!-- Copyright 2018 faddenSoft -->
|
|
</html>
|