<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="stylesheet" href="main.css"/>
<title>Advanced Topics - 6502bench SourceGen</title>
</head>
<body>
<div id="content">
<h1>SourceGen: Advanced Topics</h1>
<p><a href="index.html">Back to index</a></p>
<h2 id="platform-symbols">Platform Symbol Files (.sym65)</h2>
<p>Platform symbol files contain lists of symbols, each of which has a
label and a value. SourceGen comes with a collection of symbols for
popular systems, but you can create your own. This can be handy if a
few different projects are coded against a common library.</p>
<p>If two symbols have the same value, the older symbol is replaced by
the newer one. This is why the order in which symbol files are loaded
matters.</p>
<p>Platform symbol files consist of comments, commands, and symbols.
Blank lines, and lines that begin with a semicolon (';'), are ignored. Lines
that begin with an asterisk ('*') are commands. Three are currently
defined:</p>
<ul>
<li><code>*SYNOPSIS</code> - a short summary of the file contents.</li>
<li><code>*TAG</code> - a tag string to apply to all symbols that follow
in this file.</li>
<li><code>*MULTI_MASK</code> - specify a mask for symbols that appear
at multiple addresses.</li>
</ul>
<p>Tags can be used by extension scripts to identify a subset of symbols.
The symbols are still part of the global set; the tag just provides a
way to extract a subset. Tags should be composed of non-whitespace ASCII
characters. Tags are global, so use a long, descriptive string. If
<code>*TAG</code> is not followed by a string, the symbols that follow
are treated as untagged.</p>
<p>All other lines are symbols, which have the form:</p>
<pre>
LABEL {=|@|&lt;|&gt;} VALUE [WIDTH] [;COMMENT]
</pre>
<p>The LABEL must be at least two characters long, begin with a letter or
underscore, and consist entirely of alphanumeric ASCII characters
(A-Z, a-z, 0-9) and the underscore ('_'). (This is the same format
required for line labels in SourceGen.)</p>
<p>The next token can be one of:</p>
<ul>
<li><code>@</code>: general addresses</li>
<li><code>&lt;</code>: read-only addresses</li>
<li><code>&gt;</code>: write-only addresses</li>
<li><code>=</code>: constants</li>
</ul>
<p>If an instruction references an address, and that address is outside
the bounds of the file, the list of address symbols (i.e. everything
that's not a constant) will be scanned for a match.
If found, the symbol is applied automatically. You normally want to
use '@', but can use '&lt;' and '&gt;' for memory-mapped I/O locations
that have different behavior depending on whether they are read or
written.</p>
<p>The VALUE is a number in decimal, hexadecimal (with a leading '$'), or
binary (with a leading '%'). The numeric base will be recorded and used when
formatting the symbol in generated output, so use whichever form is most
appropriate. Values are unsigned 24-bit numbers. The special value
"erase" may be used for an address to erase a symbol defined in an earlier
platform file.</p>
<p>The WIDTH is optional, and ignored for constants. It must be a
decimal or hexadecimal value between 1 and 65536, inclusive. If omitted,
the default width is 1.</p>
<p>The COMMENT is optional. If present, it will be saved and used as the
end-of-line comment on the .EQ directive if the symbol is used.</p>
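<p>The field rules above can be illustrated with a minimal Python sketch.
This is not SourceGen's parser: the <code>parse_symbol_line</code> function,
its regular expression, and the returned dictionary keys are hypothetical,
and details such as whether whitespace is required around the type character
are assumptions.</p>

```python
import re

# Hypothetical helper, NOT SourceGen's parser: splits one symbol line
# of the form  LABEL {=|@|<|>} VALUE [WIDTH] [;COMMENT]  into fields.
SYMBOL_RE = re.compile(
    r"^(?P<label>[A-Za-z_][A-Za-z0-9_]+)\s*"      # two or more characters
    r"(?P<kind>[=@<>])\s*"
    r"(?P<value>\$[0-9A-Fa-f]+|%[01]+|[0-9]+|erase)"
    r"(?:\s+(?P<width>\$[0-9A-Fa-f]+|[0-9]+))?"
    r"\s*(?:;(?P<comment>.*))?$"
)

KINDS = {"@": "address", "<": "read-only address",
         ">": "write-only address", "=": "constant"}


def parse_number(text):
    """Return (value, base) for a decimal, $hex, or %binary number."""
    if text.startswith("$"):
        return int(text[1:], 16), 16
    if text.startswith("%"):
        return int(text[1:], 2), 2
    return int(text, 10), 10


def parse_symbol_line(line):
    match = SYMBOL_RE.match(line.strip())
    if match is None:
        raise ValueError("not a symbol line: " + line)
    kind = match.group("kind")
    value, base = None, None          # None means "erase"
    if match.group("value") != "erase":
        value, base = parse_number(match.group("value"))
        if value > 0xFFFFFF:
            raise ValueError("value does not fit in 24 bits")
    elif kind == "=":
        raise ValueError("'erase' only applies to addresses")
    width = 1                         # default; ignored for constants
    if match.group("width") is not None and kind != "=":
        width, _ = parse_number(match.group("width"))
        if not 1 <= width <= 65536:
            raise ValueError("width must be between 1 and 65536")
    comment = match.group("comment")
    return {"label": match.group("label"), "kind": KINDS[kind],
            "value": value, "base": base, "width": width,
            "comment": comment.strip() if comment is not None else None}


kbd = parse_symbol_line("KBD @ $C000 ;keyboard data")
buf = parse_symbol_line("BUF < %1000 $10")
```

<p>The base is kept alongside the value because the text above says the
numeric base is recorded for use in generated output.</p>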
<h4>Using MULTI_MASK</h4>
<p>The multi-address mask is used for systems like the Atari 2600, where
RAM, ROM, and I/O registers appear at multiple addresses. The hardware
looks for certain address lines to be set or clear, and if the pattern
matches, another set of bits is examined to determine which register or
RAM address is being accessed.</p>
<p>This is expressed in symbol files with the MULTI_MASK statement.
Address symbol declarations that follow have the mask set applied. Symbols
whose addresses don't fit the pattern cause a warning and will be
ignored. Constants are not affected.</p>
<p>The mask set is best explained with an example. Suppose the address
pattern for a set of registers is <code>???0 ??1? 1??x xxxx</code>
(where '?' can be any value, 0/1 must be that value, and 'x' means the bit
is used to determine the register).
So any address between $0280-029F matches, as does $23C0-23DF, but
$0480 and $1280 don't. The register number is found in the low five bits.</p>
<p>The corresponding MULTI_MASK line, with values specified in binary,
would be:</p>
<pre> *MULTI_MASK %0001001010000000 %0000001010000000 %0000000000011111</pre>
<p>The values are CompareMask, CompareValue, and AddressMask. To
determine if an address is in the register set, we check to see if
<code>(address &amp; CompareMask) == CompareValue</code>. If so, we can
extract the register number with <code>(address &amp; AddressMask)</code>.</p>
<p>We don't want to have a huge collection of equates at the top of the
generated source file, so whatever value is used in the symbol declaration
is considered the "canonical" value. All other matching values are output
with an offset.</p>
<p>All mask values must fall between 0 and $00FFFFFF. The set bits in
CompareMask and AddressMask must not overlap, and CompareValue must not
have any bits set that aren't also set in CompareMask.</p>
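<p>As a sketch of the arithmetic, the following Python fragment applies the
example mask set to a few addresses and checks the constraints listed above.
The function names <code>validate_masks</code> and
<code>register_number</code> are hypothetical and are not part of
SourceGen.</p>

```python
# Mask set from the example: pattern ???0 ??1? 1??x xxxx
COMPARE_MASK = 0b0001001010000000   # bits the hardware tests
COMPARE_VALUE = 0b0000001010000000  # required values of those bits
ADDRESS_MASK = 0b0000000000011111   # bits that select the register


def validate_masks(compare_mask, compare_value, address_mask):
    """Check the three constraints described for MULTI_MASK values."""
    for value in (compare_mask, compare_value, address_mask):
        if not 0 <= value <= 0x00FFFFFF:
            return False
    if compare_mask & address_mask:        # set bits must not overlap
        return False
    if compare_value & ~compare_mask:      # no bits outside CompareMask
        return False
    return True


def register_number(address):
    """Return the register index, or None if the address is not in the set."""
    if (address & COMPARE_MASK) != COMPARE_VALUE:
        return None
    return address & ADDRESS_MASK


for addr in (0x0281, 0x23C1, 0x0480, 0x1280):
    print(f"${addr:04X} -> {register_number(addr)}")
```

<p>Both $0281 and $23C1 select register 1, which is why only one of them
needs to be declared as the canonical value.</p>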
<p>If an address can be mapped to a masked value and an unmasked value,
the unmasked value takes precedence for exact matches. In the example
above, if you declare <code>REG1 @ $0281</code> outside the MULTI_MASK
declaration, the disassembler will use <code>REG1</code> for all operands
that reference $0281. If other code accesses the same register as $23C1,
the symbol established for the masked value will be used instead.</p>
<p>If there are multiple masked values for a given address, the precedence
is undefined.</p>
<p>To disable the MULTI_MASK and resume normal declarations, write the
command without arguments:</p>
<pre> *MULTI_MASK</pre>
<h3>Creating a Project-Specific Symbol File</h3>
<p>To create a platform symbol file for your project, just create a new
text file, named with a ".sym65" extension. (If your text editor of choice
doesn't like that, you can put a ".txt" on the end while you're editing.)
Make sure you create it in the same directory where your project file
(the file that ends with ".dis65") lives. Add a <code>*SYNOPSIS</code>,
then add the desired symbols.</p>
<p>Finally, add it to your project. Select
<samp>Edit &gt; Project Properties</samp>,
switch to the <samp>Symbol Files</samp> tab, click
<samp>Add Symbol Files from Project</samp>, and
select your symbol file. It should appear in the list with a
"PROJ:" prefix.</p>
<p>If an example helps, the A2-Amper-fdraw project in the Examples
directory has a project-local symbol file, called "fdraw-exports".
(fdraw-exports is a list of exported symbols from the fdraw library,
for which Amper-fdraw provides an Applesoft BASIC interface.)</p>
<p>NOTE: in the current version of SourceGen, changes to .sym65 files are
not detected automatically. Use <samp>File &gt; Reload External Files</samp>
to import the changes.</p>
<h2 id="extension-scripts">Extension Scripts</h2>
<p>Extension scripts, also called "plugins", are C# programs with access to
the full .NET Standard 2.0 APIs. They're compiled at run time by SourceGen
and executed in a sandbox with security restrictions.</p>
<p>The current interfaces can be used to generate visualizations, to
identify inline data that follows JSR, JSL, or BRK instructions, and to
format operands. The latter can be used to format code and data, e.g.
replacing immediate load operands with symbolic constants.</p>
<p>Scripts may be loaded from the RuntimeData directory, or from the directory
where the project file lives. Attempts to load them from other locations
will fail.</p>
<p>A project may load multiple scripts. The order in which functions are
invoked is not defined.</p>
<h3 id="built-in">Built-In Scripts</h3>
<p>A number of scripts are distributed with SourceGen, and may be used
freely by projects. Most are tailored for a specific platform, e.g.
Apple II ProDOS calls or Atari 2600 graphics.</p>
<p>The <samp>StdInline.cs</samp> script in the <samp>RuntimeData/Common</samp>
directory has some general-purpose inline data formatting functions.
To use them, add the script to the project, then add an appropriate label
to the subroutine that handles the inline data. For example, suppose the
code looks like this:</p>
<pre>
$1000 START JSR L1234
$1003 .STR "hello, world!"
$1010 .DD1 $00
$1011 .DD1 $a9
$1012 .DD1 $55
[...]
$1234 L1234 PLA
[...]
</pre>
<p>The code won't analyze correctly because it will try to follow the
code into the string data. If you include the script, and set the label
at <code>L1234</code> to <code>InAZ_PrintString</code>, the code will
then format correctly:</p>
<pre>
$1000 START JSR InAZ_PrintString
$1003 .ZSTR "hello, world!"
$1011 LDA #$55
[...]
$1234 InAZ_PrintString PLA
[...]
</pre>
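<p>The idea behind the null-terminated string handling can be sketched in
a few lines of Python. This only illustrates the arithmetic: real extension
scripts are written in C# against the plugin interfaces, and the
<code>inline_zstr_length</code> helper below is hypothetical. The byte
values are taken from the example listing.</p>

```python
def inline_zstr_length(data, offset):
    """Length of the null-terminated string at offset, including the
    terminator, or None if no terminator is found."""
    end = data.find(b"\x00", offset)
    if end < 0:
        return None
    return end - offset + 1


LOAD_ADDR = 0x1000
# JSR $1234, then the inline string, its terminator, and LDA #$55.
code = bytes([0x20, 0x34, 0x12]) + b"hello, world!" + bytes([0x00, 0xA9, 0x55])

jsr_offset = 0
string_offset = jsr_offset + 3            # inline data follows the 3-byte JSR
length = inline_zstr_length(code, string_offset)
resume_address = LOAD_ADDR + string_offset + length

print(f"string is {length} bytes, code resumes at ${resume_address:04X}")
```

<p>The computed resume address is $1011, which is where the
<code>LDA #$55</code> appears in the corrected listing.</p>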
<p>The label prefixes currently defined by the script are:</p>
<ul>
<li><code>InAZ_</code> : inline ASCII null-terminated string</li>
<li><code>InA1_</code> : inline ASCII length-delimited string</li>
<li><code>InPZ_</code> : inline PETSCII null-terminated string</li>
<li><code>InP1_</code> : inline PETSCII length-delimited string</li>
<li><code>InW_</code> : inline 16-bit word</li>
<li><code>InWA_</code> : inline 16-bit address</li>
<li><code>InNR_</code> : non-returning call (i.e. the JSR acts like
a JMP)</li>
</ul>
<p>Anything more complicated will require a custom script.</p>
<h3 id="script-dev">Script Development</h3>
<p>SourceGen defines an interface that plugins must implement, and an
interface that plugins can use to interact with SourceGen. See
Interfaces.cs in the PluginCommon directory.</p>
<p>The easiest way to develop extension scripts is inside the 6502bench
solution in Visual Studio. This way you have the interfaces available
for IntelliSense completion, and get all the usual syntax and compile
checking in the editor. (This is why there's a RuntimeData project for
Visual Studio.)</p>
<p>If you have the solution configured for debug builds, SourceGen will pass
<code>IncludeDebugInformation=true</code> to the script compiler. This
causes a .PDB file to be created. While this can help with debugging,
it can sometimes get in the way: if you edit the script source code and
reload the project without restarting the app, SourceGen will recompile
the script, but the old .PDB file will still be held open by Visual Studio
and you'll see some failure messages. Exiting and restarting SourceGen
will allow regeneration of the PDB files.</p>
<p>Some commonly useful functions are defined in the
<code>PluginCommon.Util</code> class, which is available to plugins. These
call into the CommonUtil library, which is shared with SourceGen.
While plugins could technically use CommonUtil directly, they should avoid
doing so. The APIs there are not guaranteed to be stable, so plugins
that rely on them may break in a subsequent release of SourceGen.</p>
<h4>Known Issues and Limitations</h4>
<p>Scripts are currently limited to C# version 5, because the compiler
built into .NET only handles that. C# 6 and later require installing an
additional package ("Roslyn"), so SourceGen does not support this.</p>
<p>When a project is opened, any errors encountered by the script compiler
are reported to the user. If the project is already open, and a script
is added to the project through the Project Properties editor, compiler
messages are silently discarded. (This also applies if you undo/redo across
the property edit.) Use <samp>File &gt; Reload External Files</samp> to see the
compiler messages.</p>
<h4>PluginDllCache Directory</h4>
<p>Extension scripts are compiled into .DLLs, and saved in the PluginDllCache
directory, which lives next to the application executable and RuntimeData.
If the extension script is the same age or older than the DLL, SourceGen
will continue to use the existing DLL.</p>
<p>The DLL names are a combination of the script filename and script location.
The compiled name for "MyPlatform/MyScript.cs" in the RuntimeData directory
will be "RT_MyPlatform_MyScript.dll". For a project-specific script, it
would look like "PROJ_MyProject_MyScript.dll".</p>
<p>The PluginCommon and CommonUtil DLLs will be copied into the directory, so
that code in the sandbox has access to them.</p>
<p>The contents of the directory are generated as needed, and can be deleted
entirely whenever SourceGen isn't running.</p>
<h4>Sandboxing</h4>
<p>Extension scripts are executed in an App Domain sandbox. App domains are
a .NET feature that creates a partition inside the virtual machine, isolating
code. It still runs in the same address space, on the same threads, so the
isolation is only effective for "partially trusted" code that has been
declared safe by the bytecode verifier.</p>
<p>SourceGen disallows most actions, notably file access. An exception is
made for reading files from the directory where the plugin DLLs live, but
scripts are otherwise unable to read or write from the filesystem. (A
future version of SourceGen may provide an API that allows limited access
to data files.)</p>
<p>App domain security is not absolute. I don't really expect SourceGen to
be used as a malware vector, so there's no value in forcing scripts to
execute in an isolated server process, or to jump through the other hoops
required to really lock things down. I do believe there's value in
defining the API in such a way that we <b>could</b> implement full security if
circumstances change, so I'm using app domains as a way to keep the API
honest.</p>
<h2 id="multi-bin">Working With Multiple Binaries</h2>
<p>Sometimes a program is split into multiple files on disk. They
may be all loaded at once, or some may be loaded into the same place
at different times. In such situations it's not uncommon for one
file to provide a set of interfaces that other files use. It's
useful to have symbols for these interfaces be available to all
projects.</p>
<p>There are two ways to do this: (1) define a common platform symbol
file with the relevant addresses, and keep it up to date as you work;
or (2) declare the labels as global and exported, and import them
as project symbols into the other projects.</p>
<p>Support for this is currently somewhat weak, requiring a manual
symbol-import step in every interested project. This step must be
repeated whenever the labels are updated.</p>
<p>A different but related problem is typified by arcade ROM sets,
where files are split apart because each file must be burned into a
separate PROM. All files are expected to be present in memory at
once, so there's no reason to treat them as separate projects. Currently,
the best way to deal with this is to concatenate the files into a single
file, and operate on that.</p>
<h2 id="overlap">Overlapping Address Spaces</h2>
<p>Some programs use memory overlays, where multiple parts of the
code run in the same address in RAM. Others use bank switching to access
parts of the program that reside in separate physical RAM or ROM,
but appear at the same address. Nested address regions allow for a
variety of configurations, which can make address resolution complicated.</p>
<p>The general goal is to have references to an address resolve to
the "nearest" match. For example, consider a simple overlay:</p>
<pre>
.ADDRS $1000
JMP L1100
.ADDRS $1100
L1100 BIT L1100
L1103 LDA #$11
BRA L1103
.ADREND
.ADDRS $1100
L1100_0 BIT L1100_0
L1103_0 LDA #$22
JMP L1103_0
.ADREND
.ADREND
</pre>
<p>Both sections start at $1100, and have branches to $1103. The branch
in the first section resolves to the label in the first version of
that address chunk, while the branch in the second section resolves to
the label in the second chunk. When branches originate outside the current
address chunk, the first chunk that includes that address is used, as
it is with the <code>JMP L1100</code> at the start of the file.</p>
<p>The full address-to-offset algorithm is as follows.
There are two inputs: the file offset of the instruction or data item
that has the reference (e.g. the JMP or LDA), and the address
it is referring to.</p>
<ul>
<li>Create a tree with all address regions. Each "node" in the tree
has an offset, length, and start address.</li>
<li>Search the tree for a node that includes the offset of the
reference source.
When there are multiple overlapping regions, descend until the
deepest child that spans the offset is found. This node will be
the starting point of the search.</li>
<li>Loop until we hit the top of the tree:
<ul>
<li>Perform a recursive depth-first search of all children of the
current node. They're searched in order of ascending file offset.</li>
<li>If the address wasn't found in the children, check the current
node. If we find it here, return this node as the result.</li>
<li>Move up to the parent node.</li>
</ul></li>
</ul>
<p>This searches all children and all siblings before checking the parent.
If we hit the top of the tree without finding a match, we conclude
that the reference is to an external address.</p>
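<p>The steps above can be sketched in Python. This is a simplified model,
not SourceGen's implementation: the <code>Region</code> class and the
<code>resolve</code> function are hypothetical, the tree is assumed to have
a single root that spans the whole file, and a parent is assumed not to own
the bytes covered by its children. The tree mirrors the overlay example,
with offsets computed from the instruction lengths.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    offset: int            # file offset where the region starts
    length: int            # number of bytes spanned
    address: int           # address assigned to the first byte
    children: List["Region"] = field(default_factory=list)
    parent: Optional["Region"] = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        self.children.sort(key=lambda r: r.offset)   # ascending file offset
        return child

    def spans_offset(self, offset):
        return self.offset <= offset < self.offset + self.length

    def own_offset_for(self, address):
        """Offset of address within this node, excluding child regions."""
        delta = address - self.address
        if not 0 <= delta < self.length:
            return None
        offset = self.offset + delta
        if any(c.spans_offset(offset) for c in self.children):
            return None
        return offset


def deepest_node(node, offset):
    """Descend to the deepest region that spans the source offset."""
    for child in node.children:
        if child.spans_offset(offset):
            return deepest_node(child, offset)
    return node


def search_down(node, address):
    """Depth-first search: all children first, then the node itself."""
    for child in node.children:
        found = search_down(child, address)
        if found is not None:
            return found
    return node.own_offset_for(address)


def resolve(root, src_offset, address):
    """Return the file offset for address, or None if it is external."""
    node = deepest_node(root, src_offset)
    while node is not None:
        found = search_down(node, address)
        if found is not None:
            return found
        node = node.parent          # move up and repeat
    return None


# Overlay example: outer region at $1000, two children that both start at $1100.
root = Region(offset=0, length=18, address=0x1000)
root.add(Region(offset=3, length=7, address=0x1100))
root.add(Region(offset=10, length=8, address=0x1100))

print(resolve(root, 0, 0x1100))    # JMP in the outer region
print(resolve(root, 8, 0x1103))    # BRA inside the first overlay
print(resolve(root, 15, 0x1103))   # JMP inside the second overlay
```

<p>Moving up to the parent re-searches the subtree that was just checked.
That keeps the sketch close to the description, at the cost of some
redundant work.</p>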
<p>The tree search can be pruned with the
"disallow inbound address resolution" and
"disallow outbound address resolution" flags, which can be set in
the address region edit dialog
(<a href="intro-details.html#region-isolation">more info here</a>).
When inbound resolution is disabled,
parents and siblings will not search into a region. When outbound
resolution is disabled, the search will not ascend to the region's parent.
Note that neither option prevents descent into a region's children.</p>
<h2 id="reloc-data">OMF Relocation Dictionaries</h2>
<p><i>This feature is considered experimental. Some features,
like cross-reference tracking, may not work correctly with it.</i></p>
<p>65816 code can be tricky to disassemble for a number of reasons.
24-bit addresses are formed from 16-bit data-access operands by combining
with the Data Bank Register (DBR), which often requires a bit of manual
intervention. But the problems go beyond that. Consider the following
bit of source code for the Apple IIgs:</p>
<pre>
rsrcmsg pea rsrcmsg2|-16
pea rsrcmsg2
_WriteCString
lda #buffer
sta pArcRead+$04
lda #buffer|-16
sta pArcRead+$06
</pre>
<p>In both cases we're referencing a 24-bit address as two 16-bit values.
Without context, the disassembler will treat the operands of the two PEA
instructions as independent 16-bit addresses, and the immediate values as
constants:</p>
<pre>
.dbank $02
02/317c: f4 02 00 L2317C pea L20002 &amp; $ffff
02/317f: f4 54 32 pea L23254 &amp; $ffff
02/3182: a2 0c 20 ldx #WriteCString
02/3185: 22 00 00 e1 jsl Toolbox
02/3189: a9 00 00 L23189 lda #$0000
02/318c: 8d 78 3f sta L23F78 &amp; $ffff
02/318f: a9 03 00 lda #$0003
02/3192: 8d 7a 3f sta L23F78 &amp; $ffff +2
</pre>
<p>Worse yet, those <code>STA</code> instruction operands would have been
shown as hex values or incorrect labels if the DBR had been set incorrectly.
However, if we have the relocation data, we know the full
address from which the addresses were formed, and we can tell when
immediate values are addresses rather than constants. And we can do this
even without setting the DBR.</p>
<pre>
02/317c: f4 02 00 L2317C pea L23254 &gt;&gt; 16
02/317f: f4 54 32 pea L23254 &amp; $ffff
02/3182: a2 0c 20 ldx #WriteCString
02/3185: 22 00 00 e1 jsl Toolbox
02/3189: a9 00 00 L23189 lda #L30000 &amp; $ffff
02/318c: 8d 78 3f sta L23F78 &amp; $ffff
02/318f: a9 03 00 lda #L30000 &gt;&gt; 16
02/3192: 8d 7a 3f sta L23F78 &amp; $ffff +2
</pre>
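<p>The way a 24-bit address is split across two 16-bit operands can be
checked with a short Python sketch. The helper names are hypothetical; the
addresses and opcode bytes come from the listings above, where the
<code>|-16</code> operator in the source corresponds to the
<code>&gt;&gt; 16</code> shown in the disassembly.</p>

```python
def split_long_address(address):
    """Split a 24-bit address into (bank word, low word)."""
    bank_word = address >> 16        # what "label|-16" evaluates to
    low_word = address & 0xFFFF      # what a plain 16-bit reference yields
    return bank_word, low_word


def pea_bytes(operand):
    """Encode PEA (opcode $F4) with a 16-bit little-endian operand."""
    return bytes([0xF4]) + operand.to_bytes(2, "little")


bank, low = split_long_address(0x023254)      # rsrcmsg2 in the example
print(pea_bytes(bank).hex(" "))               # first PEA in the listing
print(pea_bytes(low).hex(" "))                # second PEA in the listing

buf_bank, buf_low = split_long_address(0x030000)   # buffer in the example
```

<p>Without relocation data, the two halves look like unrelated 16-bit
values, because nothing in the operand bytes says which half they are or
that they belong together.</p>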
<p>The absence of relocation data can be a useful signal as well. For
example, when pushing arguments for a toolbox call, the disassembler
can tell the difference between addresses and constants without needing
emulation or pattern-matching, because only the addresses get
relocated. Consider this bit of source code:</p>
<pre>
lda &lt;total_records
pha
pea linebuf|-16
pea linebuf+65
pea $0005
pea $0000
_Int2Dec
</pre>
<p>Without relocation data, it becomes:</p>
<pre>
02/0aa8: a5 42 lda $42
02/0aaa: 48 pha
02/0aab: f4 02 00 pea L20002 &amp; $ffff
02/0aae: f4 03 31 pea L23103 &amp; $ffff
02/0ab1: f4 05 00 pea L20005 &amp; $ffff
02/0ab4: f4 00 00 pea L20000 &amp; $ffff
02/0ab7: a2 0b 26 ldx #Int2Dec
02/0aba: 22 00 00 e1 jsl Toolbox
</pre>
<p>With relocation data, we can treat the non-relocated operands as
constants:</p>
<pre>
02/0aa8: a5 42 lda $42
02/0aaa: 48 pha
02/0aab: f4 02 00 pea L230C2 &gt;&gt; 16
02/0aae: f4 03 31 pea L23103 &amp; $ffff
02/0ab1: f4 05 00 pea $0005
02/0ab4: f4 00 00 pea $0000
02/0ab7: a2 0b 26 ldx #Int2Dec
02/0aba: 22 00 00 e1 jsl Toolbox
</pre>
<h2 id="debug">Debug Menu Options</h2>
<p>The DEBUG menu is hidden by default in release builds, but can be
exposed by checking the "enable DEBUG menu" box in the application
settings. These features are used for debugging SourceGen. They will
not help you debug 6502 projects.</p>
<p>Features:</p>
<ul>
<li>Re-analyze (<kbd class="key">F5</kbd>). Causes a full re-analysis.
Useful if you think the display is out of sync.</li>
<li>Source Generation Tests. Opens the regression test harness. See
<code>README.md</code> in the SGTestData directory for more information.
If the regression tests weren't included in the SourceGen distribution,
this will have nothing to do.</li>
<li>Show Analyzer Output. Opens a floating window with a text log from
the most recent analysis pass. The exact contents will vary depending
on how the verbosity level is configured internally. Debug messages
from extension scripts appear here.</li>
<li>Show Analysis Timers. Opens a floating window with a dump of
timer results from the most recent analysis pass. Times for individual
stages are noted, as are times for groups of functions. This
provides a crude sense of where time is being spent.</li>
<li>Show Undo/Redo History. Opens a floating window that lets you
watch the contents of the undo buffer while you work.</li>
<li>Extension Script Info. Shows a bit about the currently-loaded
extension scripts.</li>
<li>Show Comment Rulers. Adds a string of digits above every
multi-line comment (long comment, note). Useful for confirming that
the width limitation is being obeyed. These are added exactly
as shown, without comment delimiters, into generated assembly output,
which doesn't work out well if you run the assembler.</li>
<li>Disable Security Sandbox. Extension scripts are loaded and run in
a "sandbox" to prevent security issues. Setting this flag allows
them to execute with full permissions.
This setting is not persistent.</li>
<li>Disable Keep-Alive Hack. The hack sends a "ping" to the extension
script sandbox every 60 seconds. This seems to be required to avoid
an infrequently-encountered Windows bug. (See code for notes and
stackoverflow.com links.)
This setting is not persistent.</li>
<li>Reboot Security Sandbox. Discards the sandbox, creates a new one,
and reloads it. Only useful for exercising the sandbox code that
runs when the keep-alives are unsuccessful.</li>
<li>Applesoft to HTML. An experimental feature that formats an
Applesoft program as HTML.</li>
<li>Export Edit Commands. Outputs comments and notes in
SourceGen Edit Command format. This is an experimental feature.</li>
<li>Apply Edit Commands. Reads a file in SourceGen Edit Command
format and applies the commands.</li>
<li>Apply External Symbols. An experimental feature for turning platform
and project symbols into address labels. This will run through the list
of all symbols loaded from .sym65 files and find addresses that fall
within the bounds of the file. If it finds an address that is the start
of a code/data line and doesn't already have a user-supplied label,
and the platform symbol's label isn't already defined elsewhere, the
platform label will be applied. Useful when disassembling ROM images
or other code with an established set of public entry points.
(Tip: disable "analyze uncategorized data" from the project
properties editor first, as this will not set labels in the middle
of multi-byte data items.)</li>
</ul>
</div>
<div id="footer">
<p><a href="index.html">Back to index</a></p>
</div>
</body>
<!-- Copyright 2018 faddenSoft -->
</html>