6502bench SourceGen: Advanced Topics

Platform Symbol Files (.sym65)

Platform symbol files contain lists of symbols, each of which has a label and a value. SourceGen comes with a collection of symbols for popular systems, but you can create your own. This can be handy if a few different projects are coded against a common library.

If two symbols have the same value, the older symbol is replaced by the newer one. This is why the order in which symbol files are loaded matters.

Platform symbol files consist of comments, commands, and symbols. Blank lines, and lines that begin with a semicolon (';'), are ignored. Lines that begin with an asterisk ('*') are commands. Three are currently defined:

*SYNOPSIS - a short summary of the file contents.
*TAG - a tag string to apply to all symbols that follow in this file.
*MULTI_MASK - specify a mask for symbols that appear at multiple addresses.

Tags can be used by extension scripts to identify a subset of symbols. The symbols are still part of the global set; the tag just provides a way to extract a subset. Tags should consist of non-whitespace ASCII characters. Tags are global, so use a long, descriptive string. If *TAG is not followed by a string, the symbols that follow are treated as untagged.

All other lines are symbols, which have the form:

  LABEL {=|@|<|>} VALUE [WIDTH] [;COMMENT]

The LABEL must be at least two characters long, begin with a letter or underscore, and consist entirely of alphanumeric ASCII characters (A-Z, a-z, 0-9) and the underscore ('_'). (This is the same format required for line labels in SourceGen.)

The next token can be one of:

@: general addresses
<: read-only addresses
>: write-only addresses
=: constants

If an instruction references an address, and that address is outside the bounds of the file, the list of address symbols (i.e. everything that's not a constant) will be scanned for a match. If found, the symbol is applied automatically. You normally want to use '@', but can use '<' and '>' for memory-mapped I/O locations that have different behavior depending on whether they are read or written.

The VALUE is a number in decimal, hexadecimal (with a leading '$'), or binary (with a leading '%'). The numeric base will be recorded and used when formatting the symbol in generated output, so use whichever form is most appropriate. Values are unsigned 24-bit numbers. The special value "erase" may be used for an address to erase a symbol defined in an earlier platform file.

The WIDTH is optional, and ignored for constants. It must be a decimal or hexadecimal value between 1 and 65536, inclusive. If omitted, the default width is 1.

The COMMENT is optional. If present, it will be saved and used as the end-of-line comment on the .EQ directive if the symbol is used.
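The rules above can be collected into a small parser. This is only a sketch in Python, not SourceGen's own code: the function and field names are made up for illustration, and it assumes that tokens are separated by whitespace and that "erase" is written in lower case.

```python
import re

# Hypothetical helper names; none of this is SourceGen's own code.
LABEL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]+")  # at least two characters
SYMBOL_TYPES = {"@": "address", "<": "read-only", ">": "write-only", "=": "constant"}
NUMBER_FORMS = (
    (re.compile(r"\$([0-9A-Fa-f]+)"), 16),  # hexadecimal
    (re.compile(r"%([01]+)"), 2),           # binary
    (re.compile(r"([0-9]+)"), 10),          # decimal
)


def parse_number(text):
    """Return (value, base) for a decimal, $hex, or %binary token."""
    for pattern, base in NUMBER_FORMS:
        match = pattern.fullmatch(text)
        if match:
            return int(match.group(1), base), base
    raise ValueError("bad number: " + text)


def parse_symbol_line(line):
    """Parse 'LABEL {=|@|<|>} VALUE [WIDTH] [;COMMENT]' into a dict."""
    body, _, comment = line.partition(";")
    tokens = body.split()  # assumption: tokens are whitespace-separated
    if len(tokens) not in (3, 4):
        raise ValueError("expected LABEL, type, VALUE, and optional WIDTH")
    label, kind, value_text = tokens[:3]
    if not LABEL_RE.fullmatch(label):
        raise ValueError("bad label: " + label)
    if kind not in SYMBOL_TYPES:
        raise ValueError("bad symbol type: " + kind)

    is_constant = kind == "="
    erase = value_text == "erase" and not is_constant
    if erase:
        value, base = None, None
    else:
        value, base = parse_number(value_text)
        if value > 0xFFFFFF:  # values are unsigned 24-bit
            raise ValueError("value out of range: " + value_text)

    width = None if is_constant else 1  # width is ignored for constants
    if len(tokens) == 4 and not is_constant:
        width, width_base = parse_number(tokens[3])
        if width_base == 2 or not 1 <= width <= 65536:
            raise ValueError("bad width: " + tokens[3])

    return {"label": label, "type": SYMBOL_TYPES[kind], "value": value,
            "base": base, "erase": erase, "width": width,
            "comment": comment.strip() or None}
```

The numeric base is returned alongside the value because the text above says it is recorded for use when formatting output. The labels used to try this out (for example "IO_PORT @ $D000 ;example register") are invented, not taken from any shipped symbol file.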

Using MULTI_MASK

The multi-address mask is used for systems like the Atari 2600, where RAM, ROM, and I/O registers appear at multiple addresses. The hardware looks for certain address lines to be set or clear, and if the pattern matches, another set of bits is examined to determine which register or RAM address is being accessed.

This is expressed in symbol files with the MULTI_MASK statement. Address symbol declarations that follow have the mask set applied. Symbols whose addresses don't fit the pattern cause a warning and will be ignored. Constants are not affected.

The mask set is best explained with an example. Suppose the address pattern for a set of registers is ???0 ??1? 1??x xxxx (where '?' can be any value, 0/1 must be that value, and 'x' means the bit is used to determine the register). Any address in the range $0280-$029F matches, as does any address in $23C0-$23DF, but $0480 and $1280 don't. The register number is found in the low five bits.

The corresponding MULTI_MASK line, with values specified in binary, would be:

  *MULTI_MASK %0001001010000000 %0000001010000000 %0000000000011111

The values are CompareMask, CompareValue, and AddressMask. To determine if an address is in the register set, we check to see if (address & CompareMask) == CompareValue. If so, we can extract the register number with (address & AddressMask).
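The two checks just described amount to a couple of bitwise operations. A minimal Python sketch, using the mask values from the example line (the function name is invented for illustration):

```python
# Mask values from the example MULTI_MASK line above.
COMPARE_MASK = 0b0001001010000000
COMPARE_VALUE = 0b0000001010000000
ADDRESS_MASK = 0b0000000000011111


def register_index(address, compare_mask=COMPARE_MASK,
                   compare_value=COMPARE_VALUE, address_mask=ADDRESS_MASK):
    """Return the register number, or None if the address is not in the set."""
    if (address & compare_mask) != compare_value:
        return None
    return address & address_mask
```

With these values, $0281 and $23C1 both yield register 1, while $0480 and $1280 yield None.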

We don't want to have a huge collection of equates at the top of the generated source file, so whatever value is used in the symbol declaration is considered the "canonical" value. All other matching values are output with an offset.

All mask values must fall between 0 and $00FFFFFF. The set bits in CompareMask and AddressMask must not overlap, and CompareValue must not have any bits set that aren't also set in CompareMask.
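These constraints are easy to check mechanically. The following Python sketch is one way to do it; the function name and the message strings are illustrative, not SourceGen's.

```python
MAX_MASK = 0x00FFFFFF  # upper bound stated above


def check_multi_mask(compare_mask, compare_value, address_mask):
    """Return a list of problems; an empty list means the set is valid."""
    problems = []
    named = (("CompareMask", compare_mask),
             ("CompareValue", compare_value),
             ("AddressMask", address_mask))
    for name, value in named:
        if not 0 <= value <= MAX_MASK:
            problems.append(name + " out of range")
    if compare_mask & address_mask:
        problems.append("CompareMask and AddressMask overlap")
    if compare_value & ~compare_mask:
        problems.append("CompareValue has bits not set in CompareMask")
    return problems
```

The mask set from the earlier example passes all three checks.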

If an address can be mapped to a masked value and an unmasked value, the unmasked value takes precedence for exact matches. In the example above, if you declare REG1 @ $0281 outside the MULTI_MASK declaration, the disassembler will use REG1 for all operands that reference $0281. If other code accesses the same register as $23C1, the symbol established for the masked value will be used instead.

If there are multiple masked values for a given address, the precedence is undefined.

To disable the MULTI_MASK and resume normal declarations, write the command without arguments:

  *MULTI_MASK

Creating a Project-Specific Symbol File

To create a platform symbol file for your project, just create a new text file, named with a ".sym65" extension. (If your text editor of choice doesn't like that, you can put a ".txt" on the end while you're editing.) Make sure you create it in the same directory where your project file (the file that ends with ".dis65") lives. Add a *SYNOPSIS, then add the desired symbols.

Finally, add it to your project. Select Edit > Project Properties, switch to the Symbol Files tab, click Add Symbol Files from Project, and select your symbol file. It should appear in the list with a "PROJ:" prefix.

If an example helps, the A2-Amper-fdraw project in the Examples directory has a project-local symbol file, called "fdraw-exports". (fdraw-exports is a list of exported symbols from the fdraw library, for which Amper-fdraw provides an Applesoft BASIC interface.)

NOTE: in the current version of SourceGen, changes to .sym65 files are not detected automatically. Closing and re-opening the project (File > Recent Projects, then select the first entry) will reload them.

Extension Scripts

Extension scripts, also called "plugins", are C# programs with access to the full .NET Standard 2.0 APIs. They're compiled at run time by SourceGen and executed in a sandbox with security restrictions.

SourceGen defines an interface that plugins must implement, and an interface that plugins can use to interact with SourceGen. See Interfaces.cs in the PluginCommon directory. Bear in mind that this feature is still evolving, and the interfaces may change significantly in the near future.

The current interfaces can be used to generate visualizations, to identify inline data that follows JSR, JSL, or BRK instructions, and to format operands. The last of these can be useful for replacing immediate load operands with symbolic constants.

Scripts may be loaded from the RuntimeData directory, or from the directory where the project file lives. Attempts to load them from other locations will fail.

A project may load multiple scripts. The order in which they are invoked is not defined.

Known Issues and Limitations

When a project is opened, any errors encountered by the script compiler are reported to the user. If the project is already open, and a script is added to the project through the Project Properties editor, compiler messages are silently discarded. (This also applies if you undo/redo across the property edit.)

Development

The easiest way to develop extension scripts is inside the 6502bench solution in Visual Studio. This way you have the interfaces available for IntelliSense completion, and get all the usual syntax and compile checking in the editor. (This is why there's a RuntimeData project for Visual Studio.)

If you have the solution configured for debug builds, SourceGen will pass IncludeDebugInformation=true to the script compiler. This causes a .PDB file to be created. While this can help with debugging, it can sometimes get in the way: if you edit the script source code and reload the project without restarting the app, SourceGen will recompile the script, but the old .PDB file will still be held open by Visual Studio and you'll get error messages.

Some commonly useful functions are defined in the PluginCommon.Util class, which is available to plugins. These call into the CommonUtil library, which is shared with SourceGen. While plugins can use CommonUtil directly, they should avoid doing so. The APIs there are not guaranteed to be stable, so plugins that rely on them may break in a subsequent release of SourceGen.

PluginDll Directory

Extension scripts are compiled into .DLLs, and saved in the PluginDll directory, which lives next to the application executable and RuntimeData. If the extension script is the same age or older than the DLL, SourceGen will continue to use the existing DLL.
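The reuse rule can be sketched in a few lines of Python. This assumes that "age" means the file modification time, which the text above does not actually say, and the function name is invented.

```python
import os


def needs_recompile(script_path, dll_path):
    """True if no DLL exists yet, or if the script is newer than the DLL."""
    if not os.path.exists(dll_path):
        return True
    # Same age or older than the DLL: keep using the existing DLL.
    return os.path.getmtime(script_path) > os.path.getmtime(dll_path)
```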

The DLL names are a combination of the script filename and script location. The compiled name for "MyPlatform/MyScript.cs" in the RuntimeData directory will be "RT_MyPlatform_MyScript.dll". For a project-specific script, it would look like "PROJ_MyProject_MyScript.dll".
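A Python sketch of a naming scheme that reproduces the two examples above. The exact rule is inferred from those examples only: replacing path separators with underscores, and building the project form from the project name plus the script name, are both assumptions. The function name is invented.

```python
import os


def compiled_dll_name(script_path, project_name=None):
    """Build a DLL name for a script path given relative to its base directory.

    With project_name=None the script is taken to live under RuntimeData.
    """
    stem, _ = os.path.splitext(script_path.replace("\\", "/"))
    flattened = stem.replace("/", "_")  # assumption: separators become '_'
    if project_name is None:
        return "RT_" + flattened + ".dll"
    return "PROJ_" + project_name + "_" + flattened + ".dll"
```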

The PluginCommon and CommonUtil DLLs will be copied into the directory, so that code in the sandbox has access to them.

Sandboxing

Extension scripts are executed in an App Domain sandbox. App domains are a .NET feature that creates a partition inside the virtual machine, isolating code. It still runs in the same address space, on the same threads, so the isolation is only effective for "partially trusted" code that has been declared safe by the bytecode verifier.

SourceGen disallows most actions, notably file access. An exception is made for reading files from the directory where the plugin DLLs live, but scripts are otherwise unable to read from or write to the filesystem. (A future version of SourceGen may provide an API that allows limited access to data files.)

App domain security is not absolute. I don't really expect SourceGen to be used as a malware vector, so there's no value in forcing scripts to execute in an isolated server process, or to jump through the other hoops required to really lock things down. I do believe there's value in defining the API in such a way that we could implement full security if circumstances change, so I'm using app domains as a way to keep the API honest.

Working With Multiple Binaries

Sometimes a program is split into multiple files on disk. They may be all loaded at once, or some may be loaded into the same place at different times. In such situations it's not uncommon for one file to provide a set of interfaces that other files use. It's useful to have symbols for these interfaces be available to all projects.

There are two ways to do this: (1) define a common platform symbol file with the relevant addresses, and keep it up to date as you work; or (2) declare the labels as global and exported, and import them as project symbols into the other projects.

Support for this is currently somewhat weak, requiring a manual symbol-import step in every interested project. This step must be repeated whenever the labels are updated.

A different but related problem is typified by arcade ROM sets, where files are split apart because each file must be burned into a separate PROM. All files are expected to be present in memory at once, so there's no reason to treat them as separate projects. Currently, the best way to deal with this is to concatenate the files into a single file, and operate on that.

Overlapping Address Spaces

Some programs use memory overlays, where multiple parts of the code run in the same address in RAM. Others use bank switching to access parts of the program that reside in separate physical RAM, but appear at the same address.

SourceGen allows you to set the same address on multiple parts of a file. Branches to a given address are resolved against the current segment first. For example, consider this:

         .ORG    $1000
         JMP     L1100

         .ORG    $1100
L1100    BIT     L1100
L1103    LDA     #$11
         BRA     L1103

         .ORG    $1100
L1100_0  BIT     L1100_0
L1103_0  LDA     #$22
         JMP     L1103_0

Both sections start at $1100, and both contain a branch to $1103. The branch in the first section resolves to the label in the first version of that address chunk, while the branch in the second section resolves to the label in the second chunk. When branches originate outside the current address chunk, the first chunk that includes that address is used, as it is with the JMP at $1000 at the start of the file.
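The resolution rule can be sketched in Python as a two-step search: the chunk containing the branch is tried first, then all chunks in file order. The Chunk type and function name are invented for illustration, and the file offsets below simply follow from the instruction sizes in the listing (3 bytes for JMP and BIT with absolute operands, 2 bytes for LDA immediate and BRA).

```python
from collections import namedtuple

# Hypothetical description of one .ORG region of the file.
Chunk = namedtuple("Chunk", "file_offset address length")


def resolve_target(chunks, source_offset, target_address):
    """Return the file offset a branch lands on, or None if outside the file."""
    def offset_in(chunk):
        delta = target_address - chunk.address
        if 0 <= delta < chunk.length:
            return chunk.file_offset + delta
        return None

    # Step 1: prefer the chunk that contains the branch instruction itself.
    for chunk in chunks:
        if chunk.file_offset <= source_offset < chunk.file_offset + chunk.length:
            hit = offset_in(chunk)
            if hit is not None:
                return hit
            break
    # Step 2: otherwise use the first chunk that includes the address.
    for chunk in chunks:
        hit = offset_in(chunk)
        if hit is not None:
            return hit
    return None


# The three regions from the listing above.
EXAMPLE_CHUNKS = [
    Chunk(file_offset=0, address=0x1000, length=3),
    Chunk(file_offset=3, address=0x1100, length=7),
    Chunk(file_offset=10, address=0x1100, length=8),
]
```

Here the BRA at file offset 8 resolves $1103 to offset 6 (L1103), the JMP at offset 15 resolves it to offset 13 (L1103_0), and the JMP at offset 0 resolves $1100 to offset 3 (L1100).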

Debug Menu Options

The DEBUG menu is hidden by default in release builds, but can be exposed by checking the "enable DEBUG menu" box in the application settings. These features are used for debugging SourceGen. They will not help you debug 6502 projects.

Features:

Re-analyze (F5). Causes a full re-analysis. Useful if you think the display is out of sync.
Source Generation Tests. Opens the regression test harness. See README.md in the SGTestData directory for more information. If the regression tests weren't included in the SourceGen distribution, this will have nothing to do.
Show Analyzer Output. Opens a floating window with a text log from the most recent analysis pass. The exact contents will vary depending on how the verbosity level is configured internally. Debug messages from extension scripts appear here.
Show Analysis Timers. Opens a floating window with a dump of timer results from the most recent analysis pass. Times for individual stages are noted, as are times for groups of functions. This provides a crude sense of where time is being spent.
Show Undo/Redo History. Opens a floating window that lets you watch the contents of the undo buffer while you work.
Extension Script Info. Shows a bit about the currently-loaded extension scripts.
Show Comment Rulers. Adds a string of digits above every multi-line comment (long comment, note). Useful for confirming that the width limitation is being obeyed. These are added exactly as shown, without comment delimiters, into generated assembly output, which doesn't work out well if you run the assembler.
Use Keep-Alive Hack. If set, a "ping" is sent to the extension script sandbox every 60 seconds. This seems to be required to avoid an infrequently-encountered Windows bug. (See code for notes and stackoverflow.com links.)
Applesoft to HTML. An experimental feature that formats an Applesoft program as HTML.
Apply Platform Symbols. An experimental feature for turning platform symbols into address labels. This will run through the list of all symbols loaded from .sym65 files and find addresses that fall within the bounds of the file. If it finds an address that is the start of a code/data line and doesn't already have a user-supplied label, and the platform symbol's label isn't already defined elsewhere, the platform label will be applied. Useful when disassembling ROM images or other code with an established set of public entry points. (Tip: disable "analyze uncategorized data" from the project properties editor first.)
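
The conditions checked by Apply Platform Symbols can be summarized in a short Python sketch. This is a simplified model built only from the description above, not the actual implementation: the data structures and names are invented, and addresses are assumed to map one-to-one onto the file.

```python
def apply_platform_symbols(platform_symbols, line_starts, user_labels):
    """Apply platform labels to unlabeled line starts.

    platform_symbols: iterable of (label, address) pairs, constants excluded
    line_starts: set of addresses inside the file that begin a code/data line
    user_labels: dict of address -> label, updated in place
    Returns a dict holding only the labels that were newly applied.
    """
    applied = {}
    taken = set(user_labels.values())
    for label, address in platform_symbols:
        if address not in line_starts:
            continue  # outside the file, or in the middle of a line
        if address in user_labels:
            continue  # already has a user-supplied label
        if label in taken:
            continue  # label is already defined elsewhere
        user_labels[address] = label
        taken.add(label)
        applied[address] = label
    return applied
```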