About Disassembly
Well-written assembly-language source code has meaningful comments and labels, so that humans can read and understand it. For example:
         .org    $2000
         sec                      ;set carry
         ror     A                ;shift into high bit
         bmi     CopyData         ;branch always

         .asciiz "first string"
         .asciiz "another string"
         .asciiz "string the third"
         .asciiz "last string"

CopyData lda     #<addrs          ;get pointer into
         sta     ptr              ; address table
         lda     #>addrs
         sta     ptr+1
Computers operate at a much lower level, so a piece of software called an assembler is used to convert the source code to object code that the CPU can execute. Object code looks more like this:
38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67
00 61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00
73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64
00 6c 61 73 74 20 73 74 72 69 6e 67 00 a9 63 85
02 a9 20 85 03
This arrangement works perfectly well until somebody needs to modify the software and nobody can find the original sources. Disassembly is the act of taking a raw hex dump and converting it to source code.
Disassembling a blob of data can be tricky. A simple disassembler can format instructions, but can't generally tell the difference between instructions and data. Many 6502 programs intermix code and data freely, so simply dumping everything as an instruction stream can result in sections with nonsensical output.
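The problem can be seen with a minimal sketch of a "dump everything as instructions" disassembler, run on the first 17 bytes of the object code above. The opcode table is deliberately partial (only the entries this example needs), and the names DATA, OPCODES, and linear_sweep are made up for illustration; none of this is SourceGen code.

```python
# First 17 bytes of the object code above: three instructions, then "first string".
DATA = bytes.fromhex("38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67 00")
ORIGIN = 0x2000  # load address, taken from the .org directive

# Partial opcode table (assumption: just enough for this example).
# opcode -> (mnemonic, addressing mode, length in bytes)
OPCODES = {
    0x00: ("brk", "imp", 1),
    0x20: ("jsr", "abs", 3),
    0x30: ("bmi", "rel", 2),
    0x38: ("sec", "imp", 1),
    0x66: ("ror", "zp", 2),
    0x69: ("adc", "imm", 2),
    0x6a: ("ror", "acc", 1),
}

def linear_sweep(data, origin):
    """Decode every byte as if it were code, front to back."""
    lines = []
    offset = 0
    while offset < len(data):
        addr = origin + offset
        # Opcodes missing from the table are shown as one-byte "???" entries.
        mnemonic, mode, length = OPCODES.get(data[offset], ("???", "imp", 1))
        if offset + length > len(data):
            mnemonic, mode, length = "???", "imp", 1  # operand runs off the end
        operand = data[offset + 1:offset + length]
        if mode == "acc":
            text = f"{mnemonic} A"
        elif mode == "imm":
            text = f"{mnemonic} #${operand[0]:02x}"
        elif mode == "zp":
            text = f"{mnemonic} ${operand[0]:02x}"
        elif mode == "abs":
            value = operand[0] | (operand[1] << 8)  # little-endian address
            text = f"{mnemonic} ${value:04x}"
        elif mode == "rel":
            # Branch displacement is signed, relative to the next instruction.
            disp = operand[0] - 0x100 if operand[0] >= 0x80 else operand[0]
            text = f"{mnemonic} ${addr + 2 + disp:04x}"
        else:
            text = mnemonic
        lines.append(f"{addr:04x}: {text}")
        offset += length
    return lines

for line in linear_sweep(DATA, ORIGIN):
    print(line)
```

The first three lines come out as sec, ror A, and bmi $203d, which matches the source. From $2004 onward the string bytes are decoded as code, so the text "first string" turns into output such as ror $69 and jsr $7473.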
One way to separate code from data is to try to trace all possible code paths. There are a number of reasons why it's difficult or impossible to do this perfectly, but you can get pretty good results by identifying execution entry points and just walking through the code. When a conditional branch is encountered, both paths are traversed. When all code has been traced, every byte that hasn't been visited is either data used by the program, or dead space not used by anything.
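The tracing idea can be sketched as a small worklist loop over the example object code. This is an illustration under simplifying assumptions, not SourceGen's implementation: the opcode table is partial, an unrecognized opcode simply ends that path, and the names OBJECT_CODE, OPCODES, and trace are hypothetical.

```python
# Object code from the example above, loaded at $2000.
OBJECT_CODE = bytes.fromhex(
    "38 6a 30 39 "
    "66 69 72 73 74 20 73 74 72 69 6e 67 00 "
    "61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00 "
    "73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64 00 "
    "6c 61 73 74 20 73 74 72 69 6e 67 00 "
    "a9 63 85 02 a9 20 85 03"
)
ORIGIN = 0x2000

# Partial table (assumption): opcode -> (length, is_conditional_branch).
OPCODES = {
    0x30: (2, True),    # bmi rel
    0x38: (1, False),   # sec
    0x66: (2, False),   # ror zp
    0x6a: (1, False),   # ror A
    0x85: (2, False),   # sta zp
    0xa9: (2, False),   # lda #imm
}

def trace(data, origin, entry_points):
    """Return the set of addresses reached by walking code from entry_points."""
    visited = set()
    pending = list(entry_points)
    while pending:
        addr = pending.pop()
        offset = addr - origin
        # Skip addresses outside the file or already covered (simplification:
        # overlapping instructions are not handled).
        if addr in visited or not 0 <= offset < len(data):
            continue
        info = OPCODES.get(data[offset])
        if info is None:
            continue  # unknown opcode: stop tracing this path
        length, is_branch = info
        if offset + length > len(data):
            continue  # operand runs off the end of the file
        visited.update(range(addr, addr + length))
        pending.append(addr + length)  # fall through to the next instruction
        if is_branch:
            disp = data[offset + 1]
            if disp >= 0x80:
                disp -= 0x100  # displacement is a signed byte
            pending.append(addr + 2 + disp)  # also follow the branch
    return visited

code = trace(OBJECT_CODE, ORIGIN, [0x2000])
data_bytes = [a for a in range(ORIGIN, ORIGIN + len(OBJECT_CODE)) if a not in code]
print(f"code bytes: {len(code)}, untraced bytes: {len(data_bytes)}")
```

With $2000 as the only entry point, the walk reaches the code at $203d through the branch and leaves most of the string bytes untouched. Because both sides of the BMI are followed, the two bytes at $2004 (the "fi" of "first string") are still visited as if they were a ror $69 instruction.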
The process can be improved by keeping track of the flags in the 6502 status register. For example, in the code fragment shown earlier, a BMI conditional branch instruction is used. A simple tracing algorithm would both follow the branch and fall through to the following instruction. However, the code that precedes the BMI ensures that the branch is always taken: SEC sets the carry flag, and ROR A rotates the carry into the high bit of the accumulator, which sets the negative flag. A clever disassembler would therefore only trace the branch-taken path.
(The situation is worse on the 65816, because the length of certain instructions is determined by the values of the processor status flags.)
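The flag-tracking refinement for the SEC / ROR A / BMI sequence might look roughly like the following sketch. It tracks only the carry (C) and negative (N) flags, each as True, False, or None for "unknown", and it keeps only the first flag state seen at each address; a real tracer would need to merge states and cover the full instruction set. The names LENGTHS, update_flags, and trace are hypothetical.

```python
OBJECT_CODE = bytes.fromhex(
    "38 6a 30 39 "
    "66 69 72 73 74 20 73 74 72 69 6e 67 00 "
    "61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00 "
    "73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64 00 "
    "6c 61 73 74 20 73 74 72 69 6e 67 00 "
    "a9 63 85 02 a9 20 85 03"
)
ORIGIN = 0x2000
BMI = 0x30

# Partial table (assumption): opcode -> instruction length in bytes.
LENGTHS = {0x30: 2, 0x38: 1, 0x66: 2, 0x6a: 1, 0x85: 2, 0xa9: 2}

def update_flags(opcode, operand, flags):
    """Return what is known about C and N after a non-branch instruction."""
    flags = dict(flags)
    if opcode == 0x38:                 # sec: carry is known to be set
        flags["C"] = True
    elif opcode in (0x6a, 0x66):       # ror: old carry becomes bit 7, so N = old C
        flags["N"] = flags["C"]
        flags["C"] = None              # new carry is the old bit 0, which is unknown
    elif opcode == 0xa9:               # lda #imm: N is bit 7 of the constant
        flags["N"] = bool(operand[0] & 0x80)
    return flags                       # sta leaves the flags alone

def trace(data, origin, entry):
    visited = set()
    pending = [(entry, {"C": None, "N": None})]
    while pending:
        addr, flags = pending.pop()
        offset = addr - origin
        if addr in visited or not 0 <= offset < len(data):
            continue
        opcode = data[offset]
        length = LENGTHS.get(opcode)
        if length is None or offset + length > len(data):
            continue                   # unknown opcode or truncated operand
        visited.update(range(addr, addr + length))
        next_addr = addr + length
        if opcode == BMI:
            disp = data[offset + 1]
            if disp >= 0x80:
                disp -= 0x100
            if flags["N"] is not False:   # taken unless N is known to be clear
                pending.append((next_addr + disp, {**flags, "N": True}))
            if flags["N"] is not True:    # falls through unless N is known set
                pending.append((next_addr, {**flags, "N": False}))
        else:
            operand = data[offset + 1:offset + length]
            pending.append((next_addr, update_flags(opcode, operand, flags)))
    return visited

code = trace(OBJECT_CODE, ORIGIN, 0x2000)
print(sorted(hex(a) for a in code))
```

Here N is known to be set when the BMI at $2002 is reached, so only the branch target at $203d is queued and the string bytes starting at $2004 are never treated as code.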
Once the instructions and data are separated and formatted nicely, it's still up to a human to figure out what it all means. Comments and meaningful labels are needed to make sense of it. These should be added to the disassembly listing.
SourceGen performs the instruction tracing, and makes it easy to format operands and add labels and comments. When the disassembled code is ready, SourceGen can generate source code for a variety of modern cross-assemblers, and produce HTML listings with embedded graphic visualizations.