About Disassembly

Well-written assembly-language source code has meaningful comments and labels, so that humans can read and understand it. For example:

          .org  $2000
          sec                         ;set carry
          ror   A                     ;shift into high bit
          bmi   CopyData              ;branch always

          .asciiz "first string"
          .asciiz "another string"
          .asciiz "string the third"
          .asciiz "last string"

CopyData  lda   #<addrs               ;get pointer into
          sta   ptr                   ; address table
          lda   #>addrs
          sta   ptr+1

Computers operate at a much lower level, so a piece of software called an assembler is used to convert the source code to object code that the CPU can execute. Object code looks more like this:

38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67
00 61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00
73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64
00 6c 61 73 74 20 73 74 72 69 6e 67 00 a9 63 85
02 a9 20 85 03
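
To see how these bytes correspond to the source, the short Python sketch below picks the dump apart by hand. The offsets are worked out from the listing above rather than taken from any tool, and the variable names are purely illustrative.

```python
# Object code from the hex dump above, loaded at the .org address.
OBJECT_CODE = bytes.fromhex(
    "38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67 "
    "00 61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00 "
    "73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64 "
    "00 6c 61 73 74 20 73 74 72 69 6e 67 00 a9 63 85 "
    "02 a9 20 85 03"
)
ORIGIN = 0x2000

# 38 = sec, 6a = ror A, 30 39 = bmi. The branch operand is an 8-bit
# displacement relative to the instruction after the branch ($2004).
# It is positive here ($39), so no sign extension is needed.
branch_target = ORIGIN + 4 + OBJECT_CODE[3]

# The four .asciiz strings sit between the branch and its target.
first = 4
last = branch_target - ORIGIN
strings = [s.decode("ascii")
           for s in OBJECT_CODE[first:last].split(b"\x00")[:-1]]

# a9 63 and a9 20 are lda #<addrs and lda #>addrs: low byte, then high byte.
addrs = OBJECT_CODE[last + 1] | (OBJECT_CODE[last + 5] << 8)

print(f"CopyData = ${branch_target:04x}")
print(strings)
print(f"addrs    = ${addrs:04x}")
```

The assembler has resolved CopyData to $203d, addrs to $2063, and ptr to zero-page location $02. None of those names, and none of the comments, survive in the object code.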

This arrangement works perfectly well until somebody needs to modify the software and nobody can find the original sources. Disassembly is the act of taking a raw hex dump and converting it to source code.

Disassembling a blob of data can be tricky. A simple disassembler can format instructions, but generally can't tell the difference between instructions and data. Many 6502 programs intermix code and data freely, so simply dumping everything as an instruction stream can result in sections with nonsensical output.
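
As a rough illustration of the problem, here is a minimal linear-sweep disassembler in Python, run over the object code shown earlier. It is a sketch, not how any particular tool works: the opcode table is a small hand-picked subset of documented 6502 opcodes (an assumption made to keep the example short), and any other byte is emitted as `.byte`.

```python
OBJECT_CODE = bytes.fromhex(
    "38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67 "
    "00 61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00 "
    "73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64 "
    "00 6c 61 73 74 20 73 74 72 69 6e 67 00 a9 63 85 "
    "02 a9 20 85 03"
)
ORIGIN = 0x2000

# Addressing mode -> (instruction length, operand format).
MODES = {
    "imp": (1, ""),
    "acc": (1, "A"),
    "imm": (2, "#${:02x}"),
    "zp":  (2, "${:02x}"),
    "izx": (2, "(${:02x},x)"),
    "rel": (2, "${:04x}"),
    "abs": (3, "${:04x}"),
    "ind": (3, "(${:04x})"),
}

# Partial opcode table (illustrative subset, not the full 6502 set).
OPCODES = {
    0x00: ("brk", "imp"), 0x20: ("jsr", "abs"), 0x30: ("bmi", "rel"),
    0x38: ("sec", "imp"), 0x61: ("adc", "izx"), 0x65: ("adc", "zp"),
    0x66: ("ror", "zp"),  0x68: ("pla", "imp"), 0x69: ("adc", "imm"),
    0x6a: ("ror", "acc"), 0x6c: ("jmp", "ind"), 0x6e: ("ror", "abs"),
    0x85: ("sta", "zp"),  0xa9: ("lda", "imm"),
}

def linear_sweep(data, origin):
    """Decode every byte as if it were part of an instruction stream."""
    lines = []
    offset = 0
    while offset < len(data):
        opcode = data[offset]
        mnemonic, mode = OPCODES.get(opcode, (".byte", None))
        length, fmt = MODES.get(mode, (1, ""))
        if mode is None or offset + length > len(data):
            # Unknown opcode, or operand runs off the end: emit raw byte.
            text = f".byte ${opcode:02x}"
            length = 1
        else:
            operand = int.from_bytes(data[offset + 1:offset + length], "little")
            if mode == "rel":
                # Signed displacement from the following instruction.
                if operand >= 0x80:
                    operand -= 0x100
                operand = (origin + offset + 2 + operand) & 0xffff
            text = f"{mnemonic} {fmt.format(operand)}".rstrip()
        lines.append(f"{origin + offset:04x}  {text}")
        offset += length
    return lines

for line in linear_sweep(OBJECT_CODE, ORIGIN):
    print(line)
```

The first three lines of output match the original source. From $2004 onward the text strings are decoded as instructions such as `ror $69` and `jsr $7473`, which is exactly the kind of nonsense a naive dump produces.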

One way to separate code from data is to try to trace every possible execution path. There are a number of reasons why it's difficult or impossible to do this perfectly, but you can get pretty good results by identifying execution entry points and just walking through the code. When a conditional branch is encountered, both paths are traversed. When all code has been traced, every byte that hasn't been visited is either data used by the program, or dead space not used by anything.
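
The tracing idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not SourceGen's actual algorithm: the `INSTRUCTIONS` table and the `trace` function are hypothetical names, the table covers only a handful of opcodes, and tracing simply stops at any byte it does not recognize.

```python
OBJECT_CODE = bytes.fromhex(
    "38 6a 30 39 66 69 72 73 74 20 73 74 72 69 6e 67 "
    "00 61 6e 6f 74 68 65 72 20 73 74 72 69 6e 67 00 "
    "73 74 72 69 6e 67 20 74 68 65 20 74 68 69 72 64 "
    "00 6c 61 73 74 20 73 74 72 69 6e 67 00 a9 63 85 "
    "02 a9 20 85 03"
)
ORIGIN = 0x2000

# opcode -> (length, flow). Only the opcodes this example needs, plus
# jmp and rts to show the other flow kinds; a real table has far more.
INSTRUCTIONS = {
    0x30: (2, "branch"),  # bmi rel
    0x38: (1, "next"),    # sec
    0x4c: (3, "jump"),    # jmp abs
    0x60: (1, "stop"),    # rts
    0x66: (2, "next"),    # ror zp
    0x6a: (1, "next"),    # ror A
    0x85: (2, "next"),    # sta zp
    0xa9: (2, "next"),    # lda #imm
}

def trace(data, origin, entry_points):
    """Return the set of offsets reached by following code from entry_points."""
    visited = set()
    pending = list(entry_points)
    while pending:
        addr = pending.pop()
        while True:
            offset = addr - origin
            if not 0 <= offset < len(data) or offset in visited:
                break                  # outside the file, or already traced
            info = INSTRUCTIONS.get(data[offset])
            if info is None or offset + info[0] > len(data):
                break                  # not an instruction this sketch knows
            length, flow = info
            visited.update(range(offset, offset + length))
            if flow == "stop":
                break
            if flow == "jump":
                pending.append(data[offset + 1] | (data[offset + 2] << 8))
                break
            if flow == "branch":
                disp = data[offset + 1]
                if disp >= 0x80:
                    disp -= 0x100      # displacement is signed
                pending.append(addr + length + disp)
            addr += length             # continue with the next instruction
    return visited

code = trace(OBJECT_CODE, ORIGIN, [ORIGIN])
untraced = [o for o in range(len(OBJECT_CODE)) if o not in code]
print(f"code bytes: {len(code)}, untraced bytes: {len(untraced)}")
```

Starting from $2000, this marks the three instructions at the start and the four at CopyData as code. Because it follows both sides of the `bmi`, it also marks the first two bytes of "first string" as a `ror` instruction before giving up, which shows why a plain trace is only a good approximation.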

The process can be improved by keeping track of the flags in the 6502 status register. For example, the code fragment shown earlier uses a BMI conditional branch instruction. A simple tracing algorithm would both follow the branch and fall through to the following instruction. However, the SEC and ROR A that precede the BMI rotate a set carry into the high bit of the accumulator, which sets the negative flag, so the branch is always taken. A clever disassembler would only trace that path.
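
A minimal sketch of that flag tracking, in Python, might look like the following. It models only the carry and negative flags and only the instructions in the example fragment; the names `step` and `bmi_paths` are made up for illustration, and any instruction the sketch does not model is assumed to leave both flags unknown.

```python
UNKNOWN = None  # flag value is not known at this point in the trace

def step(flags, opcode):
    """Return the flag state after one single-byte instruction."""
    flags = dict(flags)
    if opcode == 0x38:      # sec: carry is now known to be set
        flags["C"] = 1
    elif opcode == 0x18:    # clc: carry is now known to be clear
        flags["C"] = 0
    elif opcode == 0x6a:    # ror A: old carry moves into bit 7, so N = old C
        flags["N"] = flags["C"]
        # New carry is the old bit 0 of A, which this sketch does not track.
        flags["C"] = UNKNOWN
    else:                   # anything else: assume nothing is known
        flags["C"] = UNKNOWN
        flags["N"] = UNKNOWN
    return flags

def bmi_paths(flags):
    """List the paths a tracer must follow at a bmi instruction."""
    if flags["N"] == 1:
        return ["taken"]
    if flags["N"] == 0:
        return ["fall-through"]
    return ["taken", "fall-through"]

flags = {"C": UNKNOWN, "N": UNKNOWN}
for opcode in (0x38, 0x6a):   # sec, ror A
    flags = step(flags, opcode)
print(bmi_paths(flags))
```

With the negative flag known to be set, only the branch target needs to be traced, so the string data after the `bmi` is never mistaken for code.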

(The situation is worse on the 65816, because the length of certain instructions is determined by the values of the processor status flags.)
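
As a small illustration, the Python sketch below computes the length of 65816 immediate-mode instructions from the M (accumulator width) and X (index register width) flags. The function name is hypothetical and the tables list only the immediate-mode opcodes affected; a set flag means 8-bit.

```python
# 65816 immediate-mode opcodes whose operand width follows a status flag.
# M flag: ora, and, eor, adc, bit, lda, cmp, sbc
M_DEPENDENT = {0x09, 0x29, 0x49, 0x69, 0x89, 0xa9, 0xc9, 0xe9}
# X flag: ldy, ldx, cpy, cpx
X_DEPENDENT = {0xa0, 0xa2, 0xc0, 0xe0}

def immediate_length(opcode, m_flag, x_flag):
    """Length in bytes of an immediate-mode instruction (flag set = 8-bit)."""
    if opcode in M_DEPENDENT:
        return 2 if m_flag else 3
    if opcode in X_DEPENDENT:
        return 2 if x_flag else 3
    raise ValueError(f"opcode ${opcode:02x} is not in this sketch's tables")

# The bytes a9 63 85 02 from the example, under both accumulator widths.
print(immediate_length(0xa9, m_flag=True, x_flag=True))
print(immediate_length(0xa9, m_flag=False, x_flag=True))
```

With an 8-bit accumulator, `a9 63 85 02` is `lda #$63` followed by `sta $02`. With a 16-bit accumulator the same bytes begin with `lda #$8563`, leaving the `02` to be decoded as the start of the next instruction, so a wrong guess about the flags throws off everything that follows.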

Once the instructions and data are separated and formatted nicely, it's still up to a human to figure out what it all means. Comments and meaningful labels are needed to make sense of it. These should be added to the disassembly listing.

SourceGen performs the instruction tracing, and makes it easy to format operands and add labels and comments. When the disassembled code is ready, SourceGen can generate source code for a variety of modern cross-assemblers, and produce HTML listings with embedded graphic visualizations.
