Find small optimizations to improve the average performance of the
blitter, especially in the odd-aligned case.
- Odd-aligned PEA exit is 2 cycles faster per line
- Odd-aligned JMP exit is 2 cycles faster per line
- Odd-aligned LDA exit is 6 cycles faster (eliminated long store)
- Merged setting the entry opcode and offset, converting two 8-bit
stores into a single 16-bit store (saves 6 cycles per line)
- Load and save the full word for the high bytes. This costs 2 cycles
but enables the 6-cycle saving in the LDA case.
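
The merged store above can be sketched in C. This is only an
illustration, not the engine's code: it assumes the opcode and
offset bytes sit next to each other in the patched code, and the
names store16_le, patch_entry_8bit and patch_entry_16bit are
hypothetical.

```c
#include <stdint.h>

/* Model of one 16-bit store on a little-endian CPU such as the 65816:
   the low byte lands at the lower address. */
static void store16_le(uint8_t *addr, uint16_t value) {
    addr[0] = (uint8_t)(value & 0xFF);
    addr[1] = (uint8_t)(value >> 8);
}

/* Before: the opcode and the offset are patched with two 8-bit stores. */
static void patch_entry_8bit(uint8_t *entry, uint8_t opcode, uint8_t offset) {
    entry[0] = opcode;
    entry[1] = offset;
}

/* After: both bytes are combined into one word and written with a single
   16-bit store. Assumes the two bytes are adjacent in memory. */
static void patch_entry_16bit(uint8_t *entry, uint8_t opcode, uint8_t offset) {
    uint16_t word = (uint16_t)(opcode | ((uint16_t)offset << 8));
    store16_le(entry, word);
}
```

Both routines leave the same two bytes in memory; the second one
simply does it with one store instead of two.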
* Split the creation of the sprite stamps from adding the
sprites themselves. This allows for 48 stamps that can
be pre-rendered and quickly reassigned to sprites for
animations.
* Inlined all calls to PushDirtyTile. This both removed
significant overhead from calling the small function and,
since almost all callers were checking multiple tiles, let
us avoid incrementing the count each time and just add a
single increment at the end.
* Switched from recording each tile that a sprite intersects
with each frame to only recording the top-left tile and the
overlap size. This reduced overhead for larger sprites
and removed the need for an end-of-list marker.
* Much more aggressive caching of Sprite and Tile Store
values in order to streamline the inner tile dispatch
routines.
* Moved TileStore and Sprites (and other supporting
data structures) into a separate data bank. This was needed
just for size reasons, but it also provides micro-optimizations
by opening up the use of abs,y addressing modes.
* Revamped multi-sprite rendering code to avoid the need to
copy any masks. All stacked sprites can now be drawn
via a sequence of and [addrX],y; ora (addrX),y where
addrX is set once per tile.
* General streamlining to reduce overhead. This work was
focused on removing as much per-tile overhead as possible.
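
As a rough illustration of recording only the top-left tile and
the overlap size, here is a minimal C sketch. It is not the
engine's implementation: the 8x8 tile size, the TileSpan struct
and the sprite_tile_span function are assumptions made for the
example, and the sprite width and height must be at least 1.

```c
#include <stdint.h>

#define TILE_SIZE 8  /* assumption: square 8x8 pixel tiles */

/* One fixed-size record per sprite instead of a list of tiles,
   so no end-of-list marker is needed. */
typedef struct {
    uint16_t tile_col;  /* column of the top-left tile */
    uint16_t tile_row;  /* row of the top-left tile */
    uint16_t cols;      /* number of tile columns overlapped */
    uint16_t rows;      /* number of tile rows overlapped */
} TileSpan;

/* Compute the tile span covered by a sprite at pixel position (x, y)
   with size w x h pixels (w >= 1, h >= 1). */
static TileSpan sprite_tile_span(uint16_t x, uint16_t y,
                                 uint16_t w, uint16_t h) {
    TileSpan s;
    s.tile_col = (uint16_t)(x / TILE_SIZE);
    s.tile_row = (uint16_t)(y / TILE_SIZE);
    /* Last covered tile minus first covered tile, plus one. */
    s.cols = (uint16_t)((x + w - 1) / TILE_SIZE - s.tile_col + 1);
    s.rows = (uint16_t)((y + h - 1) / TILE_SIZE - s.tile_row + 1);
    return s;
}
```

The record size stays constant no matter how large the sprite is,
which is where the saving for larger sprites comes from.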