Makes the odd-aligned case slightly more efficient and cleans up
the code, which no longer has to handle both even and odd
alignment cases at multiple points.
Find small optimizations to improve the average performance of the
blitter, especially in the odd-aligned case.
- Odd-aligned PEA exit is 2 cycles faster per line
- Odd-aligned JMP exit is 2 cycles faster per line
- Odd-aligned LDA exit is 6 cycles faster (eliminated long store)
- Merged setting the entry opcode and offset to convert two 8-bit
stores into a single 16-bit store (saves 6 cycles per line)
- Load and save the full word for the high bytes. Costs 2 cycles
but enables the 6 cycles saved in the LDA case.
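
The merged opcode/offset store in the list above can be sketched
as follows. This is a minimal illustration in Python, not the
blitter's actual 65816 code: the function names, the buffer, and
the opcode and offset values are all hypothetical. It only shows
that two adjacent 8-bit writes and one little-endian 16-bit write
leave the same bytes in memory, which is what makes the merge safe.

```python
import struct

def patch_two_stores(code: bytearray, addr: int, opcode: int, offset: int) -> None:
    """Two separate 8-bit writes: the opcode byte, then its operand byte."""
    code[addr] = opcode & 0xFF
    code[addr + 1] = offset & 0xFF

def patch_one_store(code: bytearray, addr: int, opcode: int, offset: int) -> None:
    """One 16-bit little-endian write covering both bytes at once."""
    # Low byte lands at addr, high byte at addr + 1 (65816 byte order).
    word = (opcode & 0xFF) | ((offset & 0xFF) << 8)
    struct.pack_into("<H", code, addr, word)

# Hypothetical values, chosen only for illustration.
a = bytearray(8)
b = bytearray(8)
patch_two_stores(a, 4, 0x80, 0x12)
patch_one_store(b, 4, 0x80, 0x12)
assert a == b
```
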
Eliminates the JSR/RTS overhead for the copy functions. Combined
with the other streamlining, we save around 60-70 cycles per
bank, for a total savings of around 10,000 cycles per second when
running at full screen.
This doesn't really change the FPS; it just gives some cycles
back to the main application logic.
The core data tables were reworked to pre-reverse all of the
entries so they directly match the right-to-left ordering of the
code fields. This simplified some code and was required for
register reuse in the masked tile renderer.
Also fixed several offset calculation issues in the masked tile
renderer.
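
The idea behind pre-reversing the tables can be sketched as
follows. This is a rough Python illustration under stated
assumptions, not the engine's real table layout: the slot count,
slot size, and function names are hypothetical. It assumes the
code field stores its slots right-to-left, so a forward-ordered
table needs an index flip on every lookup, while a table reversed
once up front can be indexed directly.

```python
# Hypothetical sizes, chosen only for illustration.
SLOTS = 8        # number of slots in one code field line
SLOT_SIZE = 3    # bytes per slot

# Forward-ordered table: byte offset of slot i within the line.
forward_table = [i * SLOT_SIZE for i in range(SLOTS)]

def offset_with_flip(column: int) -> int:
    """Lookup against the forward table: flip the index on every call."""
    return forward_table[SLOTS - 1 - column]

# Reverse the table once, ahead of time.
prereversed_table = forward_table[::-1]

def offset_direct(column: int) -> int:
    """Lookup against the pre-reversed table: the column is the index."""
    return prereversed_table[column]

# Both lookups agree for every column.
assert all(offset_with_flip(c) == offset_direct(c) for c in range(SLOTS))
```

Dropping the per-lookup index flip leaves the column value
untouched, which is the kind of saving that lets a register be
reused for something else.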