iigs-game-engine/src/GTE.Line.s

; Template and utility function for a single line of the GTE blitter. This is a memory
; hog. Because all of the registers (X, Y, D, SP, B) are potentially in use, we only
; have the P register and the PC for flow control. A lot of the code is replicated so that
; different pieces of code run at different addresses, which lets JMP instructions be used
; for flow control.
;
; Any JMP instruction with an address of $20XX will have its low byte set when the
; scroll position of the screen is set. Once the scroll position is set, repeatedly calling
; the blitter will refresh the screen each time without re-applying all of the patches
; that depend on the scroll position.
;
; When called:
;   * Interrupts are off
;   * Bank address is set to background 1
;   * Direct Page is set to the data field
;     + First 256 bytes are pointers to background 1 mask data
;     + Next 2048 bytes are dynamic tile data
;   * Bank 00 read
;   * Bank 01 write
;
; Each line takes up 8 KB and is aligned to a multiple of $2000
; NOTE: Each line may only need 4 KB -- space requirements are driven by snippet complexity
;
; Each 3-byte sequence in the code field is one of:
;
;   PEA $0000         => F4 00 00 = %1111 0100
;   LDA 00 / PHA      => A5 00 48 = %1010 0101
;   LDA 00,x / PHA    => B5 00 48 = %1011 0101
;   LDA (00),y / PHA  => B1 00 48 = %1011 0001
;   LDA 00,s / PHA    => A3 00 48 = %1010 0011
;   JMP 0000          => 4C 00 00 = %0100 1100
;
; Only the JMP opcode is less than $80 and all of the others perform their work inline,
; so this gives us a fast test to help extract only the high or low byte of each word
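;
; The opcode bytes listed above can also be told apart with two BIT tests, which is what
; the right_odd and left_odd paths further down do. As a worked check (a sketch that uses
; only the opcode values from the table; the masks are the ones in the code):
;
;   Opcode     AND $0B   AND $40   Result
;   F4 (PEA)     $00       n/a     zero on the first test, so BEQ takes the PEA path
;   A5 (LDA)     $01       $00     non-zero, then zero: LDA variant
;   B5 (LDA)     $01       $00     non-zero, then zero: LDA variant
;   B1 (LDA)     $01       $00     non-zero, then zero: LDA variant
;   A3 (LDA)     $03       $00     non-zero, then zero: LDA variant
;   4C (JMP)     $08       $40     non-zero on both tests, so BNE takes the JMP path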
;
; So, before diving into the code, just how fast is it? The architecture of GTE allows simple
; things to be fast and complex things to not be slow. That is, the developer has a lot
; of control over the time taken to render the full screen based on how complex it can be.
;
; That said, I'll cover three cases, ranging from the simple (a single background) to the
; complex (2 backgrounds, 50% mixed). The even- and odd-aligned cases are also broken out.
;
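; As a rough check on the frame rates quoted for the cases below: they are consistent with
; a 200-line screen and an effective CPU clock of roughly 2.5 MHz. Both values are
; assumptions made for this sketch; the notes do not state how the fps figures were derived.
;
;   fps ~= clock / (cycles per line * lines per frame)
;
;   Simple, even:   2,500,000 / (  420.0 * 200) = 2,500,000 /  84,000 ~= 29.8 fps
;   Simple, odd:    2,500,000 / (  453.5 * 200) = 2,500,000 /  90,700 ~= 27.6 fps
;   Complex, even:  2,500,000 / (1,517.0 * 200) = 2,500,000 / 303,400 ~=  8.2 fps
;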
; Simple case; all elements of the code field are PEA instructions
;
; Even:
; - Start at entry_3, 8 cycles to jump into the code field
; - 80 PEA instructions + one JMP = 403 cycles
; - BRA and JMP = 6 cycles
; - Final JMP to next line = 3 cycles
; -- total of 420 cycles / line of which 400 were spent doing necessary instructions
; -- theoretically almost 30 fps
;
; Odd:
; - Start at entry_3, 17 cycles to get to r_is_pea. If a second background is never used,
; this template could be specialized and reduce this overhead to 11 cycles.
; - 15 cycles to push the 8-bit right edge
; - 78 PEA instructions + one JMP = 393 cycles
; - 50% JMP to odd_exit = 1.5 cycles, amortized
; - 24 cycles to push the 8-bit left edge
; - Final JMP to next line = 3 cycles
; -- total 453.5 cycles / line
; -- theoretically 27.5 fps
;
; Complex; 25% of code-field is PEA, 25% is LDA (00),y / PHA, and 50% is mixed
;
; Even:
; - Start at entry_3, 8 cycles to jump into the code field
; - Code Field
;   - 20 PEA instructions = 100 cycles
;   - 20 LDA (00),y / PHA = 240 cycles
;   - 40 JMP / Fast Path = 1040 cycles
;   - JMP loop = 3 cycles
; - BRA and JMP = 6 cycles
; - Final JMP to next line = 3 cycles
; -- total of 1,517 cycles / line of which 700 were spent doing necessary instructions
; -- theoretically about 8 fps
;
; Odd:
                MX    %00
entry_1         ldx   #0000             ; patch with the address of the direct page tiles. Fixed.
entry_2         ldy   #0000             ; patch with the address of the line in the second layer. Set when BG1 scroll position changes.
entry_3         lda   #0000             ; patch with the address of the right edge of the line. Set when origin position changes.
                tcs
entry_jmp       jmp   $2000
                dfb   00                ; if the screen is odd-aligned, then the opcode is set to
                                        ; $AF to convert to a LDA long instruction. This puts the
                                        ; first two bytes of the instruction field in the accumulator
                                        ; and falls through to the next instruction.
                                        ;
                                        ; We structure the line so that the entry point only needs to
                                        ; update the low byte of the address, which means it takes only
                                        ; an amortized 4 cycles per line to set the entry point
right_odd       bit   #$000B            ; Check the bottom nibble to quickly identify a PEA instruction
                beq   r_is_pea          ; This costs 6 cycles in the fast-path
                bit   #$0040            ; Check bit 6 to distinguish between JMP and all of the LDA variants
                bne   r_is_jmp
                stal  r_lda_patch+1     ; Original word is still in the accumulator. Execute it. We inline
r_lda_patch     dfb   00,00             ; this here to avoid needing a BRA instruction back. So the fast-path
                                        ; gets a 1-cycle penalty, but we save 3 cycles here.
r_is_pea        xba                     ; fast code for PEA
                sep   #$30
                pha
                rep   #$30
                jmp   $2003             ; unconditionally jump into the "next" instruction in the
                                        ; code field. This is OK, even if the entry point was the
                                        ; last instruction, because there is a JMP at the end of
                                        ; the code field, so the code will simply jump to that
                                        ; instruction directly.
                                        ;
                                        ; As with the original entry point, because all of the
                                        ; code field is page-aligned, only the low byte needs to
                                        ; be updated when the scroll position changes
r_is_jmp        sep   #$41              ; Set the C and V flags which tells a snippet to push only the low byte
                ldal  entry_jmp+1
                stal  r_jmp_patch+1
r_jmp_patch     dfb   $4C,$00,$00       ; Jump back to address in entry_jmp (this takes 16 cycles, is there a better way?)
; This is the spot that needs to be page-aligned. In addition to simplifying the entry address
; and only needing to update a byte instead of a word, because the code breaks out of the
; code field with a BRA instruction, we keep everything within a page to avoid the 1-cycle
; page-crossing penalty of the branch.
                jmp   odd_exit          ; +0 Alternate exit point depending on whether the left edge is
                jmp   even_exit         ; +3 odd-aligned
loop            lup   82                ; +6 Set up 82 PEA instructions, which is 328 pixels and consumes 246 bytes
                pea   $0000             ; This is 41 8x8 tiles in width. Need to have N+1 tiles for screen overlap
                --^
                jmp   loop              ; +252 Ensure execution continues to loop around
                jmp   even_exit         ; +255
odd_exit        lda   #0000             ; This operand field is *always* used to hold the original 2 bytes of the code field
                                        ; that are replaced by the needed BRA instruction to exit the code field. When the
                                        ; left edge is odd-aligned, we are able to immediately load the value and perform
                                        ; similar logic to the right_odd code path above
left_odd        bit   #$000B
                beq   l_is_pea
                bit   #$0040
                bne   l_is_jmp
                stal  l_lda_patch+1
l_lda_patch     dfb   00,00
l_is_pea        xba
                sep   #$30
                pha
                rep   #$30
                bra   even_exit
l_is_jmp        sep   #$01              ; Set the C flag (V is always cleared at this point) which tells a snippet to push only the high byte
                ldal  entry_jmp+1
                stal  l_jmp_patch+1
l_jmp_patch     dfb   $4C,$00,$00       ; Jump back to address in entry_jmp (this takes 13 cycles, is there a better way?)
even_exit       jmp   $0000             ; Jump to the next line. We set up the blitter to do 8 or 16 lines at a time
                                        ; before restoring the machine state and re-enabling interrupts. This makes
                                        ; the blitter interrupt-friendly, allowing things like music players to continue
                                        ; to function.
                                        ;
                                        ; When it's time to exit, the next_entry address points to an alternate exit point.
; These are the special code snippets -- there is a 1:1 relationship between each snippet space
; and a 3-byte entry in the code field. Thus, each snippet has a hard-coded JMP to return to
; the next code field location.
;
; The snippet is required to handle the odd-alignment in-line; there is no facility for
; patching or intercepting these values due to their complexity. The only requirements
; are:
;
;   1. Carry Clear -> 16-bit write and return to the next code field operand
;   2. Carry Set
;      a. Overflow set   -> Low 8-bit write and return to the next code field operand
;      b. Overflow clear -> High 8-bit write and exit the line
;      c. Always clear the Carry flag. It's actually OK to leave the overflow bit in
;         its passed state, because having the carry bit clear prevents evaluation of
;         the V bit.
;
; Snippet Samples:
;
; Standard Two-level Mix (27 bytes)
;
;   Optimal      = 18 cycles (LDA/AND/ORA/PHA)
;   16-bit write = 23 cycles
;   8-bit low    = 35 cycles
;   8-bit high   = 36 cycles
;
;   start     lda   (00),y
;             and   #MASK
;             ora   #DATA           ; 14 cycles to load the data
;             bcs   8_bit
;             pha
;   out       jmp   next            ; Fast-path completes in 9 additional cycles
;   8_bit     sep   #$30            ; Switch to 8 bit mode
;             bvs   r_edge          ; Need to switch if doing the left edge
;             xba
;   r_edge    pha                   ; push the value
;             rep   #$31            ; put back into 16-bit mode and clear the carry bit, as required
;             bvs   out             ; jmp out and continue if this is the right edge
;             jmp   even_exit       ; exit the line otherwise
;                                   ;
;                                   ; The slow paths have 21 and 22 cycles for the right and left
;                                   ; odd-aligned cases respectively.
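;
; The totals quoted for this sample can be tallied instruction by instruction. This is a
; sketch that takes the 14-cycle load figure from the comment above and assumes
; native-mode timings, with taken branches costing 3 cycles and untaken branches 2:
;
;   16-bit write: 14 + bcs(2) + pha(4) + jmp(3)                                              = 23
;   8-bit low:    14 + bcs(3) + sep(3) + bvs(3) + pha(3) + rep(3) + bvs(3) + jmp(3)          = 35
;   8-bit high:   14 + bcs(3) + sep(3) + bvs(2) + xba(3) + pha(3) + rep(3) + bvs(2) + jmp(3) = 36
;
; Subtracting the 14-cycle load leaves 9, 21 and 22 cycles, matching the fast-path and
; slow-path figures given in the sample.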
snippets        ds    32*82