; Template and utility functions for a single line of the GTE blitter. This is a memory
; hog. Because, potentially, all of the registers (X, Y, D, SP, B) are in use, we only
; have the P-register and PC for flow control. A lot of the code is replicated so that
; different pieces of code run at different addresses, which lets JMP instructions be
; used for flow control.
;
; Any JMP instruction with an address of $20XX will have its low byte set when the
; scroll position of the screen is set.  Once the scroll position is set, repeatedly calling
; the blitter will refresh the screen each time without re-applying all of the patches
; that depend on the scroll position.
;
; When called
;  * Interrupts are off
;  * Bank address is set to background 1
;  * Direct Page is set to the data field
;    + First 256 bytes are pointers to background 1 mask data
;    + Next 2048 bytes are dynamic tile data
;  * Bank 00 read
;  * Bank 01 write
;
; Each line takes up 8 KB and is aligned to a multiple of $2000
;  NOTE: Each line may only need 4 KB -- space requirements driven by snippet complexity
;
; Each 3-byte sequence in the code field is one of
;
;  PEA $0000         => F4 00 00 = %1111 0100
;  LDA 00     / PHA  => A5 00 48 = %1010 0101
;  LDA 00,x   / PHA  => B5 00 48 = %1011 0101
;  LDA (00),y / PHA  => B1 00 48 = %1011 0001
;  LDA 00,s   / PHA  => A3 00 48 = %1010 0011
;  JMP 0000          => 4C 00 00 = %0100 1100
;
; Only the JMP opcode is less than $80 and all of the others perform their work inline,
; so this gives us a fast test to help extract only the high or low byte of each word
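;
; As a worked check, bit 7 of each opcode listed above:
;
;  F4 (PEA)        = %1111 0100 => bit 7 set
;  A5 (LDA dp)     = %1010 0101 => bit 7 set
;  B5 (LDA dp,x)   = %1011 0101 => bit 7 set
;  B1 (LDA (dp),y) = %1011 0001 => bit 7 set
;  A3 (LDA sr,s)   = %1010 0011 => bit 7 set
;  4C (JMP)        = %0100 1100 => bit 7 clear
;
; A minimal sketch of such a test, not taken from the blitter itself.  It assumes an
; 8-bit accumulator; the labels `field`, `is_inline` and `is_jmp` are hypothetical.
;
;            ldal  field,x        ; load one opcode byte; the N flag mirrors bit 7
;            bmi   is_inline      ; bit 7 set: PEA or LDA/PHA, the work is done inline
;            bra   is_jmp         ; bit 7 clear: the entry is a JMP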
;
; So, before diving into the code, just how fast is it? The architecture of GTE allows simple
; things to be fast and complex things to not be slow.  That is, the developer has a lot
; of control over the time taken to render the full screen based on how complex the scene is.
;
; That said, I'll cover three cases, ranging from the simple (a single background) to the
; complex (2 backgrounds, 50% mixed).  The even- and odd-aligned cases are also broken out.
;
; Simple case; all elements of the code field are PEA instructions
;
;  Even:
;    - Start at entry_3, 8 cycles to jump into the code field
;    - 80 PEA instructions + one JMP = 403 cycles
;    - BRA and JMP = 6 cycles
;    - Final JMP to next line = 3 cycles
;      -- total of 420 cycles / line of which 400 were spent doing necessary instructions
;      -- theoretically almost 30 fps
;
;  Odd:
;    - Start at entry_3, 17 cycles to get to r_is_pea. If a second background is never used,
;      this template could be specialized and reduce this overhead to 11 cycles.
;    - 15 cycles to push the 8-bit right edge
;    - 78 PEA instructions + one JMP = 393 cycles
;    - 50% JMP to odd_exit = 1.5 cycles, amortized
;    - 24 cycles to push the 8-bit left edge
;    - Final JMP to next line = 3 cycles
;      -- total 453.5 cycles / line
;      -- theoretically 27.5 fps
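;
;  Worked sums for the simple case, using the per-step figures listed above
;  (a PEA takes 5 cycles and the JMP that closes the loop takes 3):
;
;    Even:   8 + (80 * 5 + 3) + 6 + 3              = 420   cycles
;    Odd:   17 + 15 + (78 * 5 + 3) + 1.5 + 24 + 3  = 453.5 cycles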
;
; Complex; 25% of code-field is PEA, 25% is LDA (00),y / PHA, and 50% is mixed
;
;  Even:
;    - Start at entry_3, 8 cycles to jump into the code field
;    - Code Field
;      - 20 PEA instructions =  100 cycles
;      - 20 LDA (00),y / PHA =  240 cycles
;      - 40 JMP / Fast Path  = 1040 cycles
;      - JMP loop            =    3 cycles
;    - BRA and JMP = 6 cycles
;    - Final JMP to next line = 3 cycles
;      -- total of 1,517 cycles / line of which 700 were spent doing necessary instructions
;      -- theoretically about 8 fps
;
;  Odd:
             MX    %00
entry_1      ldx   #0000          ; patch with the address of the direct page tiles. Fixed.
entry_2      ldy   #0000          ; patch with the address of the line in the second layer. Set when BG1 scroll position changes.
entry_3      lda   #0000          ; patch with the address of the right edge of the line. Set when origin position changes.
             tcs

entry_jmp    jmp   $2000
             dfb   00             ; if the screen is odd-aligned, then the opcode of entry_jmp is
;                                 ; set to $AF to convert it to a LDA long instruction.  This puts
;                                 ; the first two bytes of the instruction field in the accumulator
;                                 ; and falls through to the next instruction.
;
;                                 ; We structure the line so that the entry point only needs to
;                                 ; update the low byte of the address, which means it takes only
;                                 ; an amortized 4 cycles per line to set the entry point
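;
; A minimal sketch of the two patches described above, not the engine's actual patch
; routine.  It assumes an 8-bit accumulator and that this copy of the template is
; addressed in place; `entry_lo` is a hypothetical variable holding the new low byte
; of the entry address.
;
;            sep   #$20
;            lda   #$AF           ; odd-aligned: LDA long.  Use #$4C to restore the JMP
;            stal  entry_jmp      ; patch the opcode byte
;            lda   entry_lo
;            stal  entry_jmp+1    ; patch only the low byte of the operand
;            rep   #$20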

right_odd    bit   #$000B         ; Check the bottom nibble to quickly identify a PEA instruction
             beq   r_is_pea       ; This costs 6 cycles in the fast-path

             bit   #$0040         ; Check bit 6 to distinguish between JMP and all of the LDA variants
             bne   r_is_jmp

             stal  r_lda_patch    ; Original word is still in the accumulator.  Store it over both bytes of
r_lda_patch  dfb   00,00          ; the placeholder and execute it. We inline this here to avoid needing a BRA
;                                 ; instruction back.  So the fast-path gets a 1-cycle penalty, but we save
;                                 ; 3 cycles here.

r_is_pea     xba                  ; fast code for PEA
             sep   #$30
             pha
             rep   #$30
             jmp   $2003          ; unconditionally jump into the "next" instruction in the 
;                                 ; code field.  This is OK, even if the entry point was the
;                                 ; last instruction, because there is a JMP at the end of
;                                 ; the code field, so the code will simply jump to that
;                                 ; instruction directly.
;                                 ;
;                                 ; As with the original entry point, because all of the
;                                 ; code field is page-aligned, only the low byte needs to
;                                 ; be updated when the scroll position changes

r_is_jmp     sep   #$41           ; Set the C and V flags which tells a snippet to push only the low byte
             ldal  entry_jmp+1
             stal  r_jmp_patch+1
r_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 16 cycles, is there a better way?)

; This is the spot that needs to be page-aligned. In addition to simplifying the entry address
; so that only a byte needs to be updated instead of a word, because the code breaks out of the
; code field with a BRA instruction, we keep everything within a page to avoid the 1-cycle
; page-crossing penalty of the branch.
             jmp   odd_exit       ; +0   Alternate exit point depending on whether the left edge is 
             jmp   even_exit      ; +3   odd-aligned

loop         lup   82             ; +6   Set up 82 PEA instructions, which is 328 pixels and consumes 246 bytes
             pea   $0000          ;      This is 41 8x8 tiles in width.  Need to have N+1 tiles for screen overlap
             --^
             jmp   loop           ; +252 Ensure execution continues to loop around
             jmp   even_exit      ; +255
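;
; Worked offsets for the page-aligned block above (each PEA and JMP is 3 bytes):
;
;    +0                 jmp odd_exit
;    +3                 jmp even_exit
;    +6 + 3*k           k-th PEA for k = 0..81, so the last PEA sits at +6 + 3*81 = +249
;    +6 + 3*82 = +252   jmp loop
;    +255               jmp even_exit (its operand bytes fall at +256 and +257)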

odd_exit     lda   #0000          ; This operand field is *always* used to hold the original 2 bytes of the code field
;                                 ; that are replaced by the needed BRA instruction to exit the code field.  When the
;                                 ; left edge is odd-aligned, we are able to immediately load the value and perform
;                                 ; similar logic to the right_odd code path above

left_odd     bit   #$000B
             beq   l_is_pea

             bit   #$0040
             bne   l_is_jmp

             stal  l_lda_patch    ; Store the original word over both placeholder bytes and execute it
l_lda_patch  dfb   00,00
l_is_pea     xba
             sep   #$30
             pha
             rep   #$30
             bra   even_exit
l_is_jmp     sep   #$01           ; Set the C flag (V is always cleared at this point) which tells a snippet to push only the high byte
             ldal  entry_jmp+1
             stal  l_jmp_patch+1
l_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 13 cycles, is there a better way?)

even_exit    jmp   $0000          ; Jump to the next line.  We set up the blitter to do 8 or 16 lines at a time
;                                 ; before restoring the machine state and re-enabling interrupts.  This makes
;                                 ; the blitter interrupt-friendly, so things like a music player can continue
;                                 ; to function.
;
;                                 ; When it's time to exit, the next_entry address points to an alternate exit point

; These are the special code snippets -- there is a 1:1 relationship between each snippet space
; and a 3-byte entry in the code field. Thus, each snippet has a hard-coded JMP to return to 
; the next code field location
;
; The snippet is required to handle the odd-alignment in-line; there is no facility for
; patching or intercepting these values due to their complexity.  The only requirements
; are:
;
;  1. Carry Clear -> 16-bit write and return to the next code field operand
;  2. Carry Set 
;     a. Overflow set   -> Low 8-bit write and return to the next code field operand
;     b. Overflow clear -> High 8-bit write and exit the line
;     c. Always clear the Carry flag. It's actually OK to leave the overflow bit in
;        its passed state, because having the carry bit clear prevents evaluation of
;        the V bit.
;
; Snippet Samples:
;
; Standard Two-level Mix (27 bytes)
;
;   Optimal     = 18 cycles (LDA/AND/ORA/PHA)
;  16-bit write = 23 cycles 
;   8-bit low   = 35 cycles
;   8-bit high  = 36 cycles
;
;  start     lda  (00),y
;            and  #MASK
;            ora  #DATA         ; 14 cycles to load the data
;            bcs  8_bit
;            pha
;  out       jmp  next          ; Fast-path completes in 9 additional cycles

;  8_bit     sep  #$30          ; Switch to 8 bit mode
;            bvs  r_edge        ; Right edge (V set) skips the swap; the left edge needs the XBA
;            xba
;  r_edge    pha                ; push the value
;            rep  #$31          ; put back into 16-bit mode and clear the carry bit, as required
;            bvs  out           ; jmp out and continue if this is the right edge
;            jmp  even_exit     ; exit the line otherwise
;                               ;
;                               ; The slow paths have 21 and 22 cycles for the right and left
;                               ; odd-aligned cases respectively.
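;
;  Worked cycle counts for the three paths through the sample above, taking the
;  14-cycle load/mask figure as given:
;
;    16-bit:      14 + bcs (2, not taken) + pha (4) + jmp next (3)                = 23
;    8-bit low:   14 + bcs (3) + sep (3) + bvs (3) + pha (3) + rep (3)
;                    + bvs (3) + jmp next (3)                                     = 35
;    8-bit high:  14 + bcs (3) + sep (3) + bvs (2) + xba (3) + pha (3) + rep (3)
;                    + bvs (2) + jmp even_exit (3)                                = 36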

snippets     ds    32*82