Add old docs about theoretical GTE blitter core

2025-08-05 11:25:05 +00:00 · 2020-08-16 16:37:23 -05:00
parent a99a06f024
commit 3ba1564719
1 changed files with 208 additions and 0 deletions
--- a/src/GTE.Line.s
+++ b/src/GTE.Line.s
@@ -0,0 +1,208 @@
+; Template and utility function for a single line of the GTE blitter. This is a memory
+; hog. Because, potentially, all of the registers (X, Y, D, SP, B) are in use, we only
+; have the P-register and PC for flow control. A lot of the code is replicated so that
+; different pieces of code run at different addresses, so that JMP instructions can be used
+; for flow control.
+;
+; Any JMP instruction with an address of $20XX will have its low byte set when the
+; scroll position of the screen is set.  Once the scroll position is set, repeatedly calling
+; the blitter will refresh the screen each time without re-applying all of the patches
+; that depend on the scroll position.
+;
+; When called
+;  * Interrupts are off
+;  * Bank address is set to background 1
+;  * Direct Page is set to the data field
+;    + First 256 bytes are pointers to background 1 mask data
+;    + Next 2048 bytes are dynamic tile data
+;  * Bank 00 read
+;  * Bank 01 write
+;
+; Each line takes up 8 KB and is aligned to a multiple of $2000
+;  NOTE: Each line may only need 4 KB -- space requirements driven by snippet complexity
+;
+; Each 3-byte sequence in the code field is one of
+;
+;  PEA $0000         => F4 00 00 = %1111 0100
+;  LDA 00     / PHA  => A5 00 48 = %1010 0101
+;  LDA 00,x   / PHA  => B5 00 48 = %1011 0101
+;  LDA (00),y / PHA  => B1 00 48 = %1011 0001
+;  LDA 00,s   / PHA  => A3 00 48 = %1010 0011
+;  JMP 0000          => 4C 00 00 = %0100 1100
+;
+; Only the JMP opcode is less than $80 and all of the others perform their work inline,
+; so this gives us a fast test to help extract only the high or low byte of each word
+;
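+; As a worked check of the table above (a sketch; the mask values $0B and $40 are the ones
+; used by the right_odd / left_odd code further down), the opcode bytes separate as follows:
+;
+;  Opcode   Bit 7   AND $0B   AND $40   Instruction
+;   $F4       1       $00       $40     PEA
+;   $A5       1       $01       $00     LDA 00
+;   $B5       1       $01       $00     LDA 00,x
+;   $B1       1       $01       $00     LDA (00),y
+;   $A3       1       $03       $00     LDA 00,s
+;   $4C       0       $08       $40     JMP
+;
+; Only PEA gives zero under the $0B mask, and among the remaining opcodes only JMP has bit 6 set.
+;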
+; So, before diving into the code, just how fast is it? The architecture of GTE allows simple 
+; things to be fast and complex things to not be slow.  That is, the developer has a lot
+; of control over the time taken to render the full screen based on how complex it can be.
+;
+; That said, I'll cover three cases, ranging from the simple (a single background) to the
+; complex (2 backgrounds, 50% mixed).  The even- and odd-aligned cases are also broken out.
+;
+; Simple case; all elements of the code field are PEA instructions
+;
+;  Even:
+;    - Start at entry_3, 8 cycles to jump into the code field
+;    - 80 PEA instructions + one JMP = 403 cycles
+;    - BRA and JMP = 6 cycles
+;    - Final JMP to next line = 3 cycles
+;      -- total of 420 cycles / line of which 400 were spent doing necessary instructions
+;      -- theoretically almost 30 fps
+;
+;  Odd:
+;    - Start at entry_3, 17 cycles to get to r_is_pea. If a second background is never used,
+;      this template could be specialized and reduce this overhead to 11 cycles.
+;    - 15 cycles to push the 8-bit right edge
+;    - 78 PEA instructions + one JMP = 393 cycles
+;    - 50% JMP to odd_exit = 1.5 cycles, amortized
+;    - 24 cycles to push the 8-bit left edge
+;    - Final JMP to next line = 3 cycles
+;      -- total 453.5 cycles / line
+;      -- theoretically 27.5 fps
+;
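+; The totals above can be re-derived as follows, assuming 5 cycles per PEA and 3 cycles
+; per JMP (the per-instruction costs implied by the figures in the lists):
+;
+;  Even:  80 * 5 + 3 = 403      8 + 403 + 6 + 3              = 420
+;  Odd:   78 * 5 + 3 = 393      17 + 15 + 393 + 1.5 + 24 + 3 = 453.5
+;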
+; Complex; 25% of code-field is PEA, 25% is LDA (00),y / PHA, and 50% is mixed
+;
+;  Even:
+;    - Start at entry_3, 8 cycles to jump into the code field
+;    - Code Field
+;      - 20 PEA instructions =  100 cycles
+;      - 20 LDA (00),y / PHA =  240 cycles
+;      - 40 JMP / Fast Path  = 1040 cycles
+;      - JMP loop            =    3 cycles
+;    - BRA and JMP = 6 cycles
+;    - Final JMP to next line = 3 cycles
+;      -- total of 1,517 cycles / line of which 700 were spent doing necessary instructions
+;      -- theoretically about 8 fps
+
+entry_1      ldx   #0000          ; patch with the address of the direct page tiles. Fixed.
+entry_2      ldy   #0000          ; patch with the address of the line in the second layer. Set when BG1 scroll position changes.
+entry_3      lda   #0000          ; patch with the address of the right edge of the line. Set when origin position changes.
+             tcs
+
+entry_jmp    jmp   $2000
+             dfb   00             ; If the screen is odd-aligned, then the opcode is set to 
+;                                 ; $AF to convert to a LDA long instruction.  This puts the
+;                                 ; first two bytes of the instruction field in the accumulator
+;                                 ; and falls through to the next instruction.
+;
+;                                 ; We structure the line so that the entry point only needs to
+;                                 ; update the low byte of the address. This means it takes only
+;                                 ; an amortized 4 cycles per line to set the entry point.
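+;
+;                                 ; A minimal sketch of that per-line patch, assuming the data bank
+;                                 ; register points at this code and that ENTRY_LO (a hypothetical
+;                                 ; name) holds the low byte of the code field entry address:
+;
+;                                 ;   sep  #$20           ; 8-bit accumulator, done once for all lines
+;                                 ;   lda  #ENTRY_LO      ; hypothetical constant, not defined in this file
+;                                 ;   sta  entry_jmp+1    ; 4 cycles, repeated for each line's copy
+;                                 ;   rep  #$20           ; back to 16-bit, done once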
+
+right_odd    bit   #$000B         ; Check the bottom nibble to quickly identify a PEA instruction
+             beq   r_is_pea       ; This costs 6 cycles in the fast-path
+
+             bit   #$0040         ; Check bit 6 to distinguish between JMP and all of the LDA variants
+             bne   r_is_jmp
+
+             stal  r_lda_patch    ; Original word (opcode + operand) is still in the accumulator.  Execute it. We inline 
+r_lda_patch  dfb   00,00          ; this here to avoid needing a BRA instruction back.  So the fast-path
+;                                 ; gets a 1-cycle penalty, but we save 3 cycles here.
+
+r_is_pea     xba                  ; fast code for PEA
+             sep   #$30
+             pha
+             rep   #$30
+             jmp   $2003          ; unconditionally jump into the "next" instruction in the 
+;                                 ; code field.  This is OK, even if the entry point was the
+;                                 ; last instruction, because there is a JMP at the end of
+;                                 ; the code field, so the code will simply jump to that
+;                                 ; instruction directly.
+;                                 ;
+;                                 ; As with the original entry point, because all of the
+;                                 ; code field is page-aligned, only the low byte needs to
+;                                 ; be updated when the scroll position changes
+
+r_is_jmp     sep   #$41           ; Set the C and V flags which tells a snippet to push only the low byte
+             ldal  entry_jmp+1
+             stal  r_jmp_patch+1
+r_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 13 cycles, is there a better way?)
+
+; This is the spot that needs to be page-aligned. In addition to simplifying the entry address
+; and only needing to update a byte instead of a word, because the code breaks out of the
+; code field with a BRA instruction, we keep everything within a page to avoid the 1-cycle
+; page-crossing penalty of the branch.
+             jmp   odd_exit       ; +0   Alternate exit point depending on whether the left edge is 
+             jmp   even_exit      ; +3   odd-aligned
+
+loop         lup   82             ; +6   Set up 82 PEA instructions, which is 328 pixels and consumes 246 bytes
+             pea   $0000          ;      This is 41 8x8 tiles in width.  Need to have N+1 tiles for screen overlap
+             --^
+             jmp   loop           ; +252 Ensure execution continues to loop around
+             jmp   even_exit      ; +255
+
+odd_exit     lda   #patch         ; This operand field is *always* used to hold the original 2 bytes of the code field
+;                                 ; that are replaced by the needed BRA instruction to exit the code field.  When the
+;                                 ; left edge is odd-aligned, we are able to immediately load the value and perform
+;                                 ; similar logic to the right_odd code path above
+
+left_odd     bit   #$000B
+             beq   l_is_pea
+
+             bit   #$0040
+             bne   l_is_jmp
+
+             stal  l_lda_patch    ; Store the 2-byte LDA instruction into the patch slot below and execute it
+l_lda_patch  dfb   00,00
+l_is_pea     xba
+             sep   #$30
+             pha
+             rep   #$30
+             bra   even_exit
+l_is_jmp     sep   #$01           ; Set the C flag (V is always cleared at this point) which tells a snippet to push only the high byte
+             ldal  entry_jmp+1
+             stal  l_jmp_patch+1
+l_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 13 cycles, is there a better way?)
+
+even_exit    jmp   next_entry     ; Jump to the next line.  We set up the blitter to do 8 or 16 lines at a time
+;                                 ; before restoring the machine state and re-enabling interrupts.  This makes
+;                                 ; the blitter interrupt friendly to allow things like music player to continue
+;                                 ; to function.
+;
+;                                 ; When it's time to exit, the next_entry address points to an alternate exit point
+
+; These are the special code snippets -- there is a 1:1 relationship between each snippet space
+; and a 3-byte entry in the code field. Thus, each snippet has a hard-coded JMP to return to 
+; the next code field location
+;
+; The snippet is required to handle the odd-alignment in-line; there is no facility for
+; patching or intercepting these values due to their complexity.  The only requirements
+; are:
+;
+;  1. Carry Clear -> 16-bit write and return to the next code field operand
+;  2. Carry Set 
+;     a. Overflow set   -> Low 8-bit write and return to the next code field operand
+;     b. Overflow clear -> High 8-bit write and exit the line
+;     c. Always clear the Carry flag. It's actually OK to leave the overflow bit in 
+;        its passed state, because having the carry bit clear prevents evaluation of
+;        the V bit.
+;
+; Snippet Samples:
+;
+; Standard Two-level Mix (27 bytes)
+;
+;   Optimal     = 18 cycles (LDA/AND/ORA/PHA)
+;  16-bit write = 23 cycles 
+;   8-bit low   = 35 cycles
+;   8-bit high  = 36 cycles
+;
+;  start     lda  (00),y
+;            and  #MASK
+;            ora  #DATA         ; 14 cycles to load the data
+;            bcs  8_bit
+;            pha
+;  out       jmp  next          ; Fast-path completes in 9 additional cycles
+
+;  8_bit     sep  #$30          ; Switch to 8 bit mode
+;            bvs  r_edge        ; Need to switch if doing the left edge
+;            xba
+;  r_edge    pha                ; push the value
+;            rep  #$31          ; put back into 16-bit mode and clear the carry bit, as required
+;            bvs  out           ; jmp out and continue if this is the right edge
+;            jmp  even_exit     ; exit the line otherwise
+;                               ;
+;                               ; The slow paths have 21 and 22 cycles for the right and left
+;                               ; odd-aligned cases respectively.
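+;                               ;
+;                               ; Worked totals, assuming the usual 65816 native-mode timings
+;                               ; (branch 2 not taken / 3 taken, SEP/REP/XBA 3, JMP 3, PHA 4 in
+;                               ; 16-bit mode and 3 in 8-bit mode) on top of the 14-cycle load:
+;                               ;
+;                               ;  16-bit      14 + 2 + 4 + 3                      = 23
+;                               ;   8-bit low  14 + 3 + 3 + 3 + 3 + 3 + 3 + 3      = 35
+;                               ;   8-bit high 14 + 3 + 3 + 2 + 3 + 3 + 3 + 2 + 3  = 36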
+
+