iigs-game-engine/ref/GTE.Line.s

; Template and utility function for a single line of the GTE blitter. This is a memory
; hog. Because, potentially, all of the registers (X, Y, D, SP, B) are in use, we only
; have the P-register and PC for flow control. A lot of the code is replicated so that
; different pieces of code run at different addresses, so that JMP instructions can be used
; for flow control.
;
; Any JMP instruction with an address of $20XX will have its low byte set when the 
; scroll position of the screen is set.  If the scroll position is set, repeatedly calling
; the blitter will refresh the screen each time without re-applying all of the patches
; that depend on the scroll position.
;
; When called
;  * Interrupts are off
;  * Bank address is set to background 1
;  * Direct Page is set to the data field
;    + First 256 bytes are pointers to background 1 mask data
;    + Next 2048 bytes are dynamic tile data
;  * Bank 00 read
;  * Bank 01 write
;
; Each line takes up 8kb and is aligned to a multiple of $2000
;  NOTE: Each line may only need 4kb -- space requirements driven by snippet complexity
;
; Each 3-byte sequence in the code field is one of
;
;  PEA $0000         => F4 00 00 = %1111 0100
;  LDA 00     / PHA  => A5 00 48 = %1010 0101
;  LDA 00,x   / PHA  => B5 00 48 = %1011 0101
;  LDA (00),y / PHA  => B1 00 48 = %1011 0001
;  LDA 00,s   / PHA  => A3 00 48 = %1010 0011
;  JMP 0000          => 4C 00 00 = %0100 1100
;
; Only the JMP opcode is less than $80 and all of the others perform their work inline,
; so this gives us a fast test to help extract only the high or low byte of each word
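;
; As a worked check, the table below applies the two BIT masks used by right_odd and
; left_odd further down ($000B, then $0040) to each opcode listed above. It assumes the
; opcode sits in the low byte of the 16-bit accumulator, which is what a little-endian
; load of the first two bytes of a code field entry gives:
;
;  Opcode             AND $0B   AND $40   Classified as
;  F4  PEA              $00       $40     PEA (first BEQ is taken, bit 6 is never tested)
;  A5  LDA 00           $01       $00     LDA variant
;  B5  LDA 00,x         $01       $00     LDA variant
;  B1  LDA (00),y       $01       $00     LDA variant
;  A3  LDA 00,s         $03       $00     LDA variant
;  4C  JMP              $08       $40     JMP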
;
; So, before diving into the code, just how fast is it? The architecture of GTE allows simple 
; things to be fast and complex things to not be slow.  That is, the developer has a lot
; of control over the time taken to render the full screen based on how complex it can be.
;
; That said, I'll cover three cases, ranging from the simple (a single background) to the
; complex (2 backgrounds, 50% mixed).  The even- and odd-aligned cases are also broken out.
;
; Simple case; all elements of the code field are PEA instructions
;
;  Even:
;    - Start at entry_3, 8 cycles to jump into the code field
;    - 80 PEA instructions + one JMP = 403 cycles
;    - BRA and JMP = 6 cycles
;    - Final JMP to next line = 3 cycles
;      -- total of 420 cycles / line of which 400 were spent doing necessary instructions
;      -- theoretically almost 30 fps
;
;  Odd:
;    - Start at entry_3, 17 cycles to get to r_is_pea. If a second background is never used,
;      this template could be specialized and reduce this overhead to 11 cycles.
;    - 15 cycles to push the 8-bit right edge
;    - 78 PEA instructions + one JMP = 393 cycles
;    - 50% JMP to odd_exit = 1.5 cycles, amortized
;    - 24 cycles to push the 8-bit left edge
;    - Final JMP to next line = 3 cycles
;      -- total 453.5 cycles / line
;      -- theoretically 27.5 fps
;
; Complex; 25% of code-field is PEA, 25% is LDA (00),y / PHA, and 50% is mixed
;
;  Even:
;    - Start at entry_3, 8 cycles to jump into the code field
;    - Code Field
;      - 20 PEA instructions =  100 cycles
;      - 20 LDA (00),y / PHA =  240 cycles
;      - 40 JMP / Fast Path  = 1040 cycles
;      - JMP loop            =    3 cycles
;    - BRA and JMP = 6 cycles
;    - Final JMP to next line = 3 cycles
;      -- total of 1,517 cycles / line of which 700 were spent doing necessary instructions
;      -- theoretically about 8 fps
;
;  Odd:
             MX    %00
entry_1      ldx   #0000          ; patch with the address of the direct page tiles. Fixed.
entry_2      ldy   #0000          ; patch with the address of the line in the second layer. Set when BG1 scroll position changes.
entry_3      lda   #0000          ; patch with the address of the right edge of the line. Set when origin position changes.
             tcs

entry_jmp    jmp   $2000
             dfb   00             ; if the screen is odd-aligned, then the opcode is set to 
;                                 ; $AF to convert to a LDA long instruction.  This puts the
;                                 ; first two bytes of the instruction field in the accumulator
;                                 ; and falls through to the next instruction.
;
;                                 ; We structure the line so that the entry point only needs to
;                                 ; update the low-byte of the address, which means it takes only
;                                 ; an amortized 4 cycles per line to set the entry point

right_odd    bit   #$000B         ; Check the bottom nibble to quickly identify a PEA instruction
             beq   r_is_pea       ; This costs 6 cycles in the fast-path

             bit   #$0040         ; Check bit 6 to distinguish between JMP and all of the LDA variants
             bne   r_is_jmp

             stal  r_lda_patch+1  ; Original word is still in the accumulator.  Execute it. We inline 
r_lda_patch  dfb   00,00          ; this here to avoid needing a BRA instruction back.  So the fast-path
;                                 ; gets a 1-cycle penalty, but we save 3 cycles here.

r_is_pea     xba                  ; fast code for PEA
             sep   #$30
             pha
             rep   #$30
             jmp   $2003          ; unconditionally jump into the "next" instruction in the 
;                                 ; code field.  This is OK, even if the entry point was the
;                                 ; last instruction, because there is a JMP at the end of
;                                 ; the code field, so the code will simply jump to that
;                                 ; instruction directly.
;                                 ;
;                                 ; As with the original entry point, because all of the
;                                 ; code field is page-aligned, only the low byte needs to
;                                 ; be updated when the scroll position changes

r_is_jmp     sep   #$41           ; Set the C and V flags which tells a snippet to push only the low byte
             ldal  entry_jmp+1
             stal  r_jmp_patch+1
r_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 16 cycles, is there a better way?)

; This is the spot that needs to be page-aligned. In addition to simplifying the entry address
; and only needing to update a byte instead of a word, because the code breaks out of the
; code field with a BRA instruction, we keep everything within a page to avoid the 1-cycle
; page-crossing penalty of the branch.
             jmp   odd_exit       ; +0   Alternate exit point depending on whether the left edge is 
             jmp   even_exit      ; +3   odd-aligned

loop         lup   82             ; +6   Set up 82 PEA instructions, which is 328 pixels and consumes 246 bytes
             pea   $0000          ;      This is 41 8x8 tiles in width.  Need to have N+1 tiles for screen overlap
             --^
             jmp   loop           ; +252 Ensure execution continues to loop around
             jmp   even_exit      ; +255

odd_exit     lda   #0000          ; This operand field is *always* used to hold the original 2 bytes of the code field
;                                 ; that are replaced by the needed BRA instruction to exit the code field.  When the
;                                 ; left edge is odd-aligned, we are able to immediately load the value and perform
;                                 ; similar logic to the right_odd code path above

left_odd     bit   #$000B
             beq   l_is_pea

             bit   #$0040
             bne   l_is_jmp

             stal  l_lda_patch+1
l_lda_patch  dfb   00,00
l_is_pea     xba
             sep   #$30
             pha
             rep   #$30
             bra   even_exit
l_is_jmp     sep   #$01           ; Set the C flag (V is always cleared at this point) which tells a snippet to push only the high byte
             ldal  entry_jmp+1
             stal  l_jmp_patch+1
l_jmp_patch  dfb   $4C,$00,$00    ; Jump back to address in entry_jmp (this takes 13 cycles, is there a better way?)

even_exit    jmp   $0000          ; Jump to the next line.  We set up the blitter to do 8 or 16 lines at a time
;                                 ; before restoring the machine state and re-enabling interrupts.  This makes
;                                 ; the blitter interrupt friendly to allow things like a music player to continue
;                                 ; to function.
;
;                                 ; When it's time to exit, the next_entry address points to an alternate exit point

; These are the special code snippets -- there is a 1:1 relationship between each snippet space
; and a 3-byte entry in the code field. Thus, each snippet has a hard-coded JMP to return to 
; the next code field location
;
; The snippet is required to handle the odd-alignment in-line; there is no facility for
; patching or intercepting these values due to their complexity.  The only requirements
; are:
;
;  1. Carry Clear -> 16-bit write and return to the next code field operand
;  2. Carry Set 
;     a. Overflow set   -> Low 8-bit write and return to the next code field operand
;     b. Overflow clear -> High 8-bit write and exit the line
;     c. Always clear the Carry flag. It's actually OK to leave the overflow bit in
;        its passed state, because having the carry bit clear prevents evaluation of
;        the V bit.
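;
; Read against the dispatch code in this file, the rules map onto the entry paths as
; sketched below. This only restates the SEP instructions in r_is_jmp and l_is_jmp; the
; first row assumes the carry is clear whenever a JMP is executed from inside the
; code field, which this file relies on but does not set explicitly:
;
;  Entered from            C  V   Snippet action
;  JMP in the code field   0  -   16-bit push, return to the next code field entry
;  r_is_jmp (sep #$41)     1  1   push the low byte, return to the next code field entry
;  l_is_jmp (sep #$01)     1  0   push the high byte, JMP to even_exit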
;
; Snippet Samples:
;
; Standard Two-level Mix (27 bytes)
;
;   Optimal      = 18 cycles (LDA/AND/ORA/PHA)
;   16-bit write = 23 cycles
;   8-bit low    = 35 cycles
;   8-bit high   = 36 cycles
;
;  start     lda  (00),y
;            and  #MASK
;            ora  #DATA         ; 14 cycles to load the data
;            bcs  8_bit
;            pha
;  out       jmp  next          ; Fast-path completes in 9 additional cycles

;  8_bit     sep  #$30          ; Switch to 8 bit mode
;            bvs  r_edge        ; Need to switch if doing the left edge
;            xba
;  r_edge    pha                ; push the value
;            rep  #$31          ; put back into 16-bit mode and clear the carry bit, as required
;            bvs  out           ; jmp out and continue if this is the right edge
;            jmp  even_exit     ; exit the line otherwise
;                               ;
;                               ; The slow paths have 21 and 22 cycles for the right and left
;                               ; odd-aligned cases respectively.

snippets     ds    32*82