8BITCOIN/HASH.s


DSK HASH
**************************************************
* Note:
* If the code looks inefficient and redundant, it is.
* Lots of things that should loop are unrolled for speed.
* Built with Merlin32 from Brutal Deluxe: brutaldeluxe.fr/products/crossdevtools/merlin/
*
* Timing measurements include setup, hashing and printing result.
*
* V1: 947,063 cycles.
*
* Macros: converted several common subroutines to macros
* for additional speed, at expense of memory size.
* Currently runs 1 hash in 922,394 cycles.
*
* Zero page: moved some variables to zero page from Memory
* Down to 916,595 cycles.
*
* Further optimized, more macros: 875,166
* Further optimized: 804,584
* Unrolled shift routines: 682,744 (Now, it's getting good.)
* Pulled out all the stops: 671,524 (Time to hand off to Qkumba.)
* What's being hashed is the 80-byte block header:
* version (4) + previous block hash (32) + Merkle root (32) + time (4) + bits/target (4) + nonce (4)
* After padding, the 80 bytes are split into two 64-byte message blocks.
* The header is hashed twice: sha256(sha256(blockheader))
* https://en.wikipedia.org/wiki/SHA-2
* http://www.yogh.io/#mine:last
* https://en.bitcoin.it/wiki/Block_hashing_algorithm
* https://en.bitcoin.it/wiki/Protocol_documentation
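* Byte offsets within the 80-byte header, worked out from the field sizes listed above:
*   0-3 version, 4-35 previous block hash, 36-67 Merkle root,
*   68-71 time, 72-75 bits, 76-79 nonce
* Message block 0 holds header bytes 0-63; block 1 holds bytes 64-79 plus padding.
* The nonce therefore sits entirely in block 1, so block 0 is the same for every nonce.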
* total time to run a complete hash (three SHA-256 chunk compressions: two on pass 1, one on pass 2): 1,994,362 cycles
* more loops to macros (LUP): 1,906,858
* 1,559,735
* Qkumba optimization, part 1: 1,248,727
* 1,238,169
* 1,218,659
* added nonce display, unrolled a couple of loops 1,210,817
* where next? PRBYTE is 255
* CROUT is 3,248 (!)
* printing the result is 19,396 cycles alone!
* down to 11,455 now
* RORx8 into Byte 1,096,763
* more macros. 1,068,543
*
* instead of CROUTx2, set cursor to top of window and print over top? 1,055,753
* whoa 929,375
* rotating Vn macro 891,692
* MAJ and CHOICE optimized 831,999
* XOR32 to combined macro 796,115
* 780,145
* unrolled HASHTOMESSAGE 779,427 = 103.765 years
* who needs 1 VTAB? 779,255
* or 2 CROUTs? 778,751
*
* LDVV - MACROS with parameters. 775,323
* macros with parameters? Mind. Blown.
* target: 751,237 = one full sweep of all 2^32 nonces in 100 years.
* unrolled COPYCHUNK1 714,904
* replaced prbyte with my own: 708,036
* special temp/s0 macros: 662,220
* LDAW, LDWADDX, LDS, inline JSRs 627,428
* caching the result of chunk0 on pass0: 419,137
* QKUMBA SAYS "IT'S A BIT FASTER" 269,329
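* How the year figures above work out (assuming a 1.023 MHz clock and
* 365-day years, an assumption that reproduces the numbers in this log):
*   years = cycles * 2^32 / 1,023,000 / 31,536,000
*   779,427 cycles -> about 103.765 years
*   751,237 cycles -> about 100 years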
**************************************************
* Variables
**************************************************
INPUT32 EQU $E0 ; DS 4 ; 32-bit Accumulator
XREGISTER32 EQU $E4 ; DS 4 ; input 1 for XOR, etc (X)
YREGISTER32 EQU $E8 ; DS 4 ; input 2 for MAJ, etc (Y)
RESULT32 EQU $EC ; DS 4 ; temp storage for various operations
CURRENTCHUNK EQU $FF ; chunk zero or one.
HASHPASS EQU $FE ; pass zero or one.
CURRENTMESSAGELO EQU $FC
CURRENTMESSAGEHI EQU $FD
S0 EQU $80
S1 EQU $84
TEMP0 EQU $88 ; temp storage for various operations
TEMP1 EQU $8C ; temp storage for various operations
HASHCACHED EQU $D0 ; is the first pass already done, and the result cached in CACHEDHASH?
**************************************************
* Apple Standard Memory Locations
**************************************************
CLRLORES EQU $F832
LORES EQU $C050
TXTSET EQU $C051
MIXCLR EQU $C052
MIXSET EQU $C053
TXTPAGE1 EQU $C054
TXTPAGE2 EQU $C055
KEY EQU $C000
C80STOREOFF EQU $C000
C80STOREON EQU $C001
STROBE EQU $C010
SPEAKER EQU $C030
VBL EQU $C02E
RDVBLBAR EQU $C019 ; not VBL (VBL signal low)
WAIT EQU $FCA8
RAMWRTAUX EQU $C005
RAMWRTMAIN EQU $C004
SETAN3 EQU $C05E ; Set annunciator-3 output to 0
SET80VID EQU $C00D ; enable 80-column display mode (WR-only)
HOME EQU $FC58 ; clear the text screen
VTAB EQU $FC22 ; Sets the cursor vertical position (from CV)
COUT EQU $FDED ; Calls the output routine whose address is stored in CSW (normally COUT1)
;COUT1 EQU $FDF0 ; standard screen output routine
CROUT EQU $FD8E ; prints CR
STROUT EQU $DB3A ;Y=String ptr high, A=String ptr low
PRBYTE EQU $FDDA ; print hex byte in A
ALTTEXT EQU $C055
ALTTEXTOFF EQU $C054
PB0 EQU $C061 ; paddle 0 button. high bit set when pressed.
PDL0 EQU $C064 ; paddle 0 value, or should I use PREAD?
PREAD EQU $FB1E
ROMINIT EQU $FB2F
ROMSETKBD EQU $FE89
ROMSETVID EQU $FE93
ALTCHAR EQU $C00F ; enables alternative character set - mousetext
CH EQU $24 ; cursor Horiz
CV EQU $25 ; cursor Vert
WNDWDTH EQU $21 ; Width of text window
WNDTOP EQU $22 ; Top of text window
**************************************************
* START - sets up various fiddly zero page bits
**************************************************
ORG $2000 ; PROGRAM DATA STARTS AT $2000
JSR HOME ; clear screen
STA $C050 ; rw:TXTCLR ; Set Lo-res page 1, mixed graphics + text
STA $C053 ; rw:MIXSET
STA $C054 ; rw:TXTPAGE1
STA $C056 ; rw:LORES
JSR FILLSCREENFAST ; blanks screen to black.
JSR SPLASHSCREEN ; fancy lo-res graphics
LDA #$00
STA HASHCACHED ; clear cache status
JSR FLIPCOIN
**************************************************
* SETUP
**************************************************
*
* Initialize hash values:
* (first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19):
* See HTABLE
*
* Initialize array of round constants:
* (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
* See KTABLE
*
* Pre-processing (Padding):
* begin with the original message of length L bits (80*8 = 640bits)
* append a single '1' bit (641bits)
* in byte terms: append the byte $80 (the '1' bit plus the first 7 '0' bits) as byte 81
* append K '0' bits, where K is the minimum number >= 0 such that L + 1 + K + 64 is a multiple of 512 (640+1+K+64=1024 K=319)
* append L as a 64-bit big-endian integer, making the total post-processed length a multiple of 512 bits (append 0000000000000280)
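* Worked out for the two messages hashed here (a sketch of the arithmetic only):
*   pass 1, 80-byte header: 640 + 1 + 319 + 64 = 1024 bits = 2 chunks, length field = $0280
*   pass 2, 32-byte hash:   256 + 1 + 191 + 64 =  512 bits = 1 chunk,  length field = $0100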
**************************************************
* Pre-processing (Padding):
**************************************************
; Start with MESSAGE padded out to 1024bits (see MESSAGE below)
* Process the message in successive 512-bit chunks:
* break message into 512-bit chunks
* 80-byte header yields a 1024-bit message, so chunks = 2
* Cache the result of the first chunk (it does not depend on the nonce), so that
* later hashes restore the cached value and only process chunk 1.
PREPROCESS
LDA #$00
STA HASHPASS ; pass the first = 0
STA CURRENTCHUNK ; chunk the first = 0
LDA MESSAGELO
STA CURRENTMESSAGELO
LDA MESSAGEHI
STA CURRENTMESSAGEHI
INITIALIZEHASH ; for the 32 bytes in INITIALHASH, push them into H00-H07
INITIALHASHES
]hashnumber = 31
LUP 32
LDA INITIALHASH + ]hashnumber
STA H00 + ]hashnumber
]hashnumber = ]hashnumber - 1
--^
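* For comparison, a looped equivalent of the unrolled copy above
* (a sketch only, left commented out; smaller, but slower per byte):
;; LDX #31
;;INITLOOP LDA INITIALHASH,X ; INITLOOP is a hypothetical label, not used elsewhere
;; STA H00,X
;; DEX
;; BPL INITLOOP ; loop until X rolls over from 0 to $FF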
* for each chunk
* create a 64-entry message schedule array w[0..63] of 32-bit words
* (The initial values in w[0..63] don't matter, so many implementations zero them here)
* See WTABLE
* copy chunk into first 16 words w[0..15] of the message schedule array
COPYCHUNKS
CHECKCACHE
; if HASHCACHED == 1
; AND chunk=0 AND pass=0
; then read from CACHEDHASH
LDA HASHCACHED ; has chunk0 pass0 already been done?
BEQ NOTCACHED
CACHEDONE LDA HASHPASS ; pass = 0
ORA CURRENTCHUNK ; chunk = 0
BEQ CACHETOHASH
NOTCACHED JMP NOCACHE
CACHETOHASH
]cachebyte = 0
LUP 32
LDA CACHEDHASH + ]cachebyte
STA H00 + ]cachebyte
]cachebyte = ]cachebyte+1
--^
JMP CHECKCHUNK
NOCACHE
LDA CURRENTCHUNK ; which chunk?
BNE NEXTCHUNK ; skip chunk0 if already done
LDY #$3F ; Y counts 63 down to 0 (chunk 0 only; chunk 1 is unrolled below)
COPYCHUNK0 LDA (CURRENTMESSAGELO),Y
STA W00,Y
DEY
BPL COPYCHUNK0 ; if hasn't rolled over to FF, loop to copy next byte.
***** if I'm on second pass, only do chunk0
; HASHPASS = 1, add to CURRENTCHUNK?
LDA HASHPASS
STA CURRENTCHUNK
***** if I'm on second pass, only do chunk0
JMP EXTENDWORDS ; done with chunk 0
NEXTCHUNK
**** Only does this (second chunk) on first pass. So CURRENTMESSAGE always points to MESSAGE (never MESSAGE2)
]chunkbyte = 64
LUP 64
COPYCHUNK1 LDA MESSAGE + ]chunkbyte
STA W00 - 64 + ]chunkbyte ;
]chunkbyte = ]chunkbyte + 1
--^
**** Only does this (second chunk) on first pass.
* Extend the first 16 words into the remaining 48 words w[16..63] of the message schedule array:
* for i from 16 to 63
* s0 = (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3)
* s0 = (XREGISTER32) xor (YREGISTER32) xor (INPUT32)
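* Indexing used below: X holds i*4, the byte offset of w[i] from W00.
* So W00 - 60,X is w[i-15] (15*4 = 60) and W00 - 8,X is w[i-2] (2*4 = 8).
* e.g. i = 16: X = 64, so W00 - 60,X reads W00 + 4 = w[1]
*      and W00 - 8,X reads W00 + 56 = w[14].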
EXTENDWORDS
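LDX #60 ; 15*4: X holds i*4, the byte offset of w[i]
EXTEND TXA
CLC
ADC #$04 ; next word: first time through, A = 64 = 16*4
;;CMP #$40 ; compare to 64*4
BNE EXTEND2 ; A wraps to 0 after w[63] (252+4 = 256), so no CMP is needed
JMP INITIALIZE ; done with EXTEND step (done through 63)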
EXTEND2 TAX
;;SEC ; set carry for subtract
;;SBC #$0F ; -15
LDXWR15 ; takes X as arg. load w[i-15] into XREGISTER32 and ROR32
RIGHTROTATEX7 LUP 6
RIGHTROTATEX32 ; ROR32 6 more times
--^
STA XREGISTER32
;;TAX32 ; should store partial result at XREGISTER32
RIGHTROTATE18 RIGHTROTATEXY8 ; copy from XREGISTER32 into YREGISTER32 and ROR32 9 times
LUP 2
RIGHTROTATEY32 ; ROR32 2 more times
--^
STA YREGISTER32
;;TAY32 ; should store partial result at YREGISTER32
; X still = X*4
LDAWR15 ; load W[a-15] into INPUT32
RIGHTSHIFT3 LUP 2
RIGHTSHIFT32 ; shift right, ignore carry
--^
; store partial result in INPUT32
* s0 = (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3)
* s0 = (XREGISTER32) xor (YREGISTER32) xor (INPUT32)
XORAXY32T0
; A32 -> TEMP0
;;STATEMP0
* s1 := (w[i- 2] rightrotate 17) xor (w[i- 2] rightrotate 19) xor (w[i- 2] rightshift 10)
;;SEC ; set carry for subtract
;;SBC #$02 ; -02
LDXWR2 ; load w[i-2] into XREGISTER32, rotated right 17 (16 by byte swap + 1 bit)
STA XREGISTER32
RIGHTROTATE17 TXYR32 ; copy XREGISTER32 to YREGISTER32 and ROR32
RIGHTROTATE2 RIGHTROTATEY32 ; ROR32 1 more time
STA YREGISTER32
;;TAY32 ; should store partial result at YREGISTER32
; ; X = X*4
LDAWS248 ; load w[i-2] into INPUT32, shifted right 9 (8 by byte offset + 1 bit)
RIGHTSHIFT10 ;;RIGHTSHIFT8
;;LUP 2
RIGHTSHIFT24 ; shift right, ignore carry
;;--^
; store partial result in INPUT32
* s1 := (w[i- 2] rightrotate 17) xor (w[i- 2] rightrotate 19) xor (w[i- 2] rightshift 10)
* s1 := (XREGISTER32) xor (YREGISTER32) xor (INPUT32)
* w[i] := w[i-16] + s0 + w[i-7] + s1
* w[i] := w[i-16] + TEMP0 + w[i-7] + INPUT32
* w[i] := w[i-16] + INPUT32 + w[i-7] + XREGISTER32
CLC
XORAXYADD24
;;SEC
;;SBC #$10 ; w[0]
; load W00 into pointer, add with X32, store to X32
LDWADDXX16 ; takes X
;;TAX32 ; transfer to XREGISTER32
;;SEC
;;SBC #$07 ; w[09]
; load W09 into pointer, add with X32
; store result in w[i]
LDWADDX7STA32 ; takes X, store in W16
STOREW ;;LDWSTA32 ; store in W16
JMP EXTEND ; repeat until i=63
INITIALIZE
* Initialize working variables to current hash value:
* Va := h00
* Vb := h01
* Vc := h02
* Vd := h03
* Ve := h04
* Vf := h05
* Vg := h06
* Vh := h07
HASHTOV
]bytenumber = 0
LUP 32
HTOV LDA H00 + ]bytenumber
STA VA + ]bytenumber
]bytenumber = ]bytenumber + 1
--^
**************************************************
* MAIN LOOP. OPTIMIZE THIS.
**************************************************
COMPRESSION
* Compression function main loop:
* for i from 0 to 63
LDA #$00
COMPRESS TAX
* S1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
LDVLDXR32 4 ; pointer to VE, ROR32
RIGHTROTATE06 LUP 5
RIGHTROTATEX32 ; ROR32 5 more times = 6
--^
STA XREGISTER32
;;TAX32 ; result in XREGISTER32
TXYR32
RIGHTROTATE11 LUP 4
RIGHTROTATEY32 ; ROR32 4 more times (after the 1 in TXYR32) = 11
--^
STA YREGISTER32
;;TAY32 ; result in YREGISTER32
RIGHTROTATE25 RIGHTROTATEYA8
LUP 5
RIGHTROTATEA32 ; rotate right, 14 more in total = 25
--^
* S1 := (XREGISTER32) xor (YREGISTER32) xor (INPUT32)
XORAXY32S1
;S1
;;STAS1 ; store INPUT32 in S1
**** CHOICE and MAJ always take the same 3 arguments - make macros
* ch := (e and f) xor ((not e) and g)
; CH in INPUT32
* temp1 := Vh + S1 + ch + k[i] + w[i] = TEMP0
CHOICE32ADD
; S1 + CH
;;LDSADC32 4 ; (S1 + ch) in INPUT32
; + VH
LDVHADC32
LDKADC32 ; K[i] in pointer
; + K[i]
LDWADCS0 ; W[i] in pointer
; + W[i]
; LDXADC32 ; (S1 + ch + VH + k[i] + w[i]) in INPUT32
; = TEMP0
;;STATEMP0 ; store temp1 at TEMP0
* S0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
LDVLDXR32 0 ; pointer to VA, ROR32
RIGHTROTATE02 ;;LUP 2
RIGHTROTATEX32 ; ROR 2 times
;;--^
STA XREGISTER32
;;TAX32 ; result in XREGISTER32
RIGHTROTATE13 RIGHTROTATEXY8
LUP 2
RIGHTROTATEY32 ; ROR 11 more times=13
--^
STA YREGISTER32
;;TAY32 ; result in YREGISTER32
RIGHTROTATE22 RIGHTROTATE8
RIGHTROTATEA32 ; ROR 9 more times=22
* S0 := (XREGISTER32) xor (YREGISTER32) xor (INPUT32)
XORAXY32S0
;S0
;;STAS0 ; store INPUT32 in S0
**** CHOICE and MAJ always take the same 3 arguments - make macros
* maj := (a and b) xor (a and c) xor (b and c)
* temp2 := S0 + maj
* temp2 := S0 + INPUT32
; load A,B,C into A32,X32,Y32
MAJ32ADDT1 ; MAJ in INPUT32
; load S0 into X32
;S0 -> X32
;;LDA STABLELO ; takes X as argument
;;STA $00
;;LDA STABLEHI
;;STA $01 ; now word/pointer at $0+$1 points to 32bit word at STABLE,X
;;LDX32 ; S0 in XREGISTER32
;;CLC
;;ADC32 ; TEMP2 in INPUT32
;A32 -> TEMP1
;;STATEMP1 ; temp2 to TEMP1
ROTATE
* Vh := Vg
* Vg := Vf
* Vf := Ve
; Store VG in VH
VXTOVY 6;7
VXTOVY 5;6
VXTOVY 4;5
* Ve := Vd + temp1
LDVADDT0STA 3
;TEMP0 -> X32
;;LDX TEMPLO
;;STX $00
;;LDX TEMPHI
;;STX $01 ; now word/pointer at $0+$1 points to TEMP0
;;LDXADC32
;;LDVSTA 4
* Vd := Vc
* Vc := Vb
* Vb := Va
VXTOVY 2;3
VXTOVY 1;2
VXTOVY 0;1
* Va := temp1 + temp2
;TEMP1 -> X32
;;LDX TEMPLO+1
;;STX $00
;;LDX TEMPHI+1
;;STX $01 ; now word/pointer at $0+$1 points to TEMP1
;;LDX32 ; load TEMP1 into XREGISTER32
;TEMP0 -> A32
LDATEMP0ADD 0
;;CLC
;;ADC32
;;LDVSTA 0
COMPRESSLOOP TXA ; round offset (i*4) back from X
CLC
ADC #$04
;;CMP #$40
BEQ ADDHASH ; A wraps to 0 after round 63 (252+4 = 256): all 64 rounds done
JMP COMPRESS
**************************************************
* END MAIN LOOP.
* FINALIZE HASH AND OUTPUT.
**************************************************
ADDHASH
* Add the compressed chunk to the current hash value:
* h0 := h0 + Va
* h1 := h1 + Vb
* h2 := h2 + Vc
* h3 := h3 + Vd
* h4 := h4 + Ve
* h5 := h5 + Vf
* h6 := h6 + Vg
* h7 := h7 + Vh
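* Each 32-bit word is stored most significant byte first, so the add below
* starts at byte +3 and lets the carry ripple up to byte +0.
* e.g. h0 = $00 00 00 FF plus Va = $00 00 00 01:
*   byte +3: $FF + $01 = $00, carry set
*   byte +2: $00 + $00 + carry = $01
*   result: $00 00 01 00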
]varbyte = 0
LUP 8
CLC
LDA H00+3 + ]varbyte
ADC VA+3 + ]varbyte
STA H00+3 + ]varbyte
LDA H00+2 + ]varbyte
ADC VA+2 + ]varbyte
STA H00+2 + ]varbyte
LDA H00+1 + ]varbyte
ADC VA+1 + ]varbyte
STA H00+1 + ]varbyte
LDA H00 + ]varbyte
ADC VA + ]varbyte
STA H00 + ]varbyte
]varbyte = ]varbyte + 4
--^
; if HASHCACHED == 0
; AND chunk=0 AND pass=0
; then write to CACHEDHASH
CHECKCHUNK LDA CURRENTCHUNK
BNE CHECKPASS ; did I just do chunk 0? INC and go back and do second chunk.
INC CURRENTCHUNK ; set to chunk 1
LDA HASHCACHED ; has chunk0 pass0 already been done?
BEQ HASHTOCACHE ; not cached yet: save the chunk 0 result first
JMP COPYCHUNKS ; already cached: go do chunk 1
CHECKPASS LDA HASHPASS ; pass 0? set the message to the hash output and go again
BEQ INCHASHPASS ; pass 0: go hash the hash
JMP DIGEST ; pass 1: done, skip to digest
INCHASHPASS INC HASHPASS ;
JMP HASHTOMESSAGE
HASHTOCACHE
]cachebyte = 0
LUP 32
LDA H00 + ]cachebyte
STA CACHEDHASH + ]cachebyte
]cachebyte = ]cachebyte+1
--^
INC HASHCACHED ; don't repeat.
JMP COPYCHUNKS ;
HASHTOMESSAGE
; for each of 32 bytes, Y
; load byte from H00,Y
; store at MESSAGE2,Y
COPYHASH
]hashbyte = 31
LUP 32
LDA H00 + ]hashbyte
STA MESSAGE2 + ]hashbyte
]hashbyte = ]hashbyte - 1
--^
LDA #<MESSAGE2
STA CURRENTMESSAGELO
LDA #>MESSAGE2
STA CURRENTMESSAGEHI
******* only need one chunk for message2
LDA #$00
STA CURRENTCHUNK
JMP INITIALIZEHASH ; re-initializes the original sqrt hash values for pass 2
DIGEST ; done the thing.
LDA #$06 ; set the memory location for line $14.
STA $29 ;
LDA #$50 ;
STA $28 ;
LDY #$00 ; 0
PRNONCE
]hashbyte = 0
LUP 4
LDX NONCE + ]hashbyte ; load from table pointer
PRHEX ; PRBYTE - clobbers Y
;**** ROLL MY OWN?
]hashbyte = ]hashbyte + 1
--^
INCNONCE
INC $29 ; 0650 -> 0750
PRDIGEST
LDX H00
BNE PRBYTE1
; if zero, spin the coin
JSR FLIPCOIN
LDX H00
PRBYTE1 LDY #$00 ; 0
PRHEX
]hashbyte = 1
LUP 19
LDX H00 + ]hashbyte
PRHEX
]hashbyte = ]hashbyte + 1
--^
NEXTLINE LDA #$D0
STA $28 ; $0750 to $07D0
LDY #$00 ; 0
]hashbyte = 20
LUP 12
LDX H00 + ]hashbyte
PRHEX
]hashbyte = ]hashbyte + 1
--^
JMP PREPROCESS ; INC NONCE, start over.
DONEWORK ; processed all the 2^32 nonce values. WTF?
RTS
**************************************************
* macros (expanded at assembly time)
**************************************************
;;LDW MAC
;; LDA WTABLELO,X ; takes X as argument
;; STA $00
;; LDA WTABLEHI,X
;; STA $01 ; now word/pointer at $0+$1 points to 32bit word at WTABLE,X
;; <<< ; End of Macro
;;LDK MAC
;; LDA KTABLELO,X ; takes X as argument
;; STA $00
;; LDA KTABLEHI,X
;; STA $01 ; now word/pointer at $0+$1 points to 32bit word at KTABLE,X
;; <<< ; End of Macro
; LDH MAC
; LDA HTABLELO,X ; takes X as argument
; STA $00
; LDA HTABLEHI,X
; STA $01 ; now word/pointer at $0+$1 points to 32bit word at HTABLE,X
; <<< ; End of Macro
;;LDV MAC
;; LDA VTABLELO,X ; takes X as argument
;; STA $00
;; LDA VTABLEHI,X
;; STA $01 ; now word/pointer at $0+$1 points to 32bit word at VTABLE,X
;; <<< ; End of Macro
;;LDVV MAC
;; LDA VTABLELO+]1 ; takes X as argument
;; STA $00
;; LDA VTABLEHI+]1
;; STA $01 ; now word/pointer at $0+$1 points to 32bit word at VTABLE,X
;; <<< ; End of Macro
LDVLDXR32 MAC
LDA VA + ]1 + ]1 + ]1 + ]1 ; load from table pointer
LSR
STA XREGISTER32 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 +1 ; load from table pointer
ROR
STA XREGISTER32+1 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 +2 ; load from table pointer
ROR
STA XREGISTER32+2 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 +3 ; load from table pointer
ROR
STA XREGISTER32+3 ; store in 32 bit "accumulator"
LDA #$00 ; accumulator to 0
ROR ; CARRY into bit7
ORA XREGISTER32 ; accumulator bit7 into BIT31
<<< ; End of Macro
LDVADDT0STA MAC ; Ve := V]1 + TEMP0 (32-bit add, least significant byte first)
CLC
LDA VA + ]1 + ]1 + ]1 + ]1 +3 ; load from table pointer
ADC TEMP0 +3
STA VA + 16 +3 ; store in Ve (VA + 16)
LDA VA + ]1 + ]1 + ]1 + ]1 +2 ; load from table pointer
ADC TEMP0 +2
STA VA + 16 +2 ; store in Ve
LDA VA + ]1 + ]1 + ]1 + ]1 +1 ; load from table pointer
ADC TEMP0 +1
STA VA + 16 +1 ; store in Ve
LDA VA + ]1 + ]1 + ]1 + ]1 ; load from table pointer
ADC TEMP0
STA VA + 16 ; store in Ve
<<< ; End of Macro
LDVLDX MAC
LDA VA + ]1 + ]1 + ]1 + ]1 +3 ; load from table pointer
STA XREGISTER32+3 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 +2 ; load from table pointer
STA XREGISTER32+2 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 +1 ; load from table pointer
STA XREGISTER32+1 ; store in 32 bit "accumulator"
LDA VA + ]1 + ]1 + ]1 + ]1 ; load from table pointer
STA XREGISTER32 ; store in 32 bit "accumulator"
<<< ; End of Macro
LDVSTA MAC ; store INPUT32 in V]1
LDA INPUT32+3 ; load from 32 bit "accumulator"
STA VA + ]1 + ]1 + ]1 + ]1 +3 ; store in table pointer
LDA INPUT32+2 ; load from 32 bit "accumulator"
STA VA + ]1 + ]1 + ]1 + ]1 +2 ; store in table pointer
LDA INPUT32+1 ; load from 32 bit "accumulator"
STA VA + ]1 + ]1 + ]1 + ]1 +1 ; store in table pointer
LDA INPUT32 ; load from 32 bit "accumulator"
STA VA + ]1 + ]1 + ]1 + ]1 ; store in table pointer
<<< ; End of Macro
VXTOVY MAC ; copy V]1 into V]2 (e.g. VXTOVY 6;7 copies Vg into Vh)
LDA VA + ]1+ ]1+ ]1+ ]1 ; load from table pointer
STA VA + ]2+ ]2+ ]2+ ]2 ; store in table pointer
LDA VA + ]1+ ]1+ ]1+ ]1 + 1 ; load from table pointer
STA VA + ]2+ ]2+ ]2+ ]2 + 1 ; store in table pointer
LDA VA + ]1+ ]1+ ]1+ ]1 + 2 ; load from table pointer
STA VA + ]2+ ]2+ ]2+ ]2 + 2 ; store in table pointer
LDA VA + ]1+ ]1+ ]1+ ]1 + 3 ; load from table pointer
STA VA + ]2+ ]2+ ]2+ ]2 + 3 ; store in table pointer
<<< ; End of Macro
LDXWR15 MAC ; X indicates which W0x word to read from
LDA W00 - 60,X ; load from table pointer
LSR
STA XREGISTER32 ; store in 32 bit "accumulator"
LDA W00 + 1 - 60,X ; load from table pointer
ROR
STA XREGISTER32+1 ; store in 32 bit "accumulator"
LDA W00 + 2 - 60,X ; load from table pointer
ROR
STA XREGISTER32+2 ; store in 32 bit "accumulator"
LDA W00 + 3 - 60,X ; load from table pointer
ROR
STA XREGISTER32+3 ; store in 32 bit "accumulator"
LDA #$00 ; accumulator to 0
ROR ; CARRY into bit7
ORA XREGISTER32 ; accumulator bit7 into BIT31
<<<
LDXWR2 MAC ; X indicates which W0x word to read from
LDA W00 + 2 - 8,X ; load from table pointer
LSR
STA XREGISTER32 ; store in 32 bit "accumulator"
LDA W00 + 3 - 8,X ; load from table pointer
ROR
STA XREGISTER32+1 ; store in 32 bit "accumulator"
LDA W00 - 8,X ; load from table pointer
ROR
STA XREGISTER32+2 ; store in 32 bit "accumulator"
LDA W00 + 1 - 8,X ; load from table pointer
ROR
STA XREGISTER32+3 ; store in 32 bit "accumulator"
LDA #$00 ; accumulator to 0
ROR ; CARRY into bit7
ORA XREGISTER32 ; accumulator bit7 into BIT31
<<<
LDAWR15 MAC ; X indicates which W0x word to read from
LDA W00 - 60,X ; load from table pointer
LSR
STA INPUT32 ; store in 32 bit "accumulator"
LDA W00 + 1 - 60,X ; load from table pointer
ROR
STA INPUT32+1 ; store in 32 bit "accumulator"
LDA W00 + 2 - 60,X ; load from table pointer
ROR
STA INPUT32+2 ; store in 32 bit "accumulator"
LDA W00 + 3 - 60,X ; load from table pointer
ROR
<<<
LDAW MAC ; X indicates which W0x word to read from
LDA W00 + 3,X ; load from table pointer
STA INPUT32+3 ; store in 32 bit "accumulator"
LDA W00 + 2,X ; load from table pointer
STA INPUT32+2 ; store in 32 bit "accumulator"
LDA W00 + 1,X ; load from table pointer
STA INPUT32+1 ; store in 32 bit "accumulator"
LDA W00,X ; load from table pointer
STA INPUT32 ; store in 32 bit "accumulator"
<<<
LDAWS248 MAC ; X indicates which W0x word to read from
LDA W00 - 8,X ; load from table pointer
LSR
STA INPUT32+1 ; store in 32 bit "accumulator"
LDA W00 + 1 - 8,X ; load from table pointer
ROR
STA INPUT32+2 ; store in 32 bit "accumulator"
LDA W00 + 2 - 8,X ; load from table pointer
ROR
<<<
LDWSTA32 MAC ; store INPUT32 in W0x word
LDA INPUT32+3 ; load from 32 bit "accumulator"
STA W00 + 3,X ; store in table pointer
LDA INPUT32+2 ; load from 32 bit "accumulator"
STA W00 + 2,X ; store in table pointer
LDA INPUT32+1 ; load from 32 bit "accumulator"
STA W00 + 1,X ; store in table pointer
LDA INPUT32 ; load from 32 bit "accumulator"
STA W00,X ; store in table pointer
<<<
STA32 MAC ; puts 4 bytes from 32 bit "accumulator" INPUT32 into ($01,$00), clobbers A,Y
LDY #$03
LDA INPUT32+3 ; load from 32 bit "accumulator"
STA ($0),Y ; store in table pointer
LDY #$02
LDA INPUT32+2 ; load from 32 bit "accumulator"
STA ($0),Y ; store in table pointer
LDY #$01
LDA INPUT32+1 ; load from 32 bit "accumulator"
STA ($0),Y ; store in table pointer
LDY #$00
LDA INPUT32 ; load from 32 bit "accumulator"