dos33fsprogs/tfv/OPTIMIZATION
Vince Weaver 91706259a7 tfv: use the re-arranged multiply register results to optimize
cycle count actually gets worse, but that was due to a bug in
the cycle counting missing two of the add routines
2017-11-30 00:37:57 -05:00

231 lines
7.0 KiB
Plaintext

Original implementation:
Multiplying 1.0 * 2.0 = 2.0, took 707 cycles
Multiplying ff.ff * ff.ff = 0.0, took 761 cycles
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 88,179
Cycles: mode7= 76,077
Cycles: lookup_map= 33,920
Cycles: put_sprite= 2,561
==================================
Total = 200,971
Frame Rate = 4.98 fps
Update Multiply to use zero page addresses:
Multiplying 1.0 * 2.0 = 2.0, took 616 cycles
Multiplying ff.ff * ff.ff = 0.0, took 664 cycles
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 76,561
Cycles: mode7= 76,077
Cycles: lookup_map= 33,920
Cycles: put_sprite= 2,561
===================================
Total = 189,353
Frame Rate = 5.28 fps
Update to use "fast multiply" w 2kB squares table lookup:
Multiplying 1.0 * 2.0 = 2.0, took 228 cycles
Multiplying ff.ff * ff.ff = 0.0, took 272 cycles
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 27,041
Cycles: mode7= 76,077
Cycles: lookup_map= 33,920
Cycles: put_sprite= 2,561
=================================
Total = 139,833
Frame Rate = 7.15 fps
Update to optimize fast multiply (reusing NUM1H, return results in register)
Multiplying 1.0 * 2.0 = 2.0, took 234 cycles
Multiplying ff.ff * ff.ff = 0.0, took 278 cycles
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 24,935
Cycles: mode7= 73,925
Cycles: lookup_map= 33,920
Cycles: put_sprite= 2,561
=================================
Total = 135,575
Frame Rate = 7.38 fps
Add a cache to lookup_map
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 24,935
Cycles: mode7= 73,445
Cycles: lookup_map= 24,649
Cycles: put_sprite= 2,561
=================================
Total = 125,824
Frame Rate = 7.95 fps
Don't draw sky every frame
Cycles: flying= 162
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 24,935
Cycles: mode7= 69,099
Cycles: lookup_map= 24,649
Cycles: put_sprite= 2,561
=================================
Total = 121,478
Frame Rate = 8.23 fps
Move checking if over water out of critical section
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 24,935
Cycles: mode7= 54,374
Cycles: lookup_map= 24,712
Cycles: put_sprite= 2,561
=================================
Total = 106,841
Frame Rate = 9.36 fps
Move to 40x40 mode
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 49,613
Cycles: mode7= 123,701
Cycles: lookup_map= 48,847
Cycles: put_sprite= 2,561
=================================
Total = 224,981
Frame Rate = 4.44 fps
Remove some unnecessary zero page copies in the mode7 code
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 49,613
Cycles: mode7= 141,269
Cycles: lookup_map= 21,718
Cycles: put_sprite= 2,561
================================
Total = 215,420
Frame Rate = 4.64 fps
A few more minor cleanups in the Y loop
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 49,613
Cycles: mode7= 140,858
Cycles: lookup_map= 21,718
Cycles: put_sprite= 2,561
===============================
Total = 215,009
Frame Rate = 4.65 fps
Add some self-modifying code to inner loop:
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 49,613
Cycles: mode7= 131,610
Cycles: lookup_map= 21,718
Cycles: put_sprite= 2,561
================================
Total = 205,761
Frame Rate = 4.86 fps
More self-modifying code, also move SCREEN_X to X register
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 49,613
Cycles: mode7= 118,034
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 193,214
Frame Rate = 5.18 fps
Remove unneeded precision in the 8.8 x 8.8 fixed point multiply
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 43,588
Cycles: mode7= 118,034
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 187,189
Frame Rate = 5.34 fps
In-line unsigned multiply inside of signed multiply (save 12 cycles)
Multiplying 1.0 * 2.0 = 2.0, took 198 cycles
Multiplying ff.ff * ff.ff = 0.0, took 218 cycles
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 40,888
Cycles: mode7= 118,034
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 184,489
Frame Rate = 5.42 fps
Have loop counter count down from 40 instead of count up (avoid compare)
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 40,888
Cycles: mode7= 115,538
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 181,993
Frame Rate = 5.49 fps
Move spacez updates out of line and also do some self modifying code
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 40,680
Cycles: mode7= 114,830
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 181,077
Frame Rate = 5.52 fps
Re-arranged multiply result register to allow more optimization.
This looks like a pessimization, but it's because the cycle counting code
had been undercounting and missed a few add routines :(
Cycles: flying= 187
Cycles: getkey= 46
Cycles: page_flip= 26
Cycles: multiply= 40,680
Cycles: mode7= 115,470
Cycles: lookup_map= 22,747
Cycles: put_sprite= 2,561
================================
Total = 181,717
Frame Rate = 5.50 fps
Each cycle removed from inner X loop saves
32*40=1280 cycles