Original implementation:
    Multiplying 1.0 * 2.0 = 2.0, took 707 cycles
    Multiplying ff.ff * ff.ff = 0.0, took 761 cycles

    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     88,179
    Cycles: mode7=        76,077
    Cycles: lookup_map=   33,920
    Cycles: put_sprite=    2,561
    ============================
    Total =              200,971
    Frame Rate =        4.98 fps

Update Multiply to use zero page addresses:
    Multiplying 1.0 * 2.0 = 2.0, took 616 cycles
    Multiplying ff.ff * ff.ff = 0.0, took 664 cycles

    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     76,561
    Cycles: mode7=        76,077
    Cycles: lookup_map=   33,920
    Cycles: put_sprite=    2,561
    ============================
    Total =              189,353
    Frame Rate =        5.28 fps

Update to use "fast multiply" w 2kB squares table lookup:
    Multiplying 1.0 * 2.0 = 2.0, took 228 cycles
    Multiplying ff.ff * ff.ff = 0.0, took 272 cycles

    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     27,041
    Cycles: mode7=        76,077
    Cycles: lookup_map=   33,920
    Cycles: put_sprite=    2,561
    ============================
    Total =              139,833
    Frame Rate =        7.15 fps

Update to optimize fast multiply (reusing NUM1H, return results in register)
    Multiplying 1.0 * 2.0 = 2.0, took 234 cycles
    Multiplying ff.ff * ff.ff = 0.0, took 278 cycles

    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     24,935
    Cycles: mode7=        73,925
    Cycles: lookup_map=   33,920
    Cycles: put_sprite=    2,561
    ============================
    Total =              135,575
    Frame Rate =        7.38 fps

Add a cache to lookup_map
    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     24,935
    Cycles: mode7=        73,445
    Cycles: lookup_map=   24,649
    Cycles: put_sprite=    2,561
    ============================
    Total =              125,824
    Frame Rate =        7.95 fps

Don't draw sky every frame
    Cycles: flying=          162
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     24,935
    Cycles: mode7=        69,099
    Cycles: lookup_map=   24,649
    Cycles: put_sprite=    2,561
    ============================
    Total =              121,478
    Frame Rate =        8.23 fps

Move checking if over water out of critical section
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     24,935
    Cycles: mode7=        54,374
    Cycles: lookup_map=   24,712
    Cycles: put_sprite=    2,561
    ============================
    Total =              106,841
    Frame Rate =        9.36 fps

Move to 40x40 mode
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     49,613
    Cycles: mode7=       123,701
    Cycles: lookup_map=   48,847
    Cycles: put_sprite=    2,561
    ============================
    Total =              224,981
    Frame Rate =        4.44 fps

Remove some unnecessary zero page copies in the mode7 code
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     49,613
    Cycles: mode7=       141,269
    Cycles: lookup_map=   21,718
    Cycles: put_sprite=    2,561
    ============================
    Total =              215,420
    Frame Rate =        4.64 fps

A few more minor cleanups in the Y loop
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     49,613
    Cycles: mode7=       140,858
    Cycles: lookup_map=   21,718
    Cycles: put_sprite=    2,561
    ============================
    Total =              215,009
    Frame Rate =        4.65 fps

Add some self-modifying code to inner loop:
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     49,613
    Cycles: mode7=       131,610
    Cycles: lookup_map=   21,718
    Cycles: put_sprite=    2,561
    ============================
    Total =              205,761
    Frame Rate =        4.86 fps

More self-modifying code, also move SCREEN_X to X register
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     49,613
    Cycles: mode7=       118,034
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              193,214
    Frame Rate =        5.18 fps

Remove unneeded precision in the 8.8 x 8.8 fixed point multiply
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     43,588
    Cycles: mode7=       118,034
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              187,189
    Frame Rate =        5.34 fps

In-line unsigned multiply inside of signed multiply (save 12 cycles)
    Multiplying 1.0 * 2.0 = 2.0, took 198 cycles
    Multiplying ff.ff * ff.ff = 0.0, took 218 cycles

    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     40,888
    Cycles: mode7=       118,034
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              184,489
    Frame Rate =        5.42 fps

Have loop counter count down from 40 instead of count up (avoid compare)
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     40,888
    Cycles: mode7=       115,538
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              181,993
    Frame Rate =        5.49 fps

Move spacez updates out of line and also do some self modifying code
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     40,680
    Cycles: mode7=       114,830
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              181,077
    Frame Rate =        5.52 fps

Re-arranged multiply result register to allow more optimization.
This looks like a pessimization, but it's because the cycle counting
code had been undercounting and missed a few add routines :(
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     40,680
    Cycles: mode7=       115,150
    Cycles: lookup_map=   22,747
    Cycles: put_sprite=    2,561
    ============================
    Total =              181,397
    Frame Rate =        5.51 fps

Make lookup_map inline. Again it looks like an impressive speedup
but some of this was fixing the cycle count estimates.
    Cycles: flying=          187
    Cycles: getkey=           46
    Cycles: page_flip=        26
    Cycles: multiply=     40,680
    Cycles: mode7=       111,882
    Cycles: lookup_map=   19,872
    Cycles: put_sprite=    2,561
    ============================
    Total =              175,254
    Frame Rate =        5.71 fps

Each cycle removed from inner X loop saves 32*40=1280 cycles
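
The arithmetic behind the totals above can be sketched in a few lines of
Python. This is only an illustration: the log does not state the clock
rate, but its frame rates are consistent with dividing 1,000,000 cycles
per second by the per-frame total, so that value is assumed here. The
function and parameter names are hypothetical and do not come from the
original code.

```python
# Assumption: the log's fps figures match a clock of 1,000,000 cycles/second
# (inferred from the numbers; the log itself does not state the clock rate).
ASSUMED_CLOCK_HZ = 1_000_000


def frame_rate(total_cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Frames per second when one frame costs total_cycles cycles."""
    return clock_hz / total_cycles


def inner_loop_saving(cycles_removed, outer_iterations=32, inner_iterations=40):
    """Cycles saved per frame when the inner X loop body gets shorter.

    The 32 and 40 defaults are the factors quoted in the log (32*40).
    """
    return cycles_removed * outer_iterations * inner_iterations


if __name__ == "__main__":
    # Per-routine cycle counts from the "Original implementation" entry.
    original = [162, 46, 26, 88_179, 76_077, 33_920, 2_561]
    total = sum(original)
    print(f"Total = {total:,}  Frame Rate = {frame_rate(total):.2f} fps")

    # One cycle shaved from the inner X loop, as in the log's last line.
    print(f"Saved per frame: {inner_loop_saving(1)} cycles")
```

Under the same assumption, a 1280-cycle saving on the final 175,254-cycle
frame is well under 1% of the frame time, which is why the log shows many
small inner-loop changes adding up gradually.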