Original implementation: Multiplying 1.0 * 2.0 = 2.0, took 707 cycles Multiplying ff.ff * ff.ff = 0.0, took 761 cycles Cycles: flying= 162 Cycles: getkey= 46 Cycles: page_flip= 26 Cycles: multiply= 88,179 Cycles: mode7= 76,077 Cycles: lookup_map= 33,920 Cycles: put_sprite= 2,561 ================================== Total = 200,971 Frame Rate = 4.98 fps Update Multiply to use zero page addresses: Multiplying 1.0 * 2.0 = 2.0, took 616 cycles Multiplying ff.ff * ff.ff = 0.0, took 664 cycles Cycles: flying= 162 Cycles: getkey= 46 Cycles: page_flip= 26 Cycles: multiply= 76,561 Cycles: mode7= 76,077 Cycles: lookup_map= 33,920 Cycles: put_sprite= 2,561 =================================== Total = 189,353 Frame Rate = 5.28 fps Update to use "fast multiply" w 2kB squares table lookup: Multiplying 1.0 * 2.0 = 2.0, took 228 cycles Multiplying ff.ff * ff.ff = 0.0, took 272 cycles Cycles: flying= 162 Cycles: getkey= 46 Cycles: page_flip= 26 Cycles: multiply= 27,041 Cycles: mode7= 76,077 Cycles: lookup_map= 33,920 Cycles: put_sprite= 2,561 ================================= Total = 139,833 Frame Rate = 7.15 fps Update to optimize fast multiply (reusing NUM1H, return results in register) Multiplying 1.0 * 2.0 = 2.0, took 234 cycles Multiplying ff.ff * ff.ff = 0.0, took 278 cycles Cycles: flying= 162 Cycles: getkey= 46 Cycles: page_flip= 26 Cycles: multiply= 24,935 Cycles: mode7= 73,925 Cycles: lookup_map= 33,920 Cycles: put_sprite= 2,561 ================================= Total = 135,575 Frame Rate = 7.38 fps Add a cache to lookup_map Cycles: flying= 162 Cycles: getkey= 46 Cycles: page_flip= 26 Cycles: multiply= 24,935 Cycles: mode7= 73,445 Cycles: lookup_map= 24,649 Cycles: put_sprite= 2,561 ================================= Total = 125,824 Frame Rate = 7.95 fps