Challenges found writing an 8k Lores Apple II Demo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   by DEATER (Vince Weaver, vince@deater.net)

  http://www.deater.net/weave/vmwprod/mode7_demo/
====================================================
                19 March 2018

GOAL:
~~~~~
  This started out as some SNES style mode7 pseudo-3d graphics code
  I came up with while working on my TFV game.  The graphics looked
  pretty cool, so I started developing a demo around it.

  The codesize ended up being roughly 8kB, so I thought I'd
  make it into an 8k demo.  There aren't many out there for the Apple II.
  The target is an Apple II with 48k of RAM and a Mockingboard sound card.

  The demo tries to hit the lowest common denominator for Apple II systems,
  so in theory you could have run this on an Apple II in 1977 if you
  were rich enough to afford 48k of RAM.  The Mockingboard sound wasn't
  available until 1981, but still this all predates the Commodore 64.

USING:
~~~~~~
   Boot disk on a real system, or emulator with Mockingboard support.

   AppleWin works fine (even under Wine on Linux).
   MESS does too; it's harder to set up (you need ROMs) but the audio
   sounds clearer.

   If you have no emulator you can try one of the online javascript ones.
	https://www.scullinsteel.com/apple2/


HARDWARE:
~~~~~~~~~
	The Apple II has a 6502 processor running at roughly 1.023MHz.

	Early models only shipped with 4k of RAM, but later 48k, 64k, and 128k
	systems were common.

	The most common disk drive was the Disk II which typically held
	140k of data (single-sided).

	The only sound available was a bit-banged speaker.  There was no
	timer, so if you wanted music you had to cycle-count on the CPU.

	Later some sound cards were available.  This demo uses the
	Mockingboard which has dual AY-3-8910 sound chips.  Each
	chip provides 3 channels of square waves, with noise and
	envelope effects available.

	GRAPHICS
	~~~~~~~~

	The Apple II had nice graphics for its time, with this time being
	around 1977.  Otherwise it is quite limited.
		Hardware Sprites?	No
		Linear framebuffer?	No
		User-defined charset?	No
		Blanking interrupts?	No
		Palette selection?	No
		Hardware scrolling?	No
		Hardware page flip?	Yes

	The hi-res graphics mode was a complex mess of NTSC hacks by Woz.
	You got 280x192 graphics, with 6 colors available.  However the colors
	were from NTSC artifacts and there were limitations on which colors
	could be next to each other (in blocks of 3.5 pixels) as well as
	fringing.  Also the addresses were interleaved, so not a linear
	framebuffer.  Hi-res page0 is at $2000 and page1 at $4000.
	Optionally 4 lines of text can be shown at the bottom of the
	screen instead of graphics.

	The lo-res mode is a bit easier to use.  It is 40x48 blocks
	(40x40 if 4 lines of text are displayed at the bottom).
	15 colors are available, though there is fringing at the edges.
	Again the addresses are interleaved.  Lo-res page0 is at $400
	and page1 is at $800.
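
	A minimal sketch (in Python, not the demo's 6502 code) of how the
	interleaved lo-res addresses work out.  The helper names are made
	up for illustration; the row formula is the usual Apple II text
	screen layout, with two lo-res rows packed into each byte.

```python
# Sketch of the lo-res address interleave (helper names are illustrative).
LORES_PAGE0 = 0x400   # page addresses as given in the text
LORES_PAGE1 = 0x800

def lores_row_address(text_row, base=LORES_PAGE0):
    """Address of the first byte of text row 0..23 (two lo-res rows each)."""
    return base + 0x80 * (text_row % 8) + 0x28 * (text_row // 8)

def lores_plot(memory, x, y, color, base=LORES_PAGE0):
    """Set block (x, y), x in 0..39, y in 0..47, to a 4-bit color."""
    addr = lores_row_address(y // 2, base) + x
    if y % 2 == 0:
        # even rows live in the low nibble (top half of the text cell)
        memory[addr] = (memory[addr] & 0xF0) | (color & 0x0F)
    else:
        # odd rows live in the high nibble (bottom half of the text cell)
        memory[addr] = (memory[addr] & 0x0F) | ((color & 0x0F) << 4)

memory = bytearray(0x10000)
lores_plot(memory, 0, 0, 0xF)     # top-left block
lores_plot(memory, 39, 47, 0x1)   # bottom-right block
```

	Note that consecutive rows are $80 apart, not 40 bytes apart, which
	is why a plain linear copy to the screen does not work.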

========================================
DETAILED STEP-BY-STEP REVIEW OF THE DEMO
========================================

 BOOTLOADER
 ~~~~~~~~~~
   A BASIC "HELLO" program loads the binary.
   This just makes things auto-boot at startup.  It doesn't count
   towards the executable size; you could manually BRUN the 8k program
   if you wanted.

   The binary is loaded at $2000 (hi-res page0) and BASIC kicks into
   HIRES mode before loading so you can watch as the memory is loaded
   from disk in a seemingly random pattern.

   Since this is an 8k demo, the entirety of the program is shown on
   the screen (or would be if we POKEd the right address to turn off
   the 4 lines of text on the bottom of the screen).

   Execution starts at address $2000

 DECOMPRESSOR
 ~~~~~~~~~~~~
   The binary is LZ4 encoded.  The decompressor flips to HGR page 1 so
   we can watch memory as the program is decompressed.

   The LZ4 decompression code was written by qkumba (Peter Ferrie).
	http://pferrie.host22.com/misc/appleii.htm

   The actual program/data decompresses to around 22k starting at $4000.
   It over-writes parts of DOS3.3, but since we won't be using the disk 
   anymore this isn't an issue.

   At the top left corner of the screen you'll see the VMW triangles logo
   as it decompresses.  To do this I had to put the proper bit pattern
   at $4000, $4400, $4800, and $4C00.  I meant to have some words too
   but ran out of disk space.  The bit pattern at $4000 is executable
   and is run as code.

   Optimizing for code size inside of a compressed binary is a pain.
   Removing instructions sometimes made the binary larger as it no longer
   compressed as well.  Long runs of values (such as 0 padding) are 
   essentially free.  This was a difficult challenge.

FADE EFFECT
~~~~~~~~~~~
  The title screen fades in from black.

  This is a software hack, with a lookup table copying from an off-screen
  buffer.  The Apple II doesn't have any palette support.
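
  A rough sketch of a table-driven fade, in Python rather than 6502
  code.  The DARKER table is a made-up ordering for illustration (the
  demo's real table is not given here); the point is only that each
  4-bit color maps to a dimmer one, and both nibbles of every byte are
  translated while copying from the off-screen buffer.

```python
# Hypothetical "one step darker" table indexed by lo-res color 0..15.
# The ordering is a guess for illustration, not the demo's real table.
DARKER = [0, 0, 0, 2, 0, 0, 2, 6, 0, 8, 5, 1, 4, 9, 12, 10]

def darken_byte(value):
    """Darken both 4-bit blocks packed in one lo-res byte."""
    return (DARKER[value >> 4] << 4) | DARKER[value & 0x0F]

def fade_frame(offscreen, steps):
    """Return a copy of the off-screen image darkened `steps` times."""
    frame = bytes(offscreen)
    for _ in range(steps):
        frame = bytes(darken_byte(b) for b in frame)
    return frame

def fade_in(offscreen, max_steps=3):
    """Yield frames going from fully darkened up to the real image."""
    for steps in range(max_steps, -1, -1):
        yield fade_frame(offscreen, steps)

frames = list(fade_in(bytes([0xFF, 0x3E, 0x00])))
```

  With this particular table every color reaches black within three
  steps, so the first frame is all black and the last is the image.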

TITLE SCREEN
~~~~~~~~~~~~
   Once things are decompressed, we jump to $4000.  We switch to lo-res
   mode for the rest of the demo.

   A background image is loaded from disk.  This is RLE encoded (probably
   unnecessary when being further LZ4 encoded).

   Why not just load the program at $400 and load the graphics image for
   free?  Well, remember the graphics are 40x48 (shared with the text).
   Really it's 40x24, with each text char mapping to 4-bits top/bottom
   for color.  Do the math, we have 1k reserved for this mode but 40x24
   is only 960 bytes.  It turns out there are "holes" in the address range
   that aren't displayed, and various pieces of hardware use these holes
   as scratchpad memory.  So if you just blindly uncompress graphics data
   there you can corrupt the scratchpad.  So you have to be careful
   when uncompressing to skip the holes.
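
   The holes can be worked out from the row layout.  This is a small
   Python sketch (the function names are invented for illustration)
   that lists the 64 undisplayed bytes in a 1k page and copies a linear
   960-byte image while leaving them alone.

```python
def lores_row_address(text_row, base=0x400):
    """Address of the first byte of text row 0..23."""
    return base + 0x80 * (text_row % 8) + 0x28 * (text_row // 8)

def visible_addresses(base=0x400):
    """All 960 displayed addresses, in top-to-bottom, left-to-right order."""
    return [lores_row_address(r, base) + x
            for r in range(24) for x in range(40)]

def screen_holes(base=0x400):
    """Addresses inside the 1k page that are never displayed."""
    visible = set(visible_addresses(base))
    return [a for a in range(base, base + 0x400) if a not in visible]

def copy_image(memory, image, base=0x400):
    """Copy a linear 960-byte image into the page, never touching holes."""
    for addr, value in zip(visible_addresses(base), image):
        memory[addr] = value

memory = bytearray([0xEE]) * 0x10000      # 0xEE marks "untouched"
copy_image(memory, bytes(960))
holes = screen_holes()
```

   The holes come out as the last 8 bytes of every 128-byte block
   ($478-$47F, $4F8-$4FF, and so on), which accounts for the
   1024 - 960 = 64 missing bytes.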

   The title screen has scrolling text at the bottom.  This is nothing fancy,
   the text is in a buffer off screen and a 40x4 chunk of RAM is copied in
   every so many cycles.

   You might notice that there is tearing/jitter in the scrolling, even
   though we are double-buffering the graphics.  This is because there is
   not a reliable cross-platform way to get the VBLANK info (especially
   on older machines) so we are having some bad luck about when we flip
   pages.

MOCKINGBOARD MUSIC
~~~~~~~~~~~~~~~~~~
   I like chiptune music, especially that for AY-3-8910 based systems.
   Before obtaining a Mockingboard I built a Raspberry Pi chiptune player
   that is essentially the same hardware.

   Most of my sound infrastructure involves YM5 files, which are often used
   by ZX Spectrum and ATARI ST users.  These are register dumps, typically
   taken at 50Hz.  So to play them back you just have to interrupt
   50 times a second and write the registers.

   To program the Mockingboard, each AY-3-8910 chip has 14 sound related
   registers that control the 3 channels.  Each AY chip has a dedicated
   VIA 6522 parallel I/O chip that handles the I/O.

   Doing this quickly enough is a challenge on the Apple II.  For each
   register you have to do a handshake, set the register # and the value.
   This can take upwards of 40 1MHz cycles per register.
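
   As a sketch of why one register write is so expensive, here is the
   store sequence modeled in Python.  The addresses and control values
   are assumptions based on the commonly documented Mockingboard layout
   for slot 4 (6522 port B at $C400 driving the AY control lines, port A
   at $C401 as the data bus); they are not taken from this demo's source.

```python
# Assumed slot-4 Mockingboard layout (first 6522 / left AY chip).
MOCK_6522_ORB1 = 0xC400    # port B: AY control lines
MOCK_6522_ORA1 = 0xC401    # port A: AY data bus

# Assumed control values written to port B.
MOCK_AY_INACTIVE = 0x04
MOCK_AY_WRITE = 0x06
MOCK_AY_LATCH_ADDR = 0x07

def ay_write_sequence(register, value,
                      orb=MOCK_6522_ORB1, ora=MOCK_6522_ORA1):
    """Return the (address, byte) stores needed to set one AY register."""
    return [
        (ora, register),            # put register number on the bus
        (orb, MOCK_AY_LATCH_ADDR),  # tell the AY to latch it
        (orb, MOCK_AY_INACTIVE),    # release the bus
        (ora, value),               # put the new value on the bus
        (orb, MOCK_AY_WRITE),       # tell the AY to take it
        (orb, MOCK_AY_INACTIVE),    # release the bus again
    ]

stores = ay_write_sequence(7, 0x38)
```

   That is six stores per register, plus the loads that feed them, and
   the whole thing has to be repeated for the second chip.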

   For complex chiptune files (especially those written on an ST with much
   faster hardware) it's sometimes not possible to get exact playback
   due to the delay.  Also one AY is on the left channel and one on the right
   so you have to write both if you want sound from both speakers.

   I have a whole suite of code for manipulating YM sound data, in my
   vmw-meter git repository.

   The first step for getting this to work is detecting if a Mockingboard is
   there.  This can be in any slot 1-7 on the Apple II, though typically
   Slot 4 is standard (in this demo we only check slot 4).

   The board is initialized, and then one of the 6522 timers is set to
   interrupt at 25Hz (it has to be an on-board timer as the default
   Apple II has no timers).

   Why 25Hz and not 50Hz?  At 50Hz with 14 registers you use 700 bytes/s.
   So a 2 minute song would take 84k of RAM, much more than is available.

   For this demo I run at 25Hz, and also pack the 14 registers of the data
   into 11 (there are various fields that are not packed well, we can
   unpack at play time).  Also I stripped out the envelope data as many
   songs do not use it (so this is a lossy compression method).
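
   The arithmetic behind those numbers, as a quick Python check (raw
   register data only, before any further compression):

```python
def song_bytes(rate_hz, regs_per_frame, seconds):
    """Raw size of a register-dump song with no other compression."""
    return rate_hz * regs_per_frame * seconds

raw = song_bytes(50, 14, 120)      # 50Hz, 14 registers, 2 minutes
packed = song_bytes(25, 11, 120)   # 25Hz, 11 packed bytes, 2 minutes
```

   Here raw is 84000 bytes and packed is 33000 bytes, so halving the
   rate and packing the registers gets the data down to about 39% of
   the original size.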

   Also, we keep track of the last values written last frame and only
   write out to the board if things change, which helps with the latency
   a bit.
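
   A minimal sketch of the "only write what changed" idea in Python.
   The class and method names are invented for illustration; the real
   player is 6502 code and works on its own packed frame format.

```python
class AYShadow:
    """Remembers the last value sent to each AY register."""

    def __init__(self, num_registers=14):
        # None means "nothing written yet", so the first frame writes all.
        self.last = [None] * num_registers

    def frame_writes(self, frame):
        """Return the (register, value) pairs that differ from last frame."""
        writes = []
        for reg, value in enumerate(frame):
            if self.last[reg] != value:
                writes.append((reg, value))
                self.last[reg] = value
        return writes

shadow = AYShadow(num_registers=3)
first = shadow.frame_writes([1, 2, 3])    # everything is new
second = shadow.frame_writes([1, 5, 3])   # only register 1 changed
third = shadow.frame_writes([1, 5, 3])    # nothing changed
```

   Since every skipped register saves a full handshake, quiet passages
   of a song cost far less CPU time than busy ones.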

   The sound quality suffered a bit, but it's hard to fit a catchy chiptune
   file in 8K.

   The song being played is a stripped down and re-arranged version of
   "Electric Wave" from CC'00 by EA (Ilya Abrosimov). 


MODE7 BACKGROUND
~~~~~~~~~~~~~~~~
  "MODE7" was a Super Nintendo (SNES) graphics mode that took a tiled
  background and transformed it to look as if it was squashed out to
  the horizon, giving a 3d look.  The SNES did this in hardware, but
  in this demo we do this in software.

  As found on Wikipedia, the transform is of the type

  [x'] = [a b]([x]-[x0])+[x0]
  [y']   [c d]([y] [y0]) [y0]
  
  For our code, we managed to reduce things to a small number of additions
  and subtractions for each pixel on the screen.  Of course the 6502 can't
  do floating point, so we do fixed point math.  We convert as much as we
  can to table lookups that are pre-calculated.  We also make liberal use
  of self-modifying code.
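
  To give a feel for the "additions per pixel" structure, here is a
  generic floor-projection sketch in Python using 8.8 fixed point.  It
  is not the demo's algorithm: the function name, the scale constant
  and the 90 degree field of view are all assumptions, and the sin/cos
  calls stand in for what would be table lookups on a 6502.

```python
import math

def mode7_frame(tile_map, cam_x, cam_y, angle, height=4 * 256,
                width=40, rows=32, scale=8):
    """Render `rows` scanlines of a tiled floor; returns rows of tile values.

    tile_map is a square list of lists whose size is a power of two.
    cam_x, cam_y and height are 8.8 fixed point; angle is in radians.
    """
    mask = len(tile_map) - 1
    cos_a = int(math.cos(angle) * 256)   # a table lookup on real hardware
    sin_a = int(math.sin(angle) * 256)
    frame = []
    for line in range(1, rows + 1):
        # Lines close to the horizon (small `line`) are far away.
        distance = (height * scale) // line          # 8.8
        fwd_x = (cos_a * distance) >> 8
        fwd_y = (sin_a * distance) >> 8
        right_x = (-sin_a * distance) >> 8
        right_y = (cos_a * distance) >> 8
        # Start at the left edge of this scanline in map space...
        x = cam_x + fwd_x - right_x
        y = cam_y + fwd_y - right_y
        # ...and precompute the step between neighbouring pixels.
        dx = (2 * right_x) // width
        dy = (2 * right_y) // width
        row = []
        for _ in range(width):
            row.append(tile_map[(y >> 8) & mask][(x >> 8) & mask])
            x += dx      # the inner loop is just two additions
            y += dy
        frame.append(row)
    return frame

# 8x8 black/white checkerboard tileset (0 = black, 15 = white)
board = [[15 if (r + c) & 1 else 0 for c in range(8)] for r in range(8)]
frame = mode7_frame(board, 0, 0, 0.0)
```

  All of the expensive work (the divide and the multiplies) happens
  once per scanline, which is the part worth replacing with tables.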

  Despite all of this there are still some cases where we have to do a 
  16bit x 16bit = 32bit multiply, something that is *really* slow on 6502,
  around 700 cycles (for a 8.8 x 8.8 fixed point multiply).
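
  For reference, this is the shape of the slow approach: a classic
  shift-and-add loop, modeled here in Python.  It is a sketch of the
  general technique (unsigned values only, as a simplifying
  assumption), not the demo's routine.

```python
def mul16_shift_add(a, b):
    """Unsigned 16x16 -> 32 bit multiply, one bit of b at a time."""
    result = 0
    for _ in range(16):
        if b & 1:
            result += a      # add the shifted multiplicand
        a <<= 1
        b >>= 1
    return result & 0xFFFFFFFF

def fixed_mul_8_8(a, b):
    """Multiply two unsigned 8.8 values, giving an 8.8 result."""
    return (mul16_shift_add(a, b) >> 8) & 0xFFFF

product = fixed_mul_8_8(0x0180, 0x0200)   # 1.5 * 2.0
```

  The loop always runs 16 times, and on an 8-bit CPU every shift and
  add in it is itself a multi-byte operation, which is where the
  cycles go.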

  To make this faster we use a method described by Stephen Judd.

  The key to note is that (a+b)^2 = a^2+2ab+b^2 and (a-b)^2=a^2-2ab+b^2
  and if you subtract the second from the first you get 4ab, so:
		(a+b)^2      (a-b)^2
	a*b =  ---------  -  -------
                   4            4
  This means if you have a table of squares from 0..511 (all 8-bit a+b
  and |a-b| will fall in this range) then you can convert a multiply into
  two table lookups plus a subtract.

  The downside is you will need 2kB of squares lookup tables (which can
  be generated at startup).  This reduces the multiply cost to the order
  of 200 to 250 cycles.
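
  A Python sketch of the quarter-square trick.  The names are
  illustrative, and this version uses a single table plus abs() for
  simplicity, which is not necessarily how the 6502 tables are laid
  out.  Storing floor(n^2/4) is still exact because a+b and a-b always
  have the same parity, so the dropped fractions cancel.

```python
# floor(n*n/4) for every possible 8-bit sum, generated at startup
SQUARES = [(n * n) // 4 for n in range(512)]

def mul8(a, b):
    """Unsigned 8x8 -> 16 bit multiply: two lookups and a subtract."""
    return SQUARES[a + b] - SQUARES[abs(a - b)]

def mul16(a, b):
    """Unsigned 16x16 -> 32 bit multiply from four 8x8 partial products."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    return ((mul8(a_hi, b_hi) << 16)
            + ((mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8)
            + mul8(a_lo, b_lo))

check = mul16(0x1234, 0xABCD)
```

  The 16-bit version still needs four 8-bit multiplies and the adds to
  combine them, which is why the cost only drops to a few hundred
  cycles rather than a few dozen.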

  By using the fast multiply and a lot of careful optimization you can
  generate a Mode7 background in 40x40 graphics mode at about 5 frames/second.

  The engine can be parameterized with different tilesets to use, which we
  do to provide both a black+white checkerboard background, as well as the
  island background from the TFV game.

BOUNCING BALL ON CHECKERBOARD
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  What would a demo be without some sort of bouncing geometric shape?

  This is just done with 16 sprites.  The sphere was modeled in OpenGL
  using a 2000-era game engine that I never finished.  I then took
  screenshots and reduced the size/color to an appropriate value.

  The shadow is also just sprites.

  The clicking noise on bounce is just touching the speaker at $C030.
  It's mostly there to give some sound effects for those playing the demo
  without a Mockingboard.

TFV SPACESHIP FLYING
~~~~~~~~~~~~~~~~~~~~
  The spaceship, water splash, and shadows are all sprites.  This is all
  done in software, the Apple II has no sprite hardware.

  This is the TFV game engine flying-spaceship code, with the keyboard
  routines replaced to read from memory instead (sort of like a script
  of what to do when). 

STARFIELD
~~~~~~~~~
  The starfield is your typical starfield code.  Only 16 stars are modeled.
  It re-uses the fast-multiply code from the mode7 graphics.

  Random number generation is not fast on the 6502, so we cheat.
  Originally we had a 256-byte blob of "random" values generated earlier.

  This wasted space, so now instead we just treat the executable code
  at $5000 as if it were a block of random numbers.  This was arbitrarily
  chosen, I tried different areas of memory until I got one where the
  stars seemed to move in a pleasing pattern.
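
  A small Python sketch of the idea: any block of bytes can seed the
  stars, and each star is projected by dividing by its depth.  The
  function names, the byte layout and the scale factor are assumptions
  for illustration, and the plain division here stands in for whatever
  the demo does with its fast-multiply code.

```python
def make_stars(random_bytes, count=16):
    """Build [x, y, z] stars from three 'random' bytes each."""
    stars = []
    for i in range(count):
        x = random_bytes[3 * i] - 128               # -128..127
        y = random_bytes[3 * i + 1] - 128
        z = (random_bytes[3 * i + 2] & 0x3F) + 1    # 1..64, never zero
        stars.append([x, y, z])
    return stars

def project(star, width=40, height=40, scale=8):
    """Perspective-project a star; returns (sx, sy) or None if off-screen."""
    x, y, z = star
    sx = width // 2 + (x * scale) // z
    sy = height // 2 + (y * scale) // z
    if 0 <= sx < width and 0 <= sy < height:
        return (sx, sy)
    return None

def advance(stars, far=64):
    """Move every star one step closer, wrapping back to the far plane."""
    for star in stars:
        star[2] -= 1
        if star[2] < 1:
            star[2] = far

# Any byte blob works as the "random" source, e.g. a chunk of program code.
blob = bytes(range(48))
stars = make_stars(blob)
```

  Swapping in a different blob changes where the stars start, which is
  all "trying different areas of memory" amounts to.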

  A simple state machine controls if the stars move or not, whether the
  background is cleared or not (the streak effect) and what color the
  background is (for the blue flash).

  The ship moving to the distance is just done with different sized sprites.

RASTERBARS/CREDITS
~~~~~~~~~~~~~~~~~~

  The credits happen with the starfield continuing to run.

  The text is written in the bottom 4 lines of the screen.  Some inverse-mode
  space characters are used to try to make it look like graphics are surrounding
  the text.  It's actually possible with careful cycle counting to switch
  modes fast enough to have actual mixed graphics/text (See the FrenchTouch
  demos) but I was too lazy to attempt that here.

  The rasterbar effect isn't really rasterbars, it's just a rainbow assortment
  of lines being drawn with a SINEWAVE lookup table.
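
  A sketch of the lookup-table approach in Python.  The table size,
  amplitude, and the spacing between bars are invented for
  illustration; on the Apple II the table would be precomputed data
  rather than calls to sin().

```python
import math

TABLE_SIZE = 64     # hypothetical table length
AMPLITUDE = 15      # hypothetical swing in rows

SINE = [int(round(math.sin(2 * math.pi * i / TABLE_SIZE) * AMPLITUDE))
        for i in range(TABLE_SIZE)]

def bar_rows(frame, colors, center=20, spacing=4):
    """Return (row, color) for each bar; each bar lags the one before it."""
    return [(center + SINE[(frame + spacing * i) % TABLE_SIZE], color)
            for i, color in enumerate(colors)]

rows = bar_rows(0, [1, 9, 13, 12, 6, 3])
```

  Each frame just bumps the frame counter, so the per-bar cost is one
  table lookup plus drawing a horizontal line at that row.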

  It's the same rasterbar code from my chiptune player demo.  I ended up
  optimizing it a lot via inlining and a few other ways because it turned
  out just drawing a horizontal line can take a very long time.

  The rotating text is just taking the output string and rapidly rotating the
  character values through the ASCII table.

  The annoying clicking noise is the same speaker effect caused by hitting
  $C030.

  Choosing who to thank ended up being extremely critical to fitting in 8kB,
  as unique text strings do not compress well.  I'm also still not satisfied
  with how the centering looks.



Memory Map
==========

(not to scale)

 --------  $ffff
| ROM/IO |
 --------  $c000
|        |      32k decompress
 --------  $4000
|  load  |      8k
 --------  $2000
|  free  |
 --------  $1c00
| Scroll |
|  Data  |
 --------  $1800
|Multiply|
| Tables |      2k
 --------  $1000
|GR pg 2 |      1k
 --------  $0c00
|GR pg 1 |      1k
 --------  $0800
|GR pg 0 |      1k
 --------  $0400
|        |      0.5k
 --------  $0200
| stack  |      0.25k
 --------  $0100
|zero pg |      0.25k
 --------  $0000