\title{Making an 8k Low-resolution Graphics Demo for the Apple II}
\author{by DEATER, AKA Vincent M. Weaver}
\section{Why would anyone do this?}
While making an inside-joke filled game for my retro system of choice,
the Apple II, I needed to create a Final-Fantasy-esque
flying-over-the-planet sequence.
I was originally going to fake this, but why fake graphics when you
can laboriously spend weeks implementing the effect for real?
It turns out the Apple II is just barely capable of generating
the effect in real time.
Once I got the code working I realized it would be great as part of a
graphical demo, so off on that tangent I went.
This went well, despite the fact that all I knew about the demoscene I
had learned from a few viewings of the Future Crew {\em Second Reality} demo
combined with dimly remembered Commodore 64 and Amiga usenet flamewars.
While I hope you enjoy the description of the demo and the work that
went into it, I suspect this whole enterprise is primarily of note
due to the dearth of demos for the Apple II platform.
If you are truly interested in seeing impressive Apple II demos,
I would like to make a shout out to FrenchTouch whose works
put this one to shame.
\section{The Hardware}
The Apple II was introduced in 1977.
In theory this demo will run on hardware this old, although I do
not have access to a system of that vintage.
I like to troll Commodore fans by noting this predates the Commodore 64 by
five years.
{\bf CPU, RAM and Storage:}
The Apple II has a 6502 processor running at roughly 1.023MHz.
Early models only shipped with 4k of RAM, but later 48k, 64k, and 128k
systems were common.
While the demo itself fits in 8k, it decompresses to a larger size and uses
a full 48k of RAM;
this would have been very expensive in 1977.
See Figure~\ref{fig:map} for a diagram of the memory map.
Also in 1977 you would probably be loading this from cassette tape.
It would be another year before Woz's single-sided
$5\frac{1}{4}$" Disk II came about (eventually offering 140k of
storage per side with the release of Apple DOS3.3 in 1980).
{\bf Sound:}
The only sound available in a stock Apple II is a bit-banged speaker.
There was no timer interrupt; if you wanted music you had to cycle-count
via the CPU to get the waveforms you needed.
The demo uses a Mockingboard soundcard which was introduced in 1981.
This board contains dual AY-3-8910 sound generation chips connected via
6522 I/O chips.
Each sound chip provides 3 channels of square waves as well as noise and
envelope effects.
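To give a feel for the cycle counting that the stock speaker requires, here is a rough sketch in Python, used only as a calculator and not taken from the demo.
It assumes the approximate 1.023MHz clock quoted above, and the name {\tt cycles\_between\_toggles} is made up for illustration.

```python
CPU_HZ = 1_023_000  # approximate Apple II clock rate quoted in the text

def cycles_between_toggles(freq_hz):
    """CPU cycles to burn between speaker toggles for a square wave.

    One full period of a square wave needs two toggles of the speaker,
    so the delay between toggles is half of the period.
    """
    return CPU_HZ // (2 * freq_hz)
```

A 440Hz tone therefore needs a toggle roughly every 1162 cycles, and every instruction executed in between has to be counted against that budget.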
{\bf Graphics:}
It is hard to imagine now, but the Apple II had nice graphics for its time.
Compared to later competitors, however, it had some limitations.
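\begin{center}
\begin{tabular}{|l|l|}
\hline
Feature & Available \\
\hline
Hardware Sprites & No \\
User-defined charset & No \\
Blanking interrupts & No \\
Palette selection & No \\
Linear framebuffer & No \\
Hardware scrolling & No \\
Hardware page flip & Yes \\
\hline
\end{tabular}
\end{center}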
The hi-res graphics mode is a complex mess of NTSC hacks by Woz.
You get approximately 280x192 resolution, with 6 colors available.
The colors are NTSC artifacts with limitations
on which colors can be next to each other (in blocks of 3.5 pixels).
There is plenty of fringing on edges, and colors change depending on
whether they are drawn at odd or even locations.
To add to the madness, the framebuffer is interleaved in a complex way,
and pixels are drawn least-significant-bit first (all of this to make
DRAM refresh better and to shave a few 7400 series logic chips from the design).
You do get two pages of graphics: Page 1 is at
{\tt \$2000}\footnote{On 6502 systems hexadecimal values are
indicated by the dollar sign.}
and Page 2 is at {\tt \$4000}.
Optionally 4 lines of text can be shown at the bottom of the
screen instead of graphics.
The lo-res mode is a bit easier to use.
It provides 40x48 blocks, reusing the same memory as the 40x24 text mode.
(As with hi-res you can switch to a 40x40 mode with four lines of
text displayed at the bottom).
Fifteen colors are available (there are two greys which are indistinguishable).
Again the addresses are interleaved in a non-linear fashion.
Lo-res Page 1 is at {\tt \$400} and Page 2 is at {\tt \$800}.
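The interleaving can be sketched in a few lines of Python.
This is only a model of the standard text/lo-res layout, not code from the demo, and the name {\tt lores\_address} is invented for illustration.

```python
LORES_PAGE1 = 0x400

def lores_address(x, y, base=LORES_PAGE1):
    """Return (address, use_high_nibble) for lo-res block (x, y).

    x is 0..39 and y is 0..47. Two vertically stacked blocks share one
    byte: the even (upper) row uses the low nibble and the odd (lower)
    row uses the high nibble.
    """
    row = y // 2  # text row 0..23 that holds this block
    addr = base + 0x80 * (row % 8) + 0x28 * (row // 8) + x
    return addr, bool(y & 1)
```

Consecutive text rows are {\tt \$80} bytes apart, and every eighth row wraps back to start {\tt \$28} bytes after the row eight above it, which is why a simple {\tt y*40} does not work.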
Some amazing effects can be achieved by cycle counting, reading
the floating bus, and racing the beam while toggling graphics
modes on the fly.
Unfortunately for you this demo does not do any of those things
so you will not be reading about that today.
\section{Development Setup}
I do all of my coding under Linux, using the nano text editor.
I use the ca65 assembler from the cc65 project, which I find to be a reasonable
tool although many ``real'' Apple II programmers look down on it for some reason.
I cross-compile the code, constructing Apple DOS3.3 disk images using
custom tools I have written.
I test using emulators:
AppleWin (run under the wine emulator) is the easiest to use, but
until recently MESS/MAME had cleaner sound.
Once the code appears to work, I put it on a USB stick and transfer
to actual hardware using a CFFA3000 disk emulator installed in
the actual Apple II (an Apple IIe platinum edition).
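\begin{figure}
% (image not recovered)
\caption{VMW logo hidden in the executable data.\label{fig:vmw}}
\end{figure}
\begin{figure}
% (image not recovered)
\caption{The title screen.\label{fig:title}}
\end{figure}
\begin{figure}
% (image not recovered)
\caption{Bouncing ball on infinite checkerboard.\label{fig:ball}}
\end{figure}
\begin{figure}
% (image not recovered)
\caption{Spaceship flying over an island.\label{fig:tb1}}
\end{figure}
\begin{figure}
% (image not recovered)
\caption{Spaceship with starfield.\label{fig:stars}}
\end{figure}
\begin{figure}
% (image not recovered)
\caption{Rasterbars, stars, and credits. Stealth Susie was a particularly
well-traveled guinea pig.}
\end{figure}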
\section{The Demo}
An Applesoft BASIC ``HELLO'' program loads the binary automatically at bootup.
This does not count towards the executable size, as you could manually BRUN
the 8k machine-language program if you wanted.
To make the loading time slightly more interesting the HELLO program enables
graphics mode and loads the program to address {\tt \$2000} (hi-res Page 1).
This causes the display to be filled with the colorful pattern corresponding
to the compressed image.
This conveniently fills all 8k of the display RAM, or would have
if we had POKEd the right soft-switch to turn off
the bottom 4 lines of text.
Upon loading, execution starts at address {\tt \$2000}.
The binary is encoded with the LZ4 algorithm.
We flip to hi-res Page 2 and decompress to this region so the display
now shows the executable code.
The 6502 size-optimized LZ4 decompression code was written by qkumba
(Peter Ferrie).
The program and data decompress to around 22k starting at {\tt \$4000}.
This over-writes parts of DOS3.3, but since we will not be using the disk
any more this is not an issue.
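For readers unfamiliar with LZ4, the following Python sketch decodes a single raw LZ4 block.
It is an illustration of the block format only: it is not qkumba's 6502 routine, it ignores any file framing the demo may use, and the name {\tt lz4\_block\_decode} is invented here.

```python
def lz4_block_decode(src):
    """Decode one raw LZ4 block (no frame header) into bytes."""
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]
        i += 1
        # High nibble: literal count, extended by extra bytes while 255.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = src[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break
        out += src[i:i + lit_len]
        i += lit_len
        if i >= len(src):
            break  # the final sequence carries literals only
        # Two-byte little-endian distance back into the output so far.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        # Low nibble: match length minus 4, extended the same way.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = src[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4
        # Copy byte by byte so overlapping matches repeat recent output.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)
```

Because a match may overlap the bytes it is producing, long runs of a repeated value cost only a few bytes, which fits the later observation that zero padding is essentially free.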
If you look carefully at the upper left corner of the screen during
decompression you will see my triangular logo, which is supposed to evoke
my VMW initials (see Figure~\ref{fig:vmw}).
To do this I had to put the proper bit pattern inside the code
at the interleaved addresses of {\tt \$4000}, {\tt \$4400}, {\tt \$4800},
and {\tt \$4C00}.
The image data at {\tt \$4000} maps to (mostly)
harmless code so it is left in place and executed.
Making this work turned out to be more trouble than it was worth, especially
as the logo is not visible in the MP4 capture of the demo (the movie
compression does not handle screens full of seemingly random noise well).
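The four addresses above are simply the first four scanlines of hi-res Page 2.
A small Python model of the standard hi-res interleave shows where they come from; the function name {\tt hires\_row\_address} is hypothetical and this is not code from the demo.

```python
HIRES_PAGE2 = 0x4000

def hires_row_address(y, base=HIRES_PAGE2):
    """Address of the first byte of hi-res scanline y (0..191)."""
    return base + 0x400 * (y % 8) + 0x80 * ((y // 8) % 8) + 0x28 * (y // 64)
```

Adjacent scanlines are {\tt \$400} bytes apart, so a logo a few pixels tall has to be scattered through the binary at 1k intervals.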
The demo was optimized to fit in 8k.
Optimizing code inside of a compressed image is much more complicated than
regular size optimization.
Removing instructions sometimes makes the binary {\em larger} as it no longer
compresses as well.
Long runs of values (such as 0 padding) are essentially free.
This mostly turned into an exercise of guess-and-check until everything fit.
\subsection{TITLE SCREEN}
Once decompression is done, execution continues at address {\tt \$4000}.
We switch to lo-res mode for the rest of the demo.
The title screen fades in from black.
This is a software hack as the Apple II does not have palette support.
The image is loaded to an off-screen buffer and a lookup table is used to
copy in the faded versions on the fly.
The title screen is shown in Figure~\ref{fig:title}.
The image is run-length encoded (RLE) which is
probably unnecessary in light of it being further LZ4 encoded.
(The LZ4 compression was a late addition to this endeavor).
Why not save some space and just load our demo at {\tt \$400} and negate
the need
to copy the image in place?
Remember the graphics are 40x48 (shared with the text display region).
It might be easier to think of it as 40x24 characters, with the top / bottom
4-bits of each ASCII character being interpreted as colors for a half-height block.
If you do the math you will find this takes 960 bytes of space, but the memory
map reserves 1k for this mode.
There are ``holes'' in the address range that are not displayed, and
various pieces of hardware can use these as scratchpad memory.
This means just overwriting the whole 1k with data might not work out well
unless you know what you are doing.
To this end our RLE decompression code skips the holes just to be safe.
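The math, and the location of the holes, can be checked with a short Python sketch.
It models the standard 40x24 layout rather than the demo's RLE code, and both function names are invented for illustration.

```python
PAGE_BASE = 0x400
PAGE_SIZE = 0x400  # the 1k that the memory map reserves for the page

def visible_addresses(base=PAGE_BASE):
    """Set of addresses shown by the 40x24 text / 40x48 lo-res display."""
    shown = set()
    for row in range(24):
        start = base + 0x80 * (row % 8) + 0x28 * (row // 8)
        shown.update(range(start, start + 40))
    return shown

def hole_addresses(base=PAGE_BASE):
    """Addresses inside the 1k page that are never displayed."""
    shown = visible_addresses(base)
    return [a for a in range(base, base + PAGE_SIZE) if a not in shown]
```

Each 128-byte chunk holds three 40-byte rows, leaving its last 8 bytes undisplayed; eight such chunks give the 64 bytes that separate 960 from 1k.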
The title screen has scrolling text at the bottom.
This is nothing fancy: the text is in a buffer off screen and a 40x4
chunk of RAM is copied in every so many cycles.
You might notice that there is tearing/jitter in the scrolling even
though we are double-buffering the graphics.
Sadly there is not a reliable cross-platform way to get the VBLANK info
on Apple II machines, especially the older models.
This is even more noticeable in the recorded video, as the capture card and
movie encoding conspire to make this look worse than things look in person.
No demo is complete without some exciting background music.
I like chiptune music, especially the kind written
for AY-3-8910 based systems.
During the long time waiting for my Mockingboard hardware to arrive
I designed and built a Raspberry Pi chiptune player that uses
essentially the same hardware.
This allowed me to build up some expertise with the software/hardware
interface in advance.
The song being played is a stripped down and re-arranged version of
``Electric Wave'' from CC'00 by EA (Ilya Abrosimov).
Most of my sound infrastructure involves YM5 files, a format commonly
used by ZX Spectrum and Atari ST users.
The YM file format is just AY-3-8910 register dumps taken at 50Hz.
To play these back one sets up the sound card to interrupt 50 times a second
and then writes out the 14 register values from each frame in an interrupt handler.
% registers that control the 3 channels. Each AY chip has a dedicated
% VIA 6522 parallel I/O chip that handles the I/O.
Writing out the registers quickly enough is a challenge on the Apple II.
For each register you have to do a handshake then set both the register
number and the value.
It is hard to do this in less than forty 1MHz cycles for each register.
With complex chiptune files (especially those written on an ST with much
faster hardware) it is sometimes not possible to get exact playback
due to the delay.
Further slowdown happens as you want to write both AY chips (the output
is stereo, with one AY on the left and one on the right).
To help with latency on playback we keep track of the last frame written
and only write to the registers that have changed.
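The idea of writing only the changed registers can be sketched as follows.
This is a Python model rather than the demo's interrupt handler; {\tt write\_register} stands in for the 6522 handshake and is a hypothetical callback.

```python
NUM_REGS = 14  # AY-3-8910 registers stored per frame, as in the text

def changed_registers(prev_frame, frame):
    """Return (register, value) pairs that differ from the previous frame."""
    return [(reg, val)
            for reg, (old, val) in enumerate(zip(prev_frame, frame))
            if old != val]

def play(frames, write_register):
    """Feed frames to write_register(reg, value), skipping unchanged values."""
    prev = [None] * NUM_REGS  # None forces every register out on frame 0
    for frame in frames:
        for reg, val in changed_registers(prev, frame):
            write_register(reg, val)
        prev = list(frame)
```

On a frame where only one channel changes volume, this costs a single handshake instead of fourteen.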
Our code detects the Mockingboard at startup; we are lazy and only support
finding the card in Slot 4 (which is a fairly typical location).
The board is initialized, and then one of the 6522 timers is set to
interrupt at 25Hz.
Why 25Hz and not 50Hz? At 50Hz with 14 registers you use 700 bytes/s.
So a 2 minute song would take 84k of RAM, which is much more than is available.
To allow the song to fit in memory (without the fancy circular-buffer
decompression routine used in my VMW Chiptune music-disk demo) we have
to reduce the size.
First the music is changed so it only needs to be updated at 25Hz.
Then the register data is compressed from 14 bytes to 11 bytes by stripping off
the envelope effects and packing together fields that have unused bits.
In the end the sound quality suffered a bit, but we were able to fit an
acceptably catchy chiptune inside of our 8k payload.
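The savings are easy to work out.
The Python sketch below reuses the two-minute example from above; the real length of the stripped-down song is not stated here, so treat the totals as illustrative.

```python
def song_bytes(update_hz, bytes_per_frame, seconds):
    """Uncompressed size of a register-dump song."""
    return update_hz * bytes_per_frame * seconds

TWO_MINUTES = 120  # example duration taken from the text

full = song_bytes(50, 14, TWO_MINUTES)     # 50Hz, all 14 registers
reduced = song_bytes(25, 11, TWO_MINUTES)  # 25Hz, 11 packed bytes per frame
```

Halving the update rate and trimming each frame brings the example from 84000 bytes down to 33000 bytes, before LZ4 gets its turn.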
\subsection{MODE7 BACKGROUND}
``Mode7'' is a Super Nintendo (SNES) graphics mode that takes a tiled
background and transforms it by rotating and scaling.
The most common effect squashes the background out to the horizon, giving
a three-dimensional look.
The SNES did these transforms in hardware, but our demo must do
them in software.
% As found on Wikipedia, the transform is of the type
% [x'] = [a b]([x]-[x0])+[x0]
% [y'] [c d]([y] [y0]) [y0]
Our algorithm is based on code by Martijn van Iersel.
It iterates through each y line on the screen and calculates based on
the camera location: height ({\em spacez}), x and y coordinates
({\em cx} and {\em cy}) and the {\em angle}.
First calculate the distance (here {\tt z} is the camera height, {\em spacez}):
\begin{verbatim}
d = (z*yscale)/(y+horizon)
\end{verbatim}
Then calculate the horizontal scale (distance between points on
this line)
\begin{verbatim}
h = d/xscale
\end{verbatim}
Then calculate delta x and delta y values
\begin{verbatim}
dx = -sin(angle)*h
dy = cos(angle)*h
\end{verbatim}
It then calculates the starting offset of the left side of the line in
the tile lookup:
\begin{verbatim}
tilex = cx + (d*cos(angle)) - (width/2) * dx;
tiley = cy + (d*sin(angle)) - (width/2) * dy;
\end{verbatim}
Now iterate the inner loop, where we lookup the tile color for each pixel
on the horizontal line.
\begin{verbatim}
for (x = 0; x < width; x++) {
    putpixel(x, y, tilelookup(tilex, tiley));
    tilex += dx;
    tiley += dy;
}
\end{verbatim}
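Putting the steps together, a floating-point Python sketch of one frame might look like the following.
It follows the formulas above term by term, but it is not the demo's 6502 code: the {\tt checkerboard} lookup, the function name {\tt render\_mode7}, and the default {\tt horizon}, {\tt xscale} and {\tt yscale} values are all assumptions chosen for illustration.

```python
import math

def checkerboard(tx, ty):
    """Hypothetical tile lookup: alternate 0/1 on unit squares."""
    return (math.floor(tx) + math.floor(ty)) & 1

def render_mode7(cx, cy, spacez, angle, width=40, height=40,
                 horizon=1.0, xscale=16.0, yscale=16.0,
                 tilelookup=checkerboard):
    """Return a height x width grid of tile colors for one frame."""
    frame = []
    for y in range(height):
        d = (spacez * yscale) / (y + horizon)  # distance to this line
        h = d / xscale                         # step between pixels
        dx = -math.sin(angle) * h
        dy = math.cos(angle) * h
        # Tile coordinates of the leftmost pixel on this line.
        tilex = cx + d * math.cos(angle) - (width / 2) * dx
        tiley = cy + d * math.sin(angle) - (width / 2) * dy
        row = []
        for x in range(width):
            row.append(tilelookup(tilex, tiley))
            tilex += dx
            tiley += dy
        frame.append(row)
    return frame
```

Note that everything outside the inner loop depends only on {\tt y} and the camera, so the per-pixel cost is two additions and a lookup.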
{\bf Optimizations}
We sped this algorithm up in several ways.
The per-pixel work is reduced to a small number of additions
and subtractions.
The 6502 has no floating-point support, so we use fixed-point math.
As much as possible is converted to pre-calculated table lookups,
and we make liberal use of self-modifying code.
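As a reminder of how 8.8 fixed point behaves, here is a minimal Python sketch.
It assumes signed two's-complement values held in 16 bits; whether the demo treats every quantity as signed is not stated, and the helper names are invented.

```python
def to_fixed(value):
    """Convert a float to 8.8 fixed point, wrapped to 16 bits."""
    return int(round(value * 256)) & 0xFFFF

def to_float(fixed):
    """Convert 16-bit two's-complement 8.8 fixed point back to a float."""
    if fixed & 0x8000:
        fixed -= 0x10000
    return fixed / 256

def fixed_add(a, b):
    """Adding 8.8 values is plain 16-bit integer addition."""
    return (a + b) & 0xFFFF
```

The high byte is the integer part and the low byte is the fraction, so on the 6502 an 8.8 add is just a two-byte add with carry.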
{\bf Fast Multiply:}
Despite all of this there are still some cases where we have to do a
16-bit $\times$ 16-bit = 32-bit multiply, something that is {\em really} slow
on the 6502: around 700 cycles (for an 8.8 $\times$ 8.8 fixed-point multiply).
To make this faster we use a method described by Stephen Judd.
The key is to note that $(a+b)^{2} = a^{2}+2ab+b^{2}$
and $(a-b)^{2}=a^{2}-2ab+b^{2}$;
subtracting the second from the first gives $(a+b)^{2}-(a-b)^{2}=4ab$,
which simplifies to:
$a\times b =\frac{(a+b)^{2}}{4} - \frac{(a-b)^2}{4}$
This means that if you have a table of squares from 0..511 (all 8-bit a+b
and a-b will fall in this range) then you can convert a multiply into a pair
of table lookups plus a subtract.
The downside is you will need 2kB of squares lookup tables (which can
be generated at startup). This reduces the multiply cost to the order
of 200 to 250 cycles.
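The identity is easy to verify in Python.
The sketch below is a model of the technique, not the demo's routine: it keeps one table of whole integers, whereas a 6502 version would split the table into separate byte-wide pieces, and the names {\tt mul8} and {\tt mul16} are made up.

```python
# f(x) = floor(x*x / 4) for every possible 8-bit sum, 0..510.
QUARTER_SQUARES = [x * x // 4 for x in range(512)]

def mul8(a, b):
    """8-bit x 8-bit unsigned multiply via two lookups and a subtract."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]

def mul16(a, b):
    """16-bit x 16-bit = 32-bit product built from four 8-bit multiplies."""
    a_lo, a_hi = a & 0xFF, a >> 8
    b_lo, b_hi = b & 0xFF, b >> 8
    return (mul8(a_lo, b_lo)
            + ((mul8(a_lo, b_hi) + mul8(a_hi, b_lo)) << 8)
            + (mul8(a_hi, b_hi) << 16))
```

Rounding each quarter-square down loses nothing: $a+b$ and $a-b$ are always both even or both odd, so the discarded fractions cancel in the subtraction.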
By using the fast multiply and a lot of careful optimization you can
generate a Mode7 background in 40x40 graphics mode at about 5 frames/second.
The engine can be parameterized with different tilesets to use, which we
do to provide both a black+white checkerboard background, as well as the
island background from the TFV game.
The first Mode7 scene transpires on an infinite checkerboard.
A demo would be incomplete without some sort of bouncing geometric solid;
in this case we have a pink sphere.
The sphere is represented by 16 sprites that were captured from
a 20-year-old OpenGL game engine.
Screenshots were taken then reduced to the proper size and color.
The shadows are also just sprites.
The clicking noise on bounce is generated by accessing the speaker port
at address {\tt \$C030}.
This gives some sound for those viewing the demo without the benefit
of a Mockingboard.
This next scene has a spaceship flying over an island.
The spaceship, water splash, and shadows are all sprites.
They are all drawn in software as the Apple II has no sprite hardware.
The path the ship takes is pre-recorded; this is adapted from the
Talbot Fantasy~7 game engine with the keyboard code replaced by a hard-coded
script of actions to take.
The spaceship takes to the stars.
This is typical starfield code.
Only 16 stars are modeled, and the movement code re-uses the
same fast-multiply routine described previously.
The star positions require random number generation, but this is not
fast on the 6502.
Originally we had a 256-byte blob of pre-generated ``random'' values
included in the code.
This wasted space, so now we instead treat our own code at address
{\tt \$5000} as if it were a block of random numbers.
This address was chosen arbitrarily, and it is not as random as it could be:
when the ship enters hyperspace, the lower right quadrant has fewer
stars than one could desire.
A simple state machine controls star speed, ship movement, hyperspace,
background color (for the blue flash) and the eventual sequence of sprites
as the ship vanishes into the distance.
Once the ship has departed, it is time for the credits as the stars
continue to run.
The text is written to the bottom 4 lines of the screen and appears
to be surrounded by lo-res graphics blocks.
Mixed graphics/text would generally not be possible on the Apple II, although
with careful cycle counting and mode switching groups such as FrenchTouch
have achieved this effect.
I was lazy and instead used inverse-mode space characters which appear the same
as white graphics blocks.
The rasterbar effect is not really rasterbars; it is just a colorful assortment
of horizontal lines drawn at a location determined with a sine lookup table.
Horizontal lines can take a surprising amount of time to draw, so this
was optimized using inlining and a few other methods.
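A sine lookup table for positioning bars can be sketched as follows.
The table length, bar height and function name are all hypothetical; the demo's actual table and bar layout are not described here.

```python
import math

TABLE_SIZE = 64   # hypothetical table length
SCREEN_ROWS = 40  # lo-res rows above the four text lines
BAR_HEIGHT = 4    # hypothetical bar thickness in rows

# Precomputed top row for a bar at each phase, kept within the screen.
SINE_TABLE = [
    round((SCREEN_ROWS - BAR_HEIGHT) / 2
          * (1 + math.sin(2 * math.pi * i / TABLE_SIZE)))
    for i in range(TABLE_SIZE)
]

def bar_rows(frame, phase_offset=0):
    """Rows covered by one bar on a given frame."""
    top = SINE_TABLE[(frame + phase_offset) % TABLE_SIZE]
    return list(range(top, top + BAR_HEIGHT))
```

Giving each bar a different {\tt phase\_offset} makes the bars follow one another along the same wave, with no trigonometry at run time.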
The rotating text is done by just rapidly rotating the output string through
the ASCII table, with the clicking effect again produced by hitting the speaker
at address {\tt \$C030}.
The list of people to thank ended up being extremely critical to fitting in 8kB,
as unique text strings do not compress well.
I apologize to everyone whose moniker got compressed beyond recognition,
and I am still not totally happy with the centering of the text.
\section{Obtaining the Code}
More details, disk image, and full source can be found at the website:
\begin{figure}
\begin{verbatim}
 ------------- $ffff
| ROM/IO      |
 ------------- $c000
|             |
| Uncompressed|
| Code/Data   |
|             |
 ------------- $4000
| Compressed  |
| Code        |
 ------------- $2000
| free        |
 ------------- $1c00
| Scroll      |
| Data        |
 ------------- $1800
| Multiply    |
| Tables      |
 ------------- $1000
| LORES pg 3  |
 ------------- $0c00
| LORES pg 2  |
 ------------- $0800
| LORES pg 1  |
 ------------- $0400
|free/vectors |
 ------------- $0200
| stack       |
 ------------- $0100
| zero pg     |
 ------------- $0000
\end{verbatim}
\caption{Memory Map (not to scale)\label{fig:map}}
\end{figure}