dos33fsprogs/mode7_demo/docs/mode7_demo.tex

\documentclass[twocolumn]{article}
\usepackage{graphicx}
\usepackage{url}
\usepackage{hyperref}
\usepackage{fancyvrb}
\usepackage{fancyhdr}

\usepackage{hyperref}

%\usepackage{graphicx}
\usepackage{colortbl}
\usepackage{multirow}

\pagestyle{fancy}

\fancypagestyle{firststyle}
{
   \fancyhf{}
   \fancyhead[C]{A version of this document appeared in PoC~\textbar\textbar~GTFO 0x18}
   \fancyfoot{}
}


%\fancyhead{}
%\fancyfoot{}
%\fancyhead[CO,CE]{A version of this document appeared in PoC || GTFO 0x18}
%\fancyfoot[C] {\thepage}
%\renewcommand{\headrulewidth}{0pt}
%\renewcommand{\footrulewidth}{0pt}


\begin{document}

\title{Making an 8k Low-resolution Graphics Demo for the Apple II}
\author{by DEATER, AKA Vincent M. Weaver}
\date{}
\maketitle

\thispagestyle{firststyle}

\section{Why would anyone do this?}

While making an inside-joke filled game for my retro system of choice,
the Apple~II, I needed to create a Final-Fantasy-esque
flying-over-the-planet sequence.
I was originally going to fake this, but why fake graphics when you
can laboriously spend weeks implementing the effect for real?
It turns out the Apple~II is just barely capable of generating
the effect in real time.

Once I got the code working I realized it would be great as part of a
graphical demo, so off on that tangent I went.
This turned out well, despite the fact that all I knew about the demoscene I
had learned from a few viewings of the Future Crew {\em Second Reality} demo
combined with dimly remembered Commodore 64 and Amiga usenet flamewars.

% from a few decades ago.
%  This started out as some SNES style mode7 pseudo-3d graphics code
%  I came up with while working on my TF7 game.  The graphics looked
%  pretty cool, so I started developing a demo around it.

%To make thins even better, the code ended up being roughly around 8kB so a
%lot of time was wasted fitting it under that arbitrary size limitation.

While I hope you enjoy the description of the demo and the work that
went into it, I suspect this whole enterprise is primarily of note
due to the dearth of demos for the Apple~II platform.
%So in the end this ends up being impressive mostly because so few people
%have bothered to write demos for this particular platform.
If you are truly interested in seeing impressive Apple~II demos,
I would like to make a shout out to FrenchTouch whose works
put this one to shame.

%  The codesize ended up being roughly around 8kB, so I thought I'd
%  make it into an 8k demo.  There aren't many out there for the Apple II.
%  and a Mockingboard sound card.

%  The demo tries to hit the lowest common denominator for Apple II systems,
%  so in theory you could have run this on an Apple II in 1977 if you
%  were rich enough to afford 48k of RAM.  The Mockingboard sound wasn't
%  available until 1981, but still this all predates the Commodore 64.

%I was writing a game for the Apple II and realized I had come up with
%some clever Super-Nintendo (SNES) style graphics routines that were just
%crying to be turned into a demo-scene style demo.

%The Apple II was the first computer I had access too, and I grew up in an odd
%neighborhood where it was all Apples and not a Commodore to be seen.
%My family long ago got rid of our machine, but I rescued an Apple IIe platinum
%from the dumpster one day and have dragged it from state to state ever since.

%I find 6502 assembly to be oddly therapeutic, and will code in it when other
%projects become too stressful.  Especially when Linux up and hangs on me
%because firefox tried to do something stupid in javascript.  I then pine for
%the days when you could do something useful in 64k of RAM, and not have your
%machine fall over because somehow 4GB is not enough.

%Background:

%The Apple II was the first computer I programmed on, lo many years ago.
%Mostly in Applesoft BASIC (which ended up being the only Microsoft product
%I ever liked) but I was starting to get into assembly language about the
%time my family got a 386 system.

%I've revisited over the years, with some 6502 programming to show I could.
%My skills were not that great, I had one of my size-optimization projects
%crowd re-optimized.  For a while I had a side-gig re-optimizing modern games
%in BASIC, before getting sidetracked into going full in on 6502 assembly
%again.

%Introduced in 1977.
%The Apple II runs at 1.XX check Megahertz.  6502, which can easily
%address 64 kB of RAM (more with bank switching).  Shipped with as little
%as 4kB of RAM.  Three registers, (A,X,Y) but a large ``zero page'' which
%gives you register-like actions on the first 256 bytes of RAM.
%
%DOS3.3 operating system with 140k floppies.  Amazing programming by Wozniak,
%allowing all kinds of floppy protection shenanigans (cite 4am, previous
%article).

\section{The Hardware}

The Apple~II was introduced in 1977.
In theory this demo will run on hardware that old, although I do
not have access to a system of that vintage.
I like to troll Commodore fans by noting this predates the Commodore 64 by
five years.


\vspace{1ex}
\noindent
{\bf CPU, RAM and Storage:}

The Apple II has a 6502 processor running at roughly 1.023MHz.
Early models only shipped with 4k of RAM, but later 48k, 64k, and 128k
systems were common.
While the demo itself fits in 8k, it decompresses to a larger size and uses
a full 48k of RAM;
this would have been very expensive in 1977.
See Figure~\ref{fig:map} for a diagram of the memory map.

Also in 1977 you would probably be loading this from cassette tape.
It would be another year before Woz's single-sided
$5\frac{1}{4}$" Disk II came about (eventually offering 140k of
storage per side with the release of Apple DOS3.3 in 1980).

\vspace{1ex}
\noindent
{\bf Sound:}

The only sound available in a stock Apple II is a bit-banged speaker.
There was no timer interrupt; if you wanted music you had to cycle-count
via the CPU to get the waveforms you needed.

The demo uses a Mockingboard soundcard which was introduced in 1981.
This board contains dual AY-3-8910 sound generation chips connected via
6522 I/O chips.
Each sound chip provides 3 channels of square waves as well as noise and
envelope effects.

\vspace{1ex}
\noindent
{\bf Graphics:}

It is hard to imagine now, but the Apple II had nice graphics for its time.
Compared to later competitors, however, it had some limitations.

\begin{center}
\begin{tabular}{|c|c|}
\hline
Hardware Sprites     &	No \\
User-defined charset &	No \\
Blanking interrupts  &	No \\
Palette selection    &	No \\
Linear framebuffer   &	No \\
Hardware scrolling   &	No \\
Hardware page flip   &	Yes \\
\hline
\end{tabular}
\end{center}

The hi-res graphics mode is a complex mess of NTSC hacks by Woz.
You get approximately 280x192 resolution, with 6 colors available.
The colors are NTSC artifacts with limitations
on which colors can be next to each other (in blocks of 3.5 pixels).
There is plenty of fringing on edges, and colors change depending on
whether they are drawn at odd or even locations.
To add to the madness, the framebuffer is interleaved in a complex way,
and pixels are drawn least-significant-bit first (all of this to get
DRAM refresh for free and to shave a few 7400 series logic chips from
the design).
You do get two pages of graphics, Page 1 is at
{\tt \$2000}\footnote{On 6502 systems hexadecimal values are
traditionally indicated by a dollar sign}
and Page 2 at {\tt \$4000}.
Optionally 4 lines of text can be shown at the bottom of the
screen instead of graphics.

The lo-res mode is a bit easier to use.
It provides 40x48 blocks, reusing the same memory as the 40x24 text mode.
(As with hi-res you can switch to a 40x40 mode with four lines of
text displayed at the bottom).
Fifteen colors are available (there are two greys which are indistinguishable).
Again the addresses are interleaved in a non-linear fashion.
Lo-res Page 1 is at {\tt \$400} and Page 2 is at {\tt \$800}.

Some amazing effects can be achieved by cycle counting, reading
the floating bus, and racing the beam while toggling graphics
modes on the fly.
Unfortunately for you this demo does not do any of those things
so you will not be reading about that today.

%Later models added double low-res (80x48) and double hi-res (x y in
%NTSC 15 color) but didn't appear until 198x, and only on later IIe, IIc
%models.

%Apple also came out with the IIgs which arguably was much more advanced
%and cheaper than the Mac, but Apple cancelled the II line much to the
%sadness of the users (Apple II forever).


\section{Development Setup}

I do all of my coding under Linux, using the nano text editor.
I use the ca65 assembler from the cc65 project, which I find to be a reasonable
tool although many ``real'' Apple II programmers look down on it for some
reason.
I cross-compile the code, constructing Apple DOS3.3 disk images using
custom tools I have written.
I test using emulators:
AppleWin (run under the wine emulator) is the easiest to use, but
until recently MESS/MAME had cleaner sound.

Once the code appears to work, I put it on a USB stick and transfer
to actual hardware using a CFFA3000 disk emulator installed in
the actual Apple II (an Apple IIe platinum edition).

%\section{Related Work}
%
%See anything by the group FrenchTouch, whose Apple II demos outclass
%mine by a lot.


%  http://www.deater.net/weave/vmwprod/mode7_demo/


\begin{figure}[tb]
\begin{center}
\includegraphics[width=2in]{figures/hidden_vmw.png}
\end{center}
\caption{VMW logo hidden in the executable data.\label{fig:vmw}}
\end{figure}

\begin{figure}[tb]
\begin{center}
\includegraphics[width=\columnwidth]{figures/mode7_demo_title.png}
\end{center}
\caption{The title screen.\label{fig:title}}
\end{figure}

\begin{figure}[tb]
\begin{center}
\includegraphics[width=\columnwidth]{figures/m7_screen1.png}
\caption{Bouncing ball on infinite checkerboard.\label{fig:ball}}
\end{center}
\end{figure}

\begin{figure}[tb]
\begin{center}
\includegraphics[width=\columnwidth]{figures/m7_screen4.png}
\caption{Spaceship flying over an island.\label{fig:tb1}}
\end{center}
\end{figure}

\begin{figure}[tb]
\begin{center}
\includegraphics[width=\columnwidth]{figures/m7_screen3.png}
\end{center}
\caption{Spaceship with starfield.\label{fig:stars}}
\end{figure}

\begin{figure}[tb]
\begin{center}
\includegraphics[width=\columnwidth]{figures/m7_screen2.png}
\end{center}
\caption{Rasterbars, stars, and credits.  Stealth Susie was a particularly
well-traveled guinea pig.
\label{fig:credits}}
\end{figure}


\section{The Demo}

\subsection{BOOTLOADER}

An Applesoft BASIC ``HELLO'' program loads the binary automatically at bootup.
This does not count towards the executable size, as you could manually BRUN
the 8k machine-language program if you wanted.

To make the loading time slightly more interesting the HELLO program enables
graphics mode and loads the program to address {\tt \$2000} (hi-res page1).
This causes the display to filled with the colorful pattern corresponding
to the compressed image.
This conveniently fills all 8k of the display RAM, or would have
if we had POKEd the right soft-switch to turn off
the bottom 4 lines of text.

Upon loading, execution starts at address {\tt \$2000}.

\subsection{DECOMPRESSION}

The binary is encoded with the LZ4 algorithm.
We flip to hi-res Page 2 and decompress to this region so the display
now shows the executable code.

The 6502 size-optimized LZ4 decompression code was written by qkumba
(Peter Ferrie).
%	http://pferrie.host22.com/misc/appleii.htm
The program and data decompress to around 22k starting at {\tt \$4000}.
This over-writes parts of DOS3.3, but since we are done with the disk
this is not an issue.

If you look carefully at the upper left corner of the screen during
decompress you will see my triangular logo, which is supposed to evoke
my VMW initials (see Figure~\ref{fig:vmw}).
To do this I had to put the proper bit pattern inside the code
at the interleaved addresses of {\tt \$4000}, {\tt \$4400}, {\tt \$4800},
and {\tt \$4C00}.
The image data at {\tt \$4000} maps to (mostly)
harmless code so it is left in place and executed.
Making this work turned out to be more trouble than it was worth, especially
as the logo is not visible in the youtube capture of the demo (the video
compression does not handle screens full of seemingly random noise well).

The demo was optimized to fit in 8k.
Optimizing code inside of a compressed image is much more complicated than
regular size optimization.
Removing instructions sometimes makes the binary {\em larger} as it no longer
compresses as well.
Long runs of values (such as 0 padding) are essentially free.
This mostly turned into an exercise of guess-and-check until everything fit.


\subsection{TITLE SCREEN}

Once decompression is done, execution continues at address {\tt \$4000}.
We switch to low-res mode for the rest of the demo.

\noindent
{\bf FADE EFFECT}:
The title screen fades in from black.
This is a software hack as the Apple II does not have palette support.
The image is loaded to an off-screen buffer and a lookup table is used to
copy in the faded versions on the fly.

\noindent
{\bf TITLE GRAPHICS}:
The title screen is shown in Figure~\ref{fig:title}.
The image is run-length encoded (RLE) which is
probably unnecessary in light of it being further LZ4 encoded.
(The LZ4 compression was a late addition to this endeavor).

Why not save some space and just load our demo at {\tt \$400} and negate
the need
to copy the image in place?
Remember the graphics are 40x48 (shared with the text display region).
It might be easier to think of it as 40x24 characters, with the top / bottom
4-bits of each ASCII character being interpreted as colors for a half-height
block.
If you do the math you will find this takes 960 bytes of space, but the memory
map reserves 1k for this mode.
There are ``holes'' in the address range that are not displayed, and
various pieces of hardware can use these as scratchpad memory.
This means just overwriting the whole 1k with data might not work out well
unless you know what you are doing.
To this end our RLE decompression code skips the holes just to be safe.

\noindent
{\bf SCROLL TEXT}:
The title screen has scrolling text at the bottom.
This is nothing fancy, the text is in a buffer off screen and a 40x4
chunk of RAM is copied in every so many cycles.
You might notice that there is tearing/jitter in the scrolling even
though we are double-buffering the graphics.
Sadly there is not a reliable cross-platform way to get the VBLANK info
on Apple II machines, especially the older models.
This is even more noticeable in the recorded video, as the capture card and
video encoding conspire to make this look worse than things look in person.

\subsection{MOCKINGBOARD MUSIC}

No demo is complete without some exciting background music.
I like chiptune music, especially the kind written
for AY-3-8910 based systems.
During the long time waiting for my Mockingboard hardware to arrive
I designed and built a Raspberry Pi chiptune player that uses
essentially the same hardware.
This allowed me to build up some expertise with the software/hardware
interface in advance.

The song being played is a stripped down and re-arranged version of
``Electric Wave'' from CC'00 by EA (Ilya Abrosimov).

Most of my sound infrastructure involves YM5 files, a format commonly
used by ZX Spectrum and Atari ST users.
The YM file format is just AY-3-8910 register dumps taken at 50Hz.
To play these back one sets up the sound card to interrupt 50 times a second
and then writes out the 14 register values from each frame in an interrupt
handler.

%   To program the Mockingboard, each AY-3-8910 chip has 14 sound related
%   registers that control the 3 channels.  Each AY chip has a dedicated
%   VIA 6522 parallel I/O chip that handles the I/O.

Writing out the registers quickly enough is a challenge on the Apple II.
For each register you have to do a handshake then set both the register
number and the value.
It is hard to do this in less than forty 1MHz cycles for each register.
With complex chiptune files (especially those written on an ST with much
faster hardware) it is sometimes not possible to get exact playback
due to the delay.
Further slowdown happens as you want to write both AY chips (the output
is stereo, with one AY on the left and one on the right).
To help with latency on playback we keep track of the last frame written
and only write to the registers that have changed.

%   I have a whole suite of code for manipulating YM sound data, in my
%   vmw-meter git repository.

Our code detects the Mockingboard at startup; we are lazy and only support
finding the card in Slot 4 (which is a fairly typically location).
%   The first step for getting this to work is detecting if a Mockingboard is
%%  there.  This can be in any slot 1-7 on the Apple II, though typically
%   Slot 4 is standard (in this demo we only check slot 4).
The board is initialized, and then one of the 6522 timers is set to
interrupt at 25Hz.
% (it has to be an on-board timer as the default
%   Apple II has no timers).
Why 25Hz and not 50Hz?  At 50Hz with 14 registers you use 700 bytes/s.
So a 2 minute song would take 84k of RAM, which is much more than is available.
Also the Disk II requires hard real-time response involving the full
CPU to read from disk, so it is not possible to read more data while
the demo is running.
To allow the song to fit in memory (without the fancy circular buffer
decompression routine utilized in my VMW Chiptune music-disk demo) we have
to reduce the size.
First the music is changed so it only needs to be updated at 25Hz.
Then the register data is compressed from 14 bytes to 11 bytes by stripping off
the envelope effects and packing together fields that have unused bits.
In the end the sound quality suffered a bit, but we were able to fit an
acceptably catchy chiptune inside of our 8k payload.

\subsection{MODE7 BACKGROUND}

``Mode7'' is a Super Nintendo (SNES) graphics mode that takes a tiled
background and transforms it by rotating and scaling.
The most common effect squashes the background out to the horizon, giving
a three-dimensional look.
The SNES did these transforms in hardware, but our demo must do
them in software.

%  As found on Wikipedia, the transform is of the type
%
%  [x'] = [a b]([x]-[x0])+[x0]
%  [y']   [c d]([y] [y0]) [y0]

% http://www.helixsoft.nl/articles/circle/sincos.htm

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our algorithm is based on code by Martijn van Iersel.
It iterates through each horizontal line on the screen and calculates the color
to output based on the camera height ({\em spacez}) and {\em angle} as well
as the current x and y coordinates ({\em cx} and {\em cy}).

First the distance {\em d} is calculated based on fixed scale and
distance-to-horizon factors.
Instead of a costly division we use a pre-generated lookup table for this.
	\[d = \frac{z \times yscale}{y+horizon}\]
Next calculate the horizontal scale (distance between points on
this line):
	\[h = \frac{d}{xscale}\]
Then calculate delta x and delta y values between each block on the line.
We use a pre-computed sine/cosine lookup table.

\pagebreak

	\[dx = -sin(angle) \times h\]
	\[dy = cos(angle) \times h\]
The leftmost position in the tile lookup is calculated:
	\[tilex = cx + (d*cos(angle) - (width/2) * dx\]
	\[tiley = cy + (d*sin(angle) - (width/2) * dy\]
Then an inner loop happens that adds dx and dy as we lookup the color
from the tilemap (just a wrap-around array lookup) for each block
on the line.
	\[color = tilelookup(tilex,tiley)\]
	\[plot (x, y) \]
	\[tilex += dx, tiley+= dy\]

\noindent
{\bf Optimizations:}
The 6502 processor cannot do floating point, so all of our routines use
8.8 fixed point math.
We eliminate all use of division, and convert as much as possible
to table lookups (which involves limiting the heights and angles a bit).
We also save some cycles by using self-modifying code,
most notably hard-coding the height (z) value and modifying the code
whenever this is changed.
The code started out only capable of roughly 4.9fps in 40x20 resolution
and in the end we improved this to 5.7fps in 40x40 resolution.
Care was taken to optimize the innermost loop, as every cycle saved there
results in 1280 cycles saved overall.

\noindent
{\bf Fast Multiply:}
One of the biggest bottlenecks in the mode7 code was the multiply.
Even our optimized algorithm calls for at least seven
16bit x 16bit = 32bit multiplies, something that is {\em really} slow on
the 6502.
A typical implementation takes around 700 cycles
for a 8.8 x 8.8 fixed point multiply.

% Note, this is Quarter-square multiplication, apparently an ancient algorithm
% https://en.wikipedia.org/wiki/Multiplication_algorithm#Quarter_square_multiplication

We improved this by using the ancient quarter-square
multiply algorithm, first described for 6502 use by Stephen Judd.

This works by noting these factorizations:
	\[(a+b)^{2} = a^{2}+2ab+b^{2}\]
	\[(a-b)^{2}=a^{2}-2ab+b^{2}\]
If you subtract these you can simplify to
	\[a\times b =\frac{(a+b)^{2}}{4} - \frac{(a-b)^2}{4}\]

For 8-bit values if you create a table of squares from 0 to 511
(all 8-bit a+b and a-b fall in this range) then you can convert a multiply
into two table lookups and a subtraction.
This does have the downside of requiring 2kB of lookup tables
(which can be generated at startup) but it reduces the multiply
cost to the order of 250 cycles or so.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{BALL ON CHECKERBOARD}

The first Mode7 scene transpires on an infinite checkerboard.
A demo would be incomplete without some sort of bouncing geometric solid,
in this case we have a pink sphere.
The sphere is represented by 16 sprites that were captured from
a 20 year old OpenGL game engine.
Screenshots were taken then reduced to the proper size and color
limitations.
The shadows are also just sprites.
Note that the Apple II has no dedicated sprite hardware, so these
are drawn completely in software.

The clicking noise on bounce is generated by accessing the speaker port
at address {\tt \$C030}.
This gives some sound for those viewing the demo without the benefit
of a Mockingboard.

\subsection{TFV SPACESHIP FLYING}

This next scene has a spaceship flying over an island.
The Mode7 graphics code is generic enough that only one copy of the code
is needed to generate both the checkerboard and island scenes.
The spaceship, water splash, and shadows are all sprites.
The path the ship takes is pre-recorded; this is adapted from the
Talbot Fantasy~7 game engine with the keyboard code replaced by a hard-coded
script of actions to take.

\subsection{STARFIELD}

The spaceship now takes to the stars.
This is typical starfield code, where on each iteration the x and y
values are changed by
	\[dx=\frac{x}{z}, dy=\frac{y}{z}\]
In order to get a good frame rate and not clutter the lo-res screen
only 16 stars are modeled.
To avoid having to divide, the reciprocal of all possible z values
are stored in a table, and the fast-multiply routine described
previously is used.

The star positions require random number generation, but there is no
easy way to quickly get random data on the Apple II.
Originally we had a 256-byte blob of pre-generated ``random'' values
included in the code.
This wasted space, so now instead we just use our code at address
at \$5000 as if it were a block of random numbers.
This was arbitrarily chosen, and it is not as random as it could be
as seen when the ship enters hyperspace and the lower-right quadrant
is distressingly star-free.

A simple state machine controls star speed, ship movement, hyperspace,
background color (for the blue flash) and the eventual sequence of sprites
as the ship vanishes into the distance.

\subsection{RASTERBARS/CREDITS}

Once the ship has departed, it is time to run the credits as the stars
continue to fly by.

The text is written to the bottom four lines of the screen, seemingly
surrounded by graphics blocks.
Mixed graphics/text is generally not be possible on the Apple II, although
with careful cycle counting and mode switching groups such as FrenchTouch
have achieved this effect.
What we see in this demo is the use of inverse-mode (inverted color)
space characters which appear the same as white graphics blocks.

The rasterbar effect is not really rasterbars, just a colorful assortment
of horizontal lines drawn at a location determined with a sine lookup table.
Horizontal lines can take a surprising amount of time to draw, but these
were optimized using inlining and a few other tricks.

The spinning text is done by just rapidly rotating the output string through
the ASCII table, with the clicking effect again generated
by hitting the speaker at address {\tt \$C030}.
The list of people to thank ended up being the primary limitation to
fitting in 8kB, as unique text strings do not compress well.
I apologize to everyone whose moniker got compressed beyond recognition,
and I am still not totally happy with the centering of the text.

\section{Obtaining the Code}

More details, disk image, and full source can be found at the website:
\url{http://www.deater.net/weave/vmwprod/mode7_demo/}

%\section{Appendix: Memory Map}


\begin{figure}
\begin{center}
\begin{scriptsize}
\begin{BVerbatim}
 -------------  $ffff
|    ROM/IO   |
 -------------  $c000
|             |
| Uncompressed|
| Code/Data   |
|             |
 -------------  $4000
| Compressed  |
|   Code      |
 -------------  $2000
|   free      |
 -------------  $1c00
|   Scroll    |
|    Data     |
 -------------  $1800
|  Multiply   |
|   Tables    |
 -------------  $1000
| LORES pg 3  |
 -------------  $0c00
| LORES pg 2  |
 -------------  $0800
| LORES pg 1  |
 -------------  $0400
|free/vectors |
 -------------  $0200
|    stack    |
 -------------  $0100
|   zero pg   |
 -------------  $0000
\end{BVerbatim}
\end{scriptsize}
\end{center}
\caption{Memory Map (not to scale)\label{fig:map}}
\end{figure}

\appendix

\input{dram_notes}


\end{document}