\documentclass[twocolumn]{article} \usepackage{graphicx} \usepackage{url} \usepackage{hyperref} \usepackage{fancyvrb} \begin{document} \title{Making an 8k Low-resolution Graphics Demo for the Apple II} \author{by DEATER, AKA Vincent M. Weaver} \date{} \maketitle \section{Why would anyone do this?} While making an inside-joke filled game for my retro system of choice, the Apple II, I needed to create a Final-Fantasy-esque flying-over-the-planet sequence. I was originally going to fake this, but why fake graphics when you can laboriously spend weeks implementing the effect for real. It turns out the Apple II is just barely capable of generating the effect in real time. Once I got the code working I realized it would be great as part of a graphical demo, so off on that tangent I went. This went well, despite the fact that all I knew about the demoscene I had learned from a few viewings of the Future Crew {\em Second Reality} demo combined with dimly remembered Commodore 64 and Amiga usenet flamewars. % from a few decades ago. % This started out as some SNES style mode7 pseudo-3d graphics code % I came up with while working on my TF7 game. The graphics looked % pretty cool, so I started developing a demo around it. %To make thins even better, the code ended up being roughly around 8kB so a %lot of time was wasted fitting it under that arbitrary size limitation. While I hope you enjoy the description of the demo and the work that went into it, I suspect this whole enterprise is primarily of note due to the dearth of demos for the Apple II platform. %So in the end this ends up being impressive mostly because so few people %have bothered to write demos for this particular platform. If you are truly interested in seeing impressive Apple II demos, I would like to make a shout out to FrenchTouch whose works put this one to shame. % The codesize ended up being roughly around 8kB, so I thought I'd % make it into an 8k demo. There aren't many out there for the Apple II. % and a Mockingboard sound card. % The demo tries to hit the lowest common denominator for Apple II systems, % so in theory you could have run this on an Apple II in 1977 if you % were rich enough to afford 48k of RAM. The Mockingboard sound wasn't % available until 1981, but still this all predates the Commodore 64. %I was writing a game for the Apple II and realized I had come up with %some clever Super-Nintendo (SNES) style graphics routines that were just %crying to be turned into a demo-scene style demo. %The Apple II was the first computer I had access too, and I grew up in an odd %neighborhood where it was all Apples and not a Commodore to be seen. %My family long ago got rid of our machine, but I rescued an Apple IIe platinum %from the dumpster one day and have dragged it from state to state ever since. %I find 6502 assembly to be oddly therapeutic, and will code in it when other %projects become too stressful. Especially when Linux up and hangs on me %because firefox tried to do something stupid in javascript. I then pine for %the days when you could do something useful in 64k of RAM, and not have your %machine fall over because somehow 4GB is not enough. %Background: %The Apple II was the first computer I programmed on, lo many years ago. %Mostly in Applesoft BASIC (which ended up being the only Microsoft product %I ever liked) but I was starting to get into assembly language about the %time my family got a 386 system. %I've revisited over the years, with some 6502 programming to show I could. %My skills were not that great, I had one of my size-optimization projects %crowd re-optimized. For a while I had a side-gig re-optimizing modern games %in BASIC, before getting sidetracked into going full in on 6502 assembly %again. %Introduced in 1977. %The Apple II runs at 1.XX check Megahertz. 6502, which can easily %address 64 kB of RAM (more with bank switching). Shipped with as little %as 4kB of RAM. Three registers, (A,X,Y) but a large ``zero page'' which %gives you register-like actions on the first 256 bytes of RAM. % %DOS3.3 operating system with 140k floppies. Amazing programming by Wozniak, %allowing all kinds of floppy protection shenanigans (cite 4am, previous %article). \section{The Hardware} The Apple II was introduced in 1977. In theory this demo will run on hardware this old, although I do not have access to a system of that vintage. I like to troll Commodore fans by noting this predates the Commodore 64 by five years. \vspace{1ex} \noindent {\bf CPU, RAM and Storage:} The Apple II has a 6502 processor running at roughly 1.023MHz. Early models only shipped with 4k of RAM, but later 48k, 64k, and 128k systems were common. While the demo itself fits in 8k, it decompresses to a larger size and uses a full 48k of RAM; this would have been very expensive in 1977. See Figure~\ref{fig:map} for a diagram of the memory map. Also in 1977 you would probably be loading this from cassette tape. It would be another year before Woz's single-sided $5\frac{1}{4}$" Disk II came about (eventually offering 140k of storage per side with the release of Apple DOS3.3 in 1980). \vspace{1ex} \noindent {\bf Sound:} The only sound available in a stock Apple II is a bit-banged speaker. There was no timer interrupt; if you wanted music you had to cycle-count via the CPU to get the waveforms you needed. The demo uses a Mockingboard soundcard which was introduced in 1981. This board contains dual AY-3-8910 sound generation chips connected via 6522 I/O chips. Each sound chip provides 3 channels of square waves as well as noise and envelope effects. \vspace{1ex} \noindent {\bf Graphics:} It is hard to imagine now, but the Apple II had nice graphics for its time. Compared to later competitors, however, it had some limitations. \begin{center} \begin{tabular}{|c|c|} \hline Hardware Sprites & No \\ User-defined charset & No \\ Blanking interrupts & No \\ Palette selection & No \\ Linear framebuffer & No \\ Hardware scrolling & No \\ Hardware page flip & Yes \\ \hline \end{tabular} \end{center} The hi-res graphics mode is a complex mess of NTSC hacks by Woz. You get approximately 280x192 resolution, with 6 colors available. The colors are NTSC artifacts with limitations on which colors can be next to each other (in blocks of 3.5 pixels). There is plenty of fringing on edges, and colors change depending on whether they are drawn at odd or even locations. To add to the madness, the framebuffer is interleaved in a complex way, and pixels are drawn least-significant-bit first (all of this to make DRAM refresh better and to shave a few 7400 series logic chips from the design). You do get two pages of graphics, Page 1 is at {\tt \$2000}\footnote{On 6502 systems hexadecimal values are indicated by the dollar sign} and Page 2 at {\tt \$4000}. Optionally 4 lines of text can be shown at the bottom of the screen instead of graphics. The lo-res mode is a bit easier to use. It provides 40x48 blocks, reusing the same memory as the 40x24 text mode. (As with hi-res you can switch to a 40x40 mode with four lines of text displayed at the bottom). Fifteen colors are available (there are two greys which are indistinguishable). Again the addresses are interleaved in a non-linear fashion. Lo-res Page 1 is at {\tt \$400} and Page 2 is at {\tt \$800}. Some amazing effects can be achieved by cycle counting, reading the floating bus, and racing the beam while toggling graphics modes on the fly. Unfortunately for you this demo does not do any of those things so you will not be reading about that today. %Later models added double low-res (80x48) and double hi-res (x y in %NTSC 15 color) but didn't appear until 198x, and only on later IIe, IIc %models. %Apple also came out with the IIgs which arguably was much more advanced %and cheaper than the Mac, but Apple cancelled the II line much to the %sadness of the users (Apple II forever). \section{Development Setup} I do all of my coding under Linux, using the nano text editor. I use the ca65 assembler from the cc65 project, which I find to be a reasonable tool although many ``real'' Apple II programmers look down on it for some reason. I cross-compile the code, constructing Apple DOS3.3 disk images using custom tools I have written. I test using emulators: AppleWin (run under the wine emulator) is the easiest to use, but until recently MESS/MAME had cleaner sound. Once the code appears to work, I put it on a USB stick and transfer to actual hardware using a CFFA3000 disk emulator installed in the actual Apple II (an Apple IIe platinum edition). %\section{Related Work} % %See anything by the group FrenchTouch, whose Apple II demos outclass %mine by a lot. % http://www.deater.net/weave/vmwprod/mode7_demo/ \begin{figure}[tb] \begin{center} \includegraphics[width=2in]{figures/hidden_vmw.png} \end{center} \caption{VMW logo hidden in the executable data.\label{fig:vmw}} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics[width=\columnwidth]{figures/mode7_demo_title.png} \end{center} \caption{The title screen.\label{fig:title}} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics[width=\columnwidth]{figures/m7_screen1.jpg} \caption{Bouncing ball on infinite checkerboard.\label{fig:ball}} \end{center} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics[width=\columnwidth]{figures/m7_screen4.jpg} \caption{Spaceship flying over an island.\label{fig:tb1}} \end{center} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics[width=\columnwidth]{figures/m7_screen3.jpg} \end{center} \caption{Spaceship with starfield.\label{fig:stars}} \end{figure} \begin{figure}[tb] \begin{center} \includegraphics[width=\columnwidth]{figures/m7_screen2.jpg} \end{center} \caption{Rasterbars, stars, and credits. Stealth Susie was a particularly well-traveled guinea pig. \label{fig:credits}} \end{figure} \section{The Demo} \subsection{BOOTLOADER} An Applesoft BASIC ``HELLO'' program loads the binary automatically at bootup. This does not count towards the executable size, as you could manually BRUN the 8k machine-language program if you wanted. To make the loading time slightly more interesting the HELLO program enables graphics mode and loads the program to address {\tt \$2000} (hi-res page1). This causes the display to filled with the colorful pattern corresponding to the compressed image. This conveniently fills all 8k of the display RAM, or would have if we had POKEd the right soft-switch to turn off the bottom 4 lines of text. Upon loading, execution starts at address {\tt \$2000}. \subsection{DECOMPRESSION} The binary is encoded with the LZ4 algorithm. We flip to hi-res Page 2 and decompress to this region so the display now shows the executable code. The 6502 size-optimized LZ4 decompression code was written by qkumba (Peter Ferrie). % http://pferrie.host22.com/misc/appleii.htm The program and data decompress to around 22k starting at {\tt \$4000}. This over-writes parts of DOS3.3, but since we will not be using the disk any more this is not an issue. If you look carefully at the upper left corner of the screen during decompress you will see my triangular logo, which is supposed to evoke my VMW initials (see Figure~\ref{fig:vmw}). To do this I had to put the proper bit pattern inside the code at the interleaved addresses of {\tt \$4000}, {\tt \$4400}, {\tt \$4800}, and {\tt \$4C00}. The image data at {\tt \$4000} maps to (mostly) harmless code so it is left in place and executed. Making this work turned out to be more trouble than it was worth, especially as the logo is not visible in the MP4 capture of the demo (the movie compression does not handle screens full of seemingly random noise well). The demo was optimized to fit in 8k. Optimizing code inside of a compressed image is much more complicated than regular size optimization. Removing instructions sometimes makes the binary {\em larger} as it no longer compresses as well. Long runs of values (such as 0 padding) are essentially free. This mostly turned into an exercise of guess-and-check until everything fit. \subsection{TITLE SCREEN} Once decompression is done, execution continues at address {\tt \$4000}. We switch to low-res mode for the rest of the demo. \noindent {\bf FADE EFFECT}: The title screen fades in from black. This is a software hack as the Apple II does not have palette support. The image is loaded to an off-screen buffer and a lookup table is used to copy in the faded versions on the fly. \noindent {\bf TITLE GRAPHICS}: The title screen is shown in Figure~\ref{fig:title}. The image is run-length encoded (RLE) which is probably unnecessary in light of it being further LZ4 encoded. (The LZ4 compression was a late addition to this endeavor). Why not save some space and just load our demo at {\tt \$400} and negate the need to copy the image in place? Remember the graphics are 40x48 (shared with the text display region). It might be easier to think of it as 40x24 characters, with the top / bottom 4-bits of each ASCII character being interpreted as colors for a half-height block. If you do the math you will find this takes 960 bytes of space, but the memory map reserves 1k for this mode. There are ``holes'' in the address range that are not displayed, and various pieces of hardware can use these as scratchpad memory. This means just overwriting the whole 1k with data might not work out well unless you know what you are doing. To this end our RLE decompression code skips the holes just to be safe. \noindent {\bf SCROLL TEXT}: The title screen has scrolling text at the bottom. This is nothing fancy, the text is in a buffer off screen and a 40x4 chunk of RAM is copied in every so many cycles. You might notice that there is tearing/jitter in the scrolling even though we are double-buffering the graphics. Sadly there is not a reliable cross-platform way to get the VBLANK info on Apple II machines, especially the older models. This is even more noticeable in the recorded video, as the capture card and movie encoding conspire to make this look worse than things look in person. \subsection{MOCKINGBOARD MUSIC} No demo is complete without some exciting background music. I like chiptune music, especially the kind written for AY-3-8910 based systems. During the long time waiting for my Mockingboard hardware to arrive I designed and built a Raspberry Pi chiptune player that uses essentially the same hardware. This allowed me to build up some expertise with the software/hardware interface in advance. The song being played is a stripped down and re-arranged version of ``Electric Wave'' from CC'00 by EA (Ilya Abrosimov). Most of my sound infrastructure involves YM5 files, a format commonly used by ZX Spectrum and ATARI ST users. The YM file format is just AY-3-8910 register dumps taken at 50Hz. To play these back one sets up the sound card to interrupt 50 times a second and then writes out the 14 register values from each frame in an interrupt handler. % To program the Mockingboard, each AY-3-8910 chip has 14 sound related % registers that control the 3 channels. Each AY chip has a dedicated % VIA 6522 parallel I/O chip that handles the I/O. Writing out the registers quickly enough is a challenge on the Apple II. For each register you have to do a handshake then set both the register number and the value. It is hard to do this in less than forty 1MHz cycles for each register. With complex chiptune files (especially those written on an ST with much faster hardware) it is sometimes not possible to get exact playback due to the delay. Further slowdown happens as you want to write both AY chips (the output is stereo, with one AY on the left and one on the right). To help with latency on playback we keep track of the last frame written and only write to the registers that have changed. % I have a whole suite of code for manipulating YM sound data, in my % vmw-meter git repository. Our code detects the Mockingboard at startup; we are lazy and only support finding the card in Slot 4 (which is a fairly typically location). % The first step for getting this to work is detecting if a Mockingboard is %% there. This can be in any slot 1-7 on the Apple II, though typically % Slot 4 is standard (in this demo we only check slot 4). The board is initialized, and then one of the 6522 timers is set to interrupt at 25Hz. % (it has to be an on-board timer as the default % Apple II has no timers). Why 25Hz and not 50Hz? At 50Hz with 14 registers you use 700 bytes/s. So a 2 minute song would take 84k of RAM, which is much more than is available. To allow the song to fit in memory (without a fancy circular buffer decompression routine utilized in my VMW Chiptune music-disk demo) we have to reduce the size. First the music is changed so it only needs to be updated at 25Hz. Then the register data is compressed from 14 bytes to 11 bytes by stripping off the envelope effects and packing together fields that have unused bits. In the end the sound quality suffered a bit, but we were able to fit an acceptably catchy chiptune inside of our 8k payload. \subsection{MODE7 BACKGROUND} ``Mode7'' is a Super Nintendo (SNES) graphics mode that takes a tiled background and transforms it by rotating and scaling. The most common effect squashes the background out to the horizon, giving a three-dimensional look. The SNES did these transforms in hardware, but our demo must do them in software. % As found on Wikipedia, the transform is of the type % % [x'] = [a b]([x]-[x0])+[x0] % [y'] [c d]([y] [y0]) [y0] % http://www.helixsoft.nl/articles/circle/sincos.htm %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Our algorithm is based on code by Martijn van Iersel. It iterates through each horizontal line on the screen and calculates the color to output based on the camera height ({\em spacez}) and {\em angle} as well as the current x and y coordinates ({\em cx} and {\em cy}). First the distance {\em d} is calculated based on fixed scale and distance-to-horizon factors. Instead of a costly division we use a pre-generated lookup table for this. \[d = \frac{z \times yscale}{y+horizon}\] Then calculate the horizontal scale (distance between points on this line): \[h = \frac{d}{xscale}\] Then calculate delta x and delta y values between each block on the line. We use a pre-computed sine/cosine lookup table. \[dx = -sin(angle) \times h\] \[dy = cos(angle) \times h\] The leftmost position in the tile lookup is calculated: \[tilex = cx + (d*cos(angle) - (width/2) * dx\] \[tiley = cy + (d*sin(angle) - (width/2) * dy\] Then an inner loop happens that adds dx and dy as we lookup the color from the tilemap (just a wrap-around array lookup) for each block on the line. \[color = tilelookup(tilex,tiley)\] \[plot (x, y) \] \[tilex += dx, tiley+= dy\] \noindent {\bf Optimizations:} The 6502 processor cannot do floating point, so all of our routines used 8.8 fixed point math. We eliminated all of the division, and converted as much as possible to use lookup tables (which involved limiting the heights and angles a bit). We also saved some cycles here and there by using self-modifying code, most notably hard-coding the height (z) value and modifying the code if this is changed. The code started out only capable of roughly 4.9fps in 40x20 resolution and in the end we improved this to 5.7fps in 40x40 resolution. Care was taken to optimize the innermost loop, as every cycle saved there results in 1280 cycles saved overall. \noindent {\bf Fast Multiply:} One of the biggest bottlenecks in the mode7 code was the multiply. Even our optimized algorithm calls for at least seven 16bit x 16bit = 32bit multiplies, something that is {\em really} slow on the 6502. A typical implementation takes around 700 cycles for a 8.8 x 8.8 fixed point multiply. We improved this by using the fast multiply algorithm described by Stephen Judd. This works by noting that \[(a+b)^{2} = a^{2}+2ab+b^{2}\] and \[(a-b)^{2}=a^{2}-2ab+b^{2}\] If you subtract these you can simplify to \[a\times b =\frac{(a+b)^{2}}{4} - \frac{(a-b)^2}{4}\] For 8-bit values if you create a table of squares from 0 to 511 (all 8-bit a+b and a-b fall in this range) then you can convert a multiply into two table lookups plus a subtract. This does have the downside of requiring 2kB of square lookup tables (which can be generated at startup) but it reduces the multiply cost to the order of 250 cycles or so. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{BALL ON CHECKERBOARD} The first Mode7 scene transpires on an infinite checkerboard. A demo would be incomplete without some sort of bouncing geometric solid, in this case we have a pink sphere. The sphere is represented by 16 sprites that were captured from a 20 year old OpenGL game engine. Screenshots were taken then reduced to the proper size and color limitations. The shadows are also just sprites. Note that the Apple II has no dedicated sprite hardware, so these are drawn completely in software. The clicking noise on bounce is generated by accessing the speaker port at address {\tt \$C030}. This gives some sound for those viewing the demo without the benefit of a Mockingboard. \subsection{TFV SPACESHIP FLYING} This next scene has a spaceship flying over an island. The Mode7 graphics code is generic enough that only one copy of the code is needed to generate both the checkerboard and island scenes. The spaceship, water splash, and shadows are all sprites. The path the ship takes is pre-recorded; this is adapted from the Talbot Fantasy~7 game engine with the keyboard code replaced by a hard-coded script of actions to take. \subsection{STARFIELD} The spaceship now takes to the stars. This is typical starfield code, where on each iteration the x and y values are changed by \[dx=\frac{x}{z}, dy=\frac{y}{z}\] In order to get a good frame rate and not clutter the lo-res screen only 16 stars are modeled. To avoid having to divide, the reciprocal of all possible z values are stored in a table, and the fast-multiply routine described previously is used. The star positions require random number generation, but there is no easy way to quickly get random data on the Apple II. Originally we had a 256-byte blob of pre-generated ``random'' values included in the code. This wasted space, so now instead we just use our code at address at \$5000 as if it were a block of random numbers. This was arbitrarily chosen, and it is not as random as it could be as seen when the ship enters hyperspace and the lower-right quadrant is distressingly star-free. A simple state machine controls star speed, ship movement, hyperspace, background color (for the blue flash) and the eventual sequence of sprites as the ship vanishes into the distance. \subsection{RASTERBARS/CREDITS} Once the ship has departed, it is time to run the credits as the stars continue to fly by. The text is written to the bottom four lines of the screen, seemingly surrounded by graphics blocks. Mixed graphics/text is generally not be possible on the Apple II, although with careful cycle counting and mode switching groups such as FrenchTouch have achieved this effect. What we see in this demo is the use of inverse-mode (inverted color) space characters which appear the same as white graphics blocks. The rasterbar effect is not really rasterbars, just a colorful assortment of horizontal lines drawn at a location determined with a sine lookup table. Horizontal lines can take a surprising amount of time to draw, but these were optimized using inlining and a few other tricks. The spinning text is done by just rapidly rotating the output string through the ASCII table, with the clicking effect again generated by hitting the speaker at address {\tt \$C030}. The list of people to thank ended up being the primary limitation to fitting in 8kB, as unique text strings do not compress well. I apologize to everyone whose moniker got compressed beyond recognition, and I am still not totally happy with the centering of the text. \section{Obtaining the Code} More details, disk image, and full source can be found at the website: \url{http://www.deater.net/weave/vmwprod/mode7_demo/} %\section{Appendix: Memory Map} \begin{figure} \begin{center} \begin{scriptsize} \begin{BVerbatim} ------------- $ffff | ROM/IO | ------------- $c000 | | | Uncompressed| | Code/Data | | | ------------- $4000 | Compressed | | Code | ------------- $2000 | free | ------------- $1c00 | Scroll | | Data | ------------- $1800 | Multiply | | Tables | ------------- $1000 | LORES pg 3 | ------------- $0c00 | LORES pg 2 | ------------- $0800 | LORES pg 1 | ------------- $0400 |free/vectors | ------------- $0200 | stack | ------------- $0100 | zero pg | ------------- $0000 \end{BVerbatim} \end{scriptsize} \end{center} \caption{Memory Map (not to scale)\label{fig:map}} \end{figure} \end{document}