Challenges found writing an Apple II chiptune player
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

by DEATER (Vince Weaver, vince@deater.net)
http://www.deater.net/weave/vmwprod/chiptune/
====================================================

GOAL:
~~~~~

The goal is to design a chiptune player that can play large (150k+
uncompressed) chiptune files on an Apple II with 48k of RAM and a
Mockingboard sound card.

In theory you could have had an Apple II with 48k in 1977 (if you were
rich) and Mockingboards came around 1981, so this all predates the
Commodore 64.

USING:
~~~~~~

Boot the disk on a real system, or on an emulator with Mockingboard
support. AppleWin works fine (even under Wine on Linux). MESS does
too; it's harder to set up (ROMs) but the audio sounds clearer.

Space pauses, Left/Right arrow switches songs.

You can load up your own YM5 files. Convert them with the "ym5_to_krw"
utility found in the repository
    https://github.com/deater/vmw-meter/
Copy the resulting files to the disk image, and edit the filenames in
chiptune.s (sorry, I don't have code that CATALOGs automatically. TODO?)

HARDWARE:
~~~~~~~~~

Sound
=====

The Mockingboard card has two AY-3-8910 chips, each interfaced with a
VIA 6522 I/O chip. The 6522 more or less acts as a GPIO expander, plus
provides programmable timer interrupts (something the Apple II lacks).

The AY-3-8910 chip provides three channels of square waves, plus noise.
There is also a (global) envelope generator (though it's typically not
used that much).

With two AY-3-8910s on the card you can have up to six channels of
sound (3 on right, 3 on left).

Processor
=========

The Apple II has a 6502 processor running at 1.023 MHz.

RAM
===

You could get Apple IIs with as little as 4k of RAM. Eventually models
with 48k, 64k and 128k were popular, but due to I/O and ROM constraints
you had to do bank switching to access more than 48k.

DISK
====

The typical 5 1/4" floppy was single sided and by the time of DOS3.3
held 140k of data.
Roughly 16k was used by DOS though, if you wanted a bootable disk.

There are all kinds of ways you can cheat and extend this, as well as
using a "real" O/S like ProDOS. However, growing up all I ever really
used was DOS3.3, so I'm using it for the sake of tradition.

Also, if you want to run DOS3.3 then RAM from $9600 up to $C000 is used
by the O/S. For this project I use stock DOS3.3 so we lose that amount
of RAM (almost 11k).

SOUND DATA:
~~~~~~~~~~~

The AY-3-8910 chips are very flexible and can be programmed in a wide
variety of ways.

I'm attempting to play YM files, which are chiptune files popular in
the Atari and Spectrum communities. These are RAW register dumps: 50
times a second (they tend to be European) the contents of the 14
AY-3-8910 registers are written.

A raw data stream is 700 bytes (50*14) a second, so 42k per minute.
This means holding a raw, uncompressed data stream in RAM becomes a
challenge.

COMPRESSION:
~~~~~~~~~~~~

The register values tend to be repetitive so they compress well. This
is especially true if you interleave the files (have all of the
register 0 data in a row, followed by all the data for register 1,
etc.). Interleaved data is a lot harder to play, but you can get
compression ratios of over 10 times; see the chart at the end of this
document.

In addition, the file data can be compressed even more if you notice
unused bits in the data. For example, the period data is only 12 bits
for each channel, so the upper bits of those registers go unused. Also
many songs do not use the envelope feature at all, freeing up 3 bytes.
So custom compression that can make assumptions about the sound format
can free up many bytes even in a raw register dump format.

A typical ym5 file is compressed with LHA compression, which isn't
practical to decompress on the Apple II. The LZ4 algorithm is nearly as
good and has existing 6502 implementations which can be adapted.
It isn't really a streaming algorithm though, so it is hard to
decompress only a chunk of the file at a time. Usually you need to
decompress the whole file at once (the format works by referencing
byte sequences from earlier decompressed data). This is especially
troublesome with interleaved files: although they compress really
well, you end up decompressing all of the register-0 data before you
get to register-1, so with limited RAM you have to change how you deal
with things.

KRW File Format
~~~~~~~~~~~~~~~

I ended up creating yet another sound file format, and wrote a
converter that can convert YM5 files to this KRW format.

The format assumes you take the raw interleaved data, and then break it
up into 768 byte * 14 register (10.5k) chunks. These chunks are
compressed independently and concatenated together. The player then
decompresses these chunks one by one as it plays through the song.

The compression ratio is not as good as compressing the entire file,
but it allows most reasonable-length ym5 files to be played.

The format is as follows:

    3 bytes   Magic Number    KRW
    1 byte    Skip Value      Bytes to skip to get to first LZ4 data
    1 byte    Title Center    Spaces to print to center on 40col
    X bytes   Title String    0-terminated ASCII title of song
    1 byte    Author Center
    X bytes   Author String
    1 byte    Time Center
    14 bytes  Time String     " 0:00 / M:SS\0" with length filled in

    Repeated block data
    2 bytes   Chunk Length    Little Endian size of LZ4 block
    X bytes   LZ4 data

    After last block, a value of 0/0 indicates end

For proper end-of-song detection, the file data should be interleaved
and the data at the end should be padded with all $FF characters.
End of song is detected by an FF in register[1], which in theory is not
possible in a valid register dump.

PLAYING THE SONG
~~~~~~~~~~~~~~~~

An interrupt routine wakes at 50Hz to write the registers and do a few
other housekeeping things.

We load the KRW file totally into RAM before playing.
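Once the file is in RAM, the header and chunk list from the KRW File
Format section can be walked with very little logic. The following is
only a rough Python sketch of that walk, not the player's actual code.
The names (parse_krw, centered_string) are made up for illustration, it
assumes the author string is 0-terminated like the title, and it reads
the Skip Value without interpreting it.

```python
def parse_krw(data):
    """Split a KRW image into header fields and still-compressed LZ4 chunks."""
    if data[0:3] != b"KRW":
        raise ValueError("missing KRW magic number")
    pos = 3
    skip = data[pos]                    # Skip Value: read, but not used here
    pos += 1

    def centered_string(pos):
        center = data[pos]              # spaces to print to center on 40col
        end = data.index(0, pos + 1)    # ASSUMPTION: string is 0-terminated
        return center, data[pos + 1:end].decode("ascii"), end + 1

    title_center, title, pos = centered_string(pos)
    author_center, author, pos = centered_string(pos)
    time_center = data[pos]
    pos += 1
    time_string = data[pos:pos + 14]    # fixed 14 bytes per the format table
    pos += 14

    chunks = []
    while True:
        length = int.from_bytes(data[pos:pos + 2], "little")
        pos += 2
        if length == 0:                 # 0/0 after the last block marks the end
            break
        chunks.append(data[pos:pos + length])
        pos += length

    return {"skip": skip, "title": title, "author": author,
            "time": time_string, "chunks": chunks}
```

Each entry in "chunks" would then be handed to the LZ4 decoder one at a
time as playback reaches it.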
The Disk II controller designed by Woz is amazing, but it is timing
sensitive, so interrupts are disabled when loading from disk.

We have to have room in RAM for the player (4k), the KRW file (16k) and
the current uncompressed data (14k). See the memory map diagram at the
end.

We also have some visualization going on that plots the amplitude of
the three channels, plus a rasterbar type thing going on in the
background. Originally the graphics were done at full speed in a loop
outside the interrupt handler, but as we'll see, glitchy audio forced
us to do some hackish things.

The actual player is fairly simple: it just reads the interleaved data
by striding through memory and writes the values out to the registers.
A frame only takes maybe 2400 or so cycles.

I ended up creating a 3-phase state machine to handle co-ordinating the
three modes:
    A: playing chunk 1 while copying chunk 3 data to extra buffer
    B: just playing chunk 2
    C: playing from extra buffer while decoding next LZ4 block to 1-2-3

I track these in one variable, with the states in the high bits
($80, $40, $20). The BIT instruction lets us easily check for these and
a ROL instruction easily switches between the states.

CHALLENGES:
~~~~~~~~~~~

The primary problem is that decompression takes a while, much longer
than the 20ms between 50Hz interrupts. It turns out the default LZ4
decoder from qkumba can often take upwards of 700ms, leading to a long
pause in the playback.

First Attempt
=============

My first attempt to work around this was to load the 3 chunks of data
as in the naive approach, but in the background copy chunk 3 elsewhere
in RAM, and then play from the copied RAM while decompressing the next
LZ4 block in the background.

This first attempt almost worked. It tried to split up the LZ4
decompression into 1/256th pieces spread across the last chunk being
played, but LZ4 is too irregular for that. Some file-chunks decompress
in irregular ways that don't split up well.
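The one-variable state machine described under PLAYING THE SONG can be
modelled in a few lines. This is a Python sketch rather than 6502 code,
and it makes two guesses that the text does not confirm: that the modes
map to A=$20, B=$40, C=$80 (so that shifting left moves A to B to C),
and that the variable is reloaded with $20 once the bit rotates out.
The helper names are made up for illustration.

```python
STATE_A = 0x20   # ASSUMED mapping: A = $20, B = $40, C = $80
STATE_B = 0x40
STATE_C = 0x80

def rol(value, carry_in=0):
    """8-bit 6502-style ROL: rotate left through carry.

    Returns (result, carry_out)."""
    result = ((value * 2) | carry_in) & 0xFF
    carry_out = (value >> 7) & 1
    return result, carry_out

def next_state(state):
    """Advance A -> B -> C -> A. The reload after C is an assumed wrap."""
    state, carry = rol(state)
    if carry:                # the state bit fell out of the top of the byte
        state = STATE_A
    return state

def bit_flags(state):
    """Flags the BIT instruction copies from the tested byte: N=bit 7, V=bit 6."""
    return {"N": bool(state & 0x80), "V": bool(state & 0x40)}
```

With states in the top bits, mode C shows up directly in the N flag and
mode B in the V flag after a single BIT, so the handler can branch on
them without loading and comparing the value.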
Second Attempt
==============

One 256-interrupt chunk of data takes about 5s to play, and no data
chunk seems to take more than 1s to decode. So we can just cheat and
move the graphics code into the interrupt, and have the decoding happen
in non-interrupt space.

This will work for the chiptune player, but it's not going to work well
for something like a video game where you are truly trying to have the
music playing unattended in the background (unless your music consists
only of 15s loops).

FITTING ONTO DISK
~~~~~~~~~~~~~~~~~

The Apple II DOS3.3 filesystem uses 256 byte blocks. Each file has at
least one 256 byte Track/Sector list sector (and takes an additional
one for each 28k or so of filesize).

DOS itself reserves the first 3 tracks (12k) and in theory the catalog
reserves an entire track (4k) to hold file info (although you only need
one 256 byte sector per 7 files).

In addition you usually have a "HELLO" BASIC file that runs at boot,
which is going to take at least 512 bytes.

So even though the Disk II / DOS3.3 can in theory hold 140k, after
DOS (12k), the Catalog track (4k), HELLO (512 bytes), and our chiptune
player (4k) we have 24.5k of overhead, with 115.5k free (462 blocks).

The layout of our disk packed to the max with KRW files can be seen in
the Figure at the end. We do manage to fit over 30 minutes of music on
one disk. It would fit a lot more if we had simple songs that
compressed better rather than the complex chiptune examples I picked.

MEMORY LAYOUT
~~~~~~~~~~~~~

As can be seen from the memory map below, if we assume our player can
fit in 4k we have roughly from $2000 to $9600 for memory. That's $7600
(29.5k).

If we could single buffer, we could have 256*3*14 (10.5k) for
decompression and 19k for the file, which would let us play most of
the reasonable sized songs on our playlist (KRW(3) in the table at the
end).

For double buffering we need 256*2*14*2 (14k) for decompression and
16k for the file, which still works.
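The buffer arithmetic above is easy to re-check. The sketch below is
plain Python used as a calculator. The variable names are mine, and the
three song sizes are copied from the KRW(3) and KRW(2) columns of the
table at the end.

```python
REGISTERS = 14          # AY-3-8910 registers per frame
FRAMES_PER_BLOCK = 256  # one 256-interrupt block

free_bytes = 0x9600 - 0x2000                          # $7600 of usable RAM
single_buffer = FRAMES_PER_BLOCK * 3 * REGISTERS      # 256*3*14
double_buffer = FRAMES_PER_BLOCK * 2 * REGISTERS * 2  # 256*2*14*2

room_single = free_bytes - single_buffer   # bytes left for a KRW(3) file
room_double = free_bytes - double_buffer   # bytes left for a KRW(2) file

# (KRW(3) size, KRW(2) size) in bytes, from the table at the end
songs = {
    "LYRA2": (9826, 14418),
    "NEURO": (22376, 25168),
    "AXELF": (47989, 54420),
}

def fits(name):
    """Return (fits single-buffered, fits double-buffered) for a song."""
    krw3, krw2 = songs[name]
    return room_single >= krw3, room_double >= krw2

for label, value in [("free", free_bytes), ("single", single_buffer),
                     ("double", double_buffer), ("room single", room_single),
                     ("room double", room_double)]:
    print(f"{label:12s} {value:6d} bytes  {value / 1024:5.1f}k")
```

By this count the double-buffered case leaves 15.5k for the file, which
the text rounds to 16k. LYRA2 fits either way, while NEURO and AXELF
are too big for both layouts.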
Memory Map
(not to scale)

 -------  $ffff
| ROM/IO|
 -------  $c000
|DOS3.3 |
 -------  $9600
|       |
|       |
| FREE  |
|       |
|       |
 -------  $0c00
|GR pg 1|
 -------  $0800
|GR pg 0|
 -------  $0400
|       |
 -------  $0200
|stack  |
 -------  $0100
|zero pg|
 -------  $0000

                             Sizes (bytes)
                 time    ym5  KRW(3)  KRW(2)  Blocks(3)  Disk
                 ~~~~    ~~~  ~~~~~~  ~~~~~~  ~~~~~~~~~  ~~~~
KORO.KRW         0:54      ?    2740    3039         12    12
FIGHTING.KRW     1:40      ?    3086    3316         14    14
CAMOUFLAGE.KRW   1:32   1162    4054    4972         17    17
DEMO4.KRW        2:05   1393    4061    6336         17    17
SDEMO.KRW        2:12   1635    5266    7598         22    22
CHRISTMAS.KRW    1:32   1751    4975    5811         21    21
SPUTNIK.KRW      2:05   2164    8422   10779         34    34
DEATH2.KRW       2:27   2560    8064   10295         33    33
CRMOROS.KRW      1:29   2566    8045    9565         33    33
TECHNO.KRW       2:23   2630    8934   11126         36    36
WAVE.KRW         2:52   2655    8368   11318         34    34
LYRA2.KRW        3:04   2870    9826   14418         40    40
INTRO2.KRW       2:59   3217    9214    9294         37    37
ROBOT.KRW        1:26   3448    7724    8337         32    32
UNIVERSE.KRW     1:49   4320    9990   11225         41    41
NEURO.KRW        3:47   8681   22376   25168         89
AXELF.KRW       10:55   9692   47989   54420        189
                -----                                   -----
                30:29                                     423

(Totals cover the songs that have a Disk entry.)

Notes: my home-made songs don't have ym5 sizes as I don't have a
       working LHA encoder to make a real size.

Interesting bugs that were hard to debug:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+ Bug in qkumba's LZ4 decoder. It only happened when a copy-block size
  was exactly a multiple of 256, in which case it would copy an extra
  time.

+ Bug where the box-drawing was starting at 0 rather than at Y. Turns
  out I was padding the filename buffer with A0 but going one too far,
  so it was writing A0 to the first byte of the hlin routine, and A0 is
  a LDY # instruction.

Current problem: taking longer than 5s to decode a file.
Thought it only took 1s max? Not in the face of interrupts.

Every 20,000 cycles an interrupt:
     1,500 for music
     7,500 for volume bars
    16,200 (!) for raster bars
     2,000 for misc rest
Roughly 13,000 cycles, leaving only 7,000 to userspace.

If it takes 700,000 cycles to decode a block, that will take 100
interrupt periods, or 2s to finish? That should be doable. Why does it
instead take 15? Can play fine if I turn the raster bars off.

TIME_TAKEN ($88) stores how long each decode took.
INTRO2: 60@19, 60@36, 62@50, 61@1:03, 61@1:32, 60@2:05, 61@2:32
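
The back-of-the-envelope estimate in the cycle budget notes above can
be written as a tiny Python model. It is only an illustration: the
function name is made up, it assumes the decoder gets exactly the
cycles the interrupt handler leaves over in each 20,000 cycle period,
and the 19,000 cycle handler cost in the second call is a hypothetical
value, not a measurement.

```python
import math

FRAME_CYCLES = 20_000   # cycles between 50Hz interrupts, as in the notes

def decode_seconds(decode_cycles, handler_cycles,
                   frame_cycles=FRAME_CYCLES, hz=50):
    """Wall-clock seconds for a decode that only runs between interrupts."""
    if handler_cycles >= frame_cycles:
        return math.inf      # handler uses the whole period; decode starves
    free = frame_cycles - handler_cycles
    frames = -(-decode_cycles // free)   # integer ceiling division
    return frames / hz

print(decode_seconds(700_000, 13_000))   # 7,000 free cycles per period
print(decode_seconds(700_000, 19_000))   # hypothetical heavier handler
```

The first call reproduces the 2s estimate from the notes. The second
shows how quickly the decode time grows once the handler leaves only
1,000 cycles per period (it comes out to 14s), and a handler that costs
more than the whole period never lets the decode finish at all.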