Fixup player timings and opcode variants for 65c02 timings since JMP (indirect) takes 6 cycles not 5!

It should be possible to also accommodate 6502 timings in a followup. h/t to Scott Duensing who noticed that my sample audio sounded "a tad slow", which turned out to be due to this 1-cycle difference (which added up to almost an extra minute of playback on an 8-minute song). Add comments and tidy up the code a bit. Flesh out README some more.
2024-06-25 22:29:29 +00:00 · 2020-08-16 23:15:30 +01:00 · 2020-08-16 23:15:30 +01:00 · 4767ee51fd
commit 4767ee51fd
parent 9e0e1fcbcb
6 changed files with 284 additions and 334 deletions
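
Editorial note on the "almost an extra minute" figure in the commit message. The sketch below is a rough sanity check only. It assumes (the commit does not state this) that the old encoder budgeted roughly 9 to 13 cycles per player opcode on average, while each opcode really took one cycle longer on the 65c02. The helper name `stretched_duration` and the averages tried are hypothetical.

```python
def stretched_duration(nominal_seconds, assumed_cycles, extra_cycles=1):
    """Real playback time when every opcode takes extra_cycles longer than assumed."""
    return nominal_seconds * (assumed_cycles + extra_cycles) / assumed_cycles


song_seconds = 8 * 60  # the 8-minute song from the commit message

# Assumed average opcode lengths; illustrative values, not measured ones.
for assumed_cycles in (9, 10, 13):
    extra = stretched_duration(song_seconds, assumed_cycles) - song_seconds
    print("avg %2d cycles/opcode: +%.1f seconds" % (assumed_cycles, extra))
```

With an assumed 10-cycle average this gives 48 extra seconds, which is the same order as the slowdown described above.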
--- a/README.md
+++ b/README.md
@ -26,43 +26,46 @@ possible.  This includes looking some number of cycles into the future to antici

 The resulting bytestream directs the Apple II to follow this speaker trajectory with cycle-level precision.

-The actual audio playback code is small enough to fit in page 3.  i.e. would have been small enough to type in from a
-magazine back in the day (the megabytes of audio data would have been hard to type in though).  Plus, Uthernets didn't
-exist back then (although a Slinky RAM card would let you do something similar, see Future Work below).
+The actual audio playback code is small enough (~150 bytes) to fit in page 3, i.e. it would have been small enough to type
+in from a magazine back in the day.  The megabytes of audio data would have been hard to type in though ;)  Plus,
+Uthernets didn't exist back then (although a Slinky RAM card would let you do something similar, see Future Work below).

 # Implementation

-## Player
-
 The audio player uses [delta modulation](https://en.wikipedia.org/wiki/Delta_modulation) to produce the audio signal.

 How this works is by modeling the Apple II speaker as an [RC circuit](https://en.wikipedia.org/wiki/RC_circuit).  When
-we tick the speaker (access $C030) it inverts the applied voltage across it, and the speaker responds by moving
-asymptotically towards the new applied voltage level.  With some empirical tuning of the time constant of this RC
-circuit, we can precisely model how the Apple II speaker will respond to voltage changes, and use this to make the
-speaker "trace out" our desired waveform.  We can't do this exactly so there is some left-over quantization noise that
-manifests as background static.
+we access $C030 it inverts the applied voltage across the speaker, and the speaker responds by moving
+asymptotically towards the new applied voltage level.  Left to itself this results in an audio "tick".  With some
+empirical tuning of the time constant of this RC circuit, we can precisely model how the Apple II speaker will respond
+to voltage changes, and use this to make the speaker "trace out" our desired waveform.  We can't do this exactly --
+the speaker will zig-zag around the target waveform because we can only move it in finite steps -- so there is some
+left-over quantization noise that manifests as background static, though in our case this is barely noticeable.

 Delta modulation with an RC circuit is also called "BTC", after https://www.romanblack.com/picsound.htm who described
 a number of variations on these (Apple II-like) audio circuits and Delta modulation audio encoding algorithms.  See e.g.
 Oliver Schmidt's [PLAY.BTC](https://github.com/oliverschmidt/Play-BTc) for an Apple II implementation that plays from
-memory.
+memory at 33KHz.

-The big difference with our approach is that we are able to target a 1-cycle resolution, i.e. modulate the audio at
-1MHz.  The caveat is that we once we toggle the speaker there is a "cooldown period" of 10 cycles (9 cycles on 6502)
-until we can toggle it again, though we can target any period larger than 11 (i.e. possible values are every 10, 12, 13,
-14, ... cycles).  Successive choices are independent.
+The big difference with our approach is that we are able to target a 1MHz sampling rate, i.e. manipulate the speaker
+with 1-cycle precision, by choosing how the "player opcodes" are chained together by the ethernet bytestream.
+The catch is that once we have toggled the speaker we can't toggle it again until at least 10 cycles have passed (9
+cycles on 6502), but we can pick any such interval >= 10 cycles (except for 11 cycles because of 65x02 opcode timing
+limitations).  Successive choices are independent.

-In other words, we are able to choose a precise sequence of clock cycles in which to toggle the speaker, but these
-cannot be spaced too close together.
+In other words, we are able to choose a precise sequence of clock cycles in which to toggle the speaker, but there is a
+"cooldown" period and these cannot be spaced too close together.

-This minimum period of 10 cycles is already short enough that it produces high-quality audio even if we only modulate
-the speaker at a fixed cadence of 10 cycles (i.e. at 102.4KHz), although in practice a fixed 14-cycle period gave better
-audio (10 cycles produces a quiet but audible background tone coming from some kind of harmonic).  The initial version
-of ][-Sound used this approach (and used the "spare" 4 cycles for a page-flipping trick to visualize the audio bitstream
-while playing).
+The minimum period of 10 cycles is already short enough that it produces high-quality audio even if we only modulate
+the speaker at a fixed cadence of 10 cycles (i.e. at 102.4KHz instead of 1MHz), although in practice a fixed 14-cycle
+period gave better quality (10 cycles produces a quiet but audible background tone coming from some kind of harmonic --
+perhaps an interaction with the every-65-cycle "long cycle" of the Apple II).  The initial version of ][-Sound used this
+approach (and also used the "spare" 4 cycles for a page-flipping trick to visualize the audio bitstream while playing).

-The player consists of some ethernet setup code and a core playback loop of "player opcodes", which are the 
+## Player
+
+The player consists of some ethernet setup code and a core playback loop of "player opcodes", which are the basic
+operations that are dispatched to by the bytestream.

 Some other tricks used here:

@ -80,8 +83,8 @@ Some other tricks used here:

 - As with my [\]\[-Vision](https://github.com/KrisKennaway/ii-vision) streaming video+audio player, we schedule a "slow
  path" dispatch to occur every 2KB in the byte stream, and use this to manage the socket buffers (ACK the read 2KB and
-  wait until at least 2KB more is available, which is usually non-blocking).  While doing this we need to maintain the
-  13 cycle cadence so the speaker is in a known trajectory.  We can compensate for this in the audio encoder.
+  wait until at least 2KB more is available, which is usually non-blocking).  While doing this we need to maintain a
+  regular tick cadence so the speaker is in a known trajectory.  We can compensate for this in the audio encoder.

 ## Encoding

@ -98,7 +101,7 @@ choose to schedule during this cycle window.  This makes the encoding exponentia
 it allows us to e.g. anticipate large amplitude changes by pre-moving the speaker to better approximate them.

 This also needs to take into account scheduling the "slow path" every 2048 output bytes, where the Apple II will manage
-the TCP socket buffer while ticking the speaker at a constant cadence (currently chosen to be every 13 cycles).  Since
+the TCP socket buffer while ticking the speaker at a constant cadence (currently chosen to be every 14 cycles).  Since
 we know this is happening we can compensate for it, i.e. look ahead to this upcoming slow path and pre-position the
 speaker so that it introduces the least error during this "dead" period when we're keeping the speaker in a net-neutral
 position.
@ -115,8 +118,9 @@ where:
   making during each clock cycle.  A value of 500 (i.e. moving 1/500 of the distance) seems to be about right for my
   Apple //e.  This corresponds to a time constant of about 500us for the speaker RC circuit.

-*  `lookahead steps` defines how far into the future we want to look when optimizing.  This is exponentially slower
-   since we have to evaluate all 2^N possible combinations of tick/no-tick.  A value of 15-20 gives good quality.
+*  `lookahead steps` defines how many cycles into the future we want to look when optimizing.  This is exponentially
+   slower since we have to evaluate all possible sequences of player opcodes that could be chosen within the lookahead
+   horizon.  A value of 20 gives good quality.

 *  `output.a2s` is the output file to write to.

@ -137,7 +141,7 @@ Hard-coding the ethernet config is not especially user friendly.  This should be
 ### 6502 support

 The player relies heavily on the JMP (indirect) 6502 opcode, which has a different cycle count on the 6502 (5 cycles)
-and 65c02 (6 cycles).  This means the player will be about 10% faster on a 6502 (e.g. II+, Unenhanced //e), but audio
+and 65c02 (6 cycles).  This means the player will be about 10% **faster** on a 6502 (e.g. II+, Unenhanced //e), but audio
 quality will be off until the encoder is made aware of this and able to compensate.

 This might be one of the few pieces of software for which a 65c02 at the same clock speed causes a measurable
@ -152,13 +156,13 @@ optimizations are possible but rewriting in e.g. C++ should give a large perform

 We can tick the speaker more frequently than 10 cycles using a couple of methods:

- chaining multiple STA $C030 together, e.g. to give a 4/.../4/4/9 cadence.
+- chaining multiple STA $C030 together, e.g. to give a 4/.../4/4/10 cadence.

- by exploiting 6502 "false reads".  During the course of executing a 6502 opcode, the CPU may access memory locations
-  multiple times (up to 4 times, during successive clock cycles).  This would give additional options for (partial)
-  control of the speaker in the <10-cycle period regime.
-  
-It remains to be seen to what extent these approaches may effect audio quality.
+- by exploiting 6502 opcodes that repeatedly access memory during execution, including "false reads".  During the course
+  of executing a 6502 opcode, the CPU may access memory locations multiple times (up to 4 times, during successive clock
+  cycles).  This would give additional options for (partial) control of the speaker in the <10-cycle period regime.
+
+Early results suggest that using these exotic opcode variants (e.g. INC $C030) may give a quality boost.

 ### Measure speaker time constants

@ -183,5 +187,5 @@ e.g. Oliver Schmidt's [PLAY.BTC](https://github.com/oliverschmidt/Play-BTc), tho
 audio data for them did not exist.  It should be possible to adapt the ][-sound encoder to produce better-quality audio
 for these existing players.

-I think it should also be possible to improve quality at similar bitrate, through using some of the cycle-level targeting
-techniques (though perhaps not at full 1-cycle resolution).
+I think it should also be possible to improve in-memory playback quality at similar bitrate, through using some of the
+cycle-level targeting techniques (though perhaps not at full 1-cycle resolution).
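
Editorial note on the README's RC-circuit model. The README describes the speaker as moving a fixed fraction (1/500 for the author's Apple //e) of the remaining distance towards the applied voltage each clock cycle, and says this corresponds to a time constant of about 500us. A minimal sketch of that step response, assuming a starting position of 0.0 and a constant applied voltage of +1.0 (`step_response` is a hypothetical helper, not part of the repository):

```python
def step_response(step_size, cycles):
    """Speaker position after `cycles` cycles with a constant applied voltage of +1.0."""
    position = 0.0
    voltage = 1.0
    for _ in range(cycles):
        # Each cycle the speaker covers 1/step_size of the remaining distance.
        position += (voltage - position) / step_size
    return position


# After step_size cycles the speaker has covered roughly 63% of the distance,
# which is what a time constant of about step_size cycles means.
print(step_response(500, 500))
```

At a clock rate of roughly 1MHz, 500 cycles is close to 500us, which is consistent with the figure quoted in the README.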
--- a/encode_audio.py
+++ b/encode_audio.py
@ -1,101 +1,78 @@
 #!/usr/bin/env python3
 # Delta modulation audio encoder.
 #
-# Models the Apple II speaker as an RC circuit with given time constant
-# and computes a sequence of speaker ticks at multiples of 13-cycle intervals
-# to approximate the target audio waveform.
+# Simulates the Apple II speaker at 1MHz (i.e. cycle-level) resolution,
+# by modeling it as an RC circuit with given time constant.  In order to
+# reproduce a target audio waveform, we upscale it to 1MHz sample rate,
+# and compute the sequence of player opcodes to best reproduce this waveform.
 #
-# To optimize the audio quality we look ahead some defined number of steps and
-# choose a speaker trajectory that minimizes errors over this range.  e.g.
-# this allows us to anticipate large amplitude changes by pre-moving
+# Since the player opcodes are chosen to allow ticking the speaker during any
+# given clock cycle (though with some limits on the minimum time
+# between ticks), this means that we are able to control the Apple II speaker
+# with cycle-level precision, which results in high audio fidelity with low
+# noise.
+#
+# To further optimize the audio quality we look ahead some defined number of
+# cycles and choose a speaker trajectory that minimizes errors over this range.
+# e.g. this allows us to anticipate large amplitude changes by pre-moving
 # the speaker to better approximate them.
 #
-# This also needs to take into account scheduling the "slow path" every 2048
-# output bytes, where the Apple II will manage the TCP socket buffer while
-# ticking the speaker every 13 cycles.  Since we know this is happening
-# we can compensate for it, i.e. look ahead to this upcoming slow path and
-# pre-position the speaker so that it introduces the least error during
-# this "dead" period when we're keeping the speaker in a net-neutral position.
+# This also needs to take into account scheduling the "slow path" opcode every
+# 2048 output bytes, where the Apple II will manage the TCP socket buffer while
+# ticking the speaker at a regular cadence of 14 cycles to keep it in a
+# net-neutral position.  When looking ahead we can also (partially)
+# compensate for this "dead" period by pre-positioning.

+import collections
 import sys
 import librosa
 import numpy
-from typing import List, Tuple
 from eta import ETA

 import opcodes

-
-#
-# # TODO: test
-# @functools.lru_cache(None)
-# def lookahead_patterns(
-#         lookahead: int, slowpath_distance: int,
-#         voltage: float) -> numpy.ndarray:
-#     initial_voltage = voltage
-#     patterns = set()
-#
-#     slowpath_pre_bits = 0
-#     slowpath_post_bits = 0
-#     if slowpath_distance <= 0:
-#         slowpath_pre_bits = min(12 + slowpath_distance, lookahead)
-#     elif slowpath_distance <= lookahead:
-#         slowpath_post_bits = lookahead - slowpath_distance
-#
-#     enumerate_bits = lookahead - slowpath_pre_bits - slowpath_post_bits
-#     assert slowpath_pre_bits + enumerate_bits + slowpath_post_bits == lookahead
-#
-#     for i in range(2 ** enumerate_bits):
-#         voltage = initial_voltage
-#         pattern = []
-#         for j in range(slowpath_pre_bits):
-#             voltage = -voltage
-#             pattern.append(voltage)
-#
-#         for j in range(enumerate_bits):
-#             voltage = 1.0 if ((i >> j) & 1) else -1.0
-#             pattern.append(voltage)
-#
-#         for j in range(slowpath_post_bits):
-#             voltage = -voltage
-#             pattern.append(voltage)
-#
-#         patterns.add(tuple(pattern))
-#
-#     res = numpy.array(list(patterns), dtype=numpy.float32)
-#     return res
+# TODO: add flags to parametrize options


 def lookahead(step_size: int, initial_position: float, data: numpy.ndarray,
-              offset: int,
-              voltages: numpy.ndarray):
+              offset: int, voltages: numpy.ndarray):
+    """Evaluate effects of multiple potential opcode sequences and pick best.
+
+    We simulate the speaker voltage trajectory resulting from applying multiple
+    voltage profiles, compute the resulting squared error relative to the
+    target waveform, and pick the best one.
+
+    We use numpy to vectorize the computation since it has better scaling
+    performance with more opcode choices, although also has a larger fixed
+    overhead.
+    """
    positions = numpy.empty((voltages.shape[0], voltages.shape[1] + 1),
                            dtype=numpy.float32)
    positions[:, 0] = initial_position

    target_val = data[offset:offset + voltages.shape[1]]
-    # total_error = numpy.zeros(shape=voltages.shape[0], dtype=numpy.float32)
+    scaled_voltages = voltages / step_size
+
    for i in range(0, voltages.shape[1]):
-        positions[:, i + 1] = positions[:, i] + (
-                voltages[:, i] - positions[:, i]) / step_size
-        # err = numpy.power(numpy.abs(positions - target_val[i]), 2)
-        # total_error += err
-    try:
-        err = positions[:, 1:] - target_val
-    except ValueError:
-        print(offset, len(data), positions.shape, target_val.shape)
-        raise
+        positions[:, i + 1] = (
+                scaled_voltages[:, i] + positions[:, i] * (1 - 1 / step_size))
+    err = positions[:, 1:] - target_val
    total_error = numpy.sum(numpy.power(err, 2), axis=1)

    best = numpy.argmin(total_error)
    return best


+# TODO: share implementation with lookahead
 def evolve(opcode: opcodes.Opcode, starting_position, starting_voltage,
           step_size, data, starting_idx):
-    # Skip ahead to end of this opcode
+    """Apply the effects of playing a single opcode to completion.
+
+    Returns new state.
+    """
+
    opcode_length = opcodes.cycle_length(opcode)
-    voltages = starting_voltage * opcodes.CYCLE_SCHEDULE[opcode]
+    voltages = starting_voltage * opcodes.VOLTAGE_SCHEDULE[opcode]
    position = starting_position
    total_err = 0.0
    v = starting_voltage
@ -105,8 +82,10 @@ def evolve(opcode: opcodes.Opcode, starting_position, starting_voltage,
        total_err += err ** 2
    return position, v, total_err, starting_idx + opcode_length

-@profile
-def sample(data: numpy.ndarray, step: int, lookahead_steps: int):
+
+def audio_bytestream(data: numpy.ndarray, step: int, lookahead_steps: int):
+    """Computes optimal sequence of player opcodes to reproduce audio data."""
+
    dlen = len(data)
    data = numpy.concatenate([data, numpy.zeros(lookahead_steps)]).astype(
        numpy.float32)
@ -119,7 +98,9 @@ def sample(data: numpy.ndarray, step: int, lookahead_steps: int):
    eta = ETA(total=1000)
    i = 0
    last_updated = 0
-    while i < int(dlen / 100):
+    opcode_counts = collections.defaultdict(int)
+
+    while i < dlen:
        if (i - last_updated) > int((dlen / 1000)):
            eta.print_status()
            last_updated = i
@ -131,8 +112,10 @@ def sample(data: numpy.ndarray, step: int, lookahead_steps: int):

        opcode_idx = lookahead(step, position, data, i, voltage * voltages)
        opcode = pruned_opcodes[opcode_idx].opcodes[0]
+        opcode_counts[opcode] += 1
        yield opcode

+        # TODO: round position and memoize, and use in lookahead too
        position, voltage, new_error, i = evolve(
            opcode, position, voltage, step, data, i)

@ -140,18 +123,25 @@ def sample(data: numpy.ndarray, step: int, lookahead_steps: int):
        frame_offset = (frame_offset + 1) % 2048

    for _ in range(frame_offset % 2048, 2047):
-        yield opcodes.Opcode.NOTICK_5
+        yield opcodes.Opcode.NOTICK_6
    yield opcodes.Opcode.EXIT
    eta.done()
    print("Total error %f" % total_err)

+    print("Opcodes used:")
+    for op, count in sorted(opcode_counts.items(), key=lambda kv: kv[1],
+                            reverse=True):
+        print("%s: %d" % (op, count))
+

 def preprocess(
        filename: str, target_sample_rate: int,
        normalize: float = 0.5) -> numpy.ndarray:
+    """Upscale input audio to target sample rate and normalize signal."""
+
    data, _ = librosa.load(filename, sr=target_sample_rate, mono=True)

-    max_value = numpy.percentile(data, 90)
+    max_value = numpy.percentile(data, 100)
    data /= max_value
    data *= normalize

@ -161,13 +151,19 @@ def preprocess(
 def main(argv):
    serve_file = argv[1]
    step = int(argv[2])
+
+    # TODO: if we're not looking ahead beyond the longest (non-slowpath) opcode
+    # then this will reduce quality, e.g. a long NOTICK and TICK will
+    # both look the same over a too-short horizon, but have different results.
    lookahead_steps = int(argv[3])
    out = argv[4]

+    # TODO: PAL Apple ][ clock rate is slightly different
    sample_rate = int(1024. * 1000)
    data = preprocess(serve_file, sample_rate)
+
    with open(out, "wb+") as f:
-        for opcode in sample(data, step, lookahead_steps):
+        for opcode in audio_bytestream(data, step, lookahead_steps):
            f.write(bytes([opcode.value]))
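
Editorial note on `lookahead()`. The selection it performs can be sketched without numpy: simulate each candidate voltage profile with the same recurrence used in the diff (`v / step_size + p * (1 - 1 / step_size)`), sum the squared error against the target, and keep the first candidate with the lowest error. `best_candidate` is a hypothetical name, and this is a simplified illustration rather than the repository's vectorized implementation.

```python
def best_candidate(step_size, initial_position, target, candidates):
    """Index of the voltage profile whose simulated trajectory best tracks target."""
    decay = 1 - 1 / step_size
    best_index, best_error = None, float("inf")
    for index, voltages in enumerate(candidates):
        position = initial_position
        error = 0.0
        for voltage, wanted in zip(voltages, target):
            # Same update as positions[:, i + 1] in lookahead().
            position = voltage / step_size + position * decay
            error += (position - wanted) ** 2
        # Strict comparison keeps the first of several equally good candidates.
        if error < best_error:
            best_index, best_error = index, error
    return best_index


# A target above the current position favours the profile held at +1.0.
print(best_candidate(500, 0.0, [0.1] * 20, [[-1.0] * 20, [1.0] * 20]))
```

Keeping the first of several tied candidates matters because `opcode_choices` sorts by decreasing cycle length, so ties resolve towards the longer opcode and a more compact bytestream.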


--- a/opcodes.py
+++ b/opcodes.py
@ -4,67 +4,71 @@ import numpy
 from typing import Dict, List, Tuple, Iterable


+# TODO: support 6502 cycle counts as well
+
 class Opcode(enum.Enum):
-    TICK_12 = 0x00
-    TICK_17 = 0x08
-    TICK_15 = 0x09
-    TICK_13 = 0x0a
-    TICK_11 = 0x0b
-    TICK_9 = 0x0c
+    """Audio player opcodes representing atomic units of audio playback work."""
+    TICK_17 = 0x00
+    TICK_15 = 0x01
+    TICK_13 = 0x02
+    TICK_14 = 0x0a
+    TICK_12 = 0x0b
+    TICK_10 = 0x0c
+    NOTICK_6 = 0x0f

-    NOTICK_8 = 0x12
-    NOTICK_11 = 0x17
-    NOTICK_9 = 0x18
-    NOTICK_7 = 0x19
-    NOTICK_5 = 0x1a
-    EXIT = 0x1d
-    SLOWPATH = 0x2d
+    EXIT = 0x12
+    SLOWPATH = 0x22


-def make_tick_cycles(length) -> numpy.ndarray:
+def make_tick_voltages(length) -> numpy.ndarray:
+    """Voltage sequence for a NOP; ...; STA $C030; JMP (WDATA)."""
    c = numpy.full(length, 1.0, dtype=numpy.float32)
-    for i in range(length - 6, length):
+    for i in range(length - 7, length):  # TODO: 6502
        c[i] = -1.0
    return c


-def make_notick_cycles(length) -> numpy.ndarray:
+def make_notick_voltages(length) -> numpy.ndarray:
+    """Voltage sequence for a NOP; ...; JMP (WDATA)."""
    return numpy.full(length, 1.0, dtype=numpy.float32)


-def make_slowpath_cycles() -> numpy.ndarray:
-    length = 12 * 13
+def make_slowpath_voltages() -> numpy.ndarray:
+    """Voltage sequence for slowpath TCP processing."""
+    length = 8 * 14 + 10  # TODO: 6502
    c = numpy.full(length, 1.0, dtype=numpy.float32)
    voltage_high = True
-    for i in range(12):
+    for i in range(8):
        voltage_high = not voltage_high
-        for j in range(3 + 13 * i, min(length, 3 + 13 * (i + 1))):
+        for j in range(3 + 14 * i, min(length, 3 + 14 * (i + 1))):
            c[j] = 1.0 if voltage_high else -1.0
    return c


-# XXX rename to voltages
-CYCLE_SCHEDULE = {
-    Opcode.TICK_12: make_tick_cycles(12),
-    Opcode.TICK_17: make_tick_cycles(17),
-    Opcode.TICK_15: make_tick_cycles(15),
-    Opcode.TICK_13: make_tick_cycles(13),
-    Opcode.TICK_11: make_tick_cycles(11),
-    Opcode.TICK_9: make_tick_cycles(9),
-    Opcode.NOTICK_8: make_notick_cycles(8),
-    Opcode.NOTICK_11: make_notick_cycles(11),
-    Opcode.NOTICK_9: make_notick_cycles(9),
-    Opcode.NOTICK_7: make_notick_cycles(7),
-    Opcode.NOTICK_5: make_notick_cycles(5),
-    Opcode.SLOWPATH: make_slowpath_cycles()
+# Sequence of applied voltage inversions that result from executing each player
+# opcode, at each processor cycle.  We assume the starting applied voltage is
+# 1.0.
+VOLTAGE_SCHEDULE = {
+    Opcode.TICK_17: make_tick_voltages(17),
+    Opcode.TICK_15: make_tick_voltages(15),
+    Opcode.TICK_13: make_tick_voltages(13),
+    Opcode.TICK_14: make_tick_voltages(14),
+    Opcode.TICK_12: make_tick_voltages(12),
+    Opcode.TICK_10: make_tick_voltages(10),
+    Opcode.NOTICK_6: make_notick_voltages(6),
+    Opcode.SLOWPATH: make_slowpath_voltages(),
+
 }  # type: Dict[Opcode, numpy.ndarray]


 def cycle_length(op: Opcode) -> int:
-    return len(CYCLE_SCHEDULE[op])
+    """Returns the 65C02 cycle length of a player opcode."""
+    return len(VOLTAGE_SCHEDULE[op])


 class _Opcodes:
+    """Container for immutable Iterable[Opcode], to improve hash performance."""
+
    def __init__(self, opcodes: Iterable[Opcode]):
        self.opcodes = tuple(opcodes)
        self._hash = hash(self.opcodes)
@ -72,31 +76,48 @@ class _Opcodes:
    def __hash__(self):
        return self._hash

+
 # Guarantees each Tuple[Opcode] has a unique _Opcodes representation
-_OPCODES_CACHE = {}
+_OPCODES_SINGLETON = {}


@functools.lru_cache(None)
 def Opcodes(opcodes: Tuple[Opcode]):
-    return _OPCODES_CACHE.setdefault(opcodes, _Opcodes(opcodes))
+    """Returns unique _Opcodes representation for Tuple[Opcode]."""
+    return _OPCODES_SINGLETON.setdefault(opcodes, _Opcodes(opcodes))


@functools.lru_cache(None)
 def opcode_choices(frame_offset: int) -> List[Opcode]:
+    """Returns sorted list of valid opcodes for given frame offset.
+
+    Sorted by decreasing cycle length, so that if two opcodes produce equally
+    good results, we'll pick the one with the longest cycle count to reduce the
+    stream bitrate.
+    """
    if frame_offset == 2047:
        return [Opcode.SLOWPATH]

-    opcodes = set(CYCLE_SCHEDULE.keys()) - {Opcode.SLOWPATH}
-    # Prefer longer opcodes to have a more compact bytestream
-    # XXX if we aren't looking ahead beyond 1 opcode we should
-    # pick the shortest?
+    opcodes = set(VOLTAGE_SCHEDULE.keys()) - {Opcode.SLOWPATH}
    return sorted(list(opcodes), key=cycle_length, reverse=True)


+@functools.lru_cache(None)
+def opcode_lookahead(
+        frame_offset: int,
+        lookahead_cycles: int) -> Tuple[_Opcodes]:
+    """Computes all valid sequences of opcodes spanning lookahead_cycles."""
+
+    return tuple(Opcodes(ops) for ops in
+                 _opcode_lookahead(frame_offset, lookahead_cycles))
+
+
@functools.lru_cache(None)
 def _opcode_lookahead(
        frame_offset: int,
        lookahead_cycles: int) -> Tuple[Tuple[Opcode]]:
+    """Recursively enumerates all valid opcode sequences."""
+
    ch = opcode_choices(frame_offset)
    ops = []
    for op in ch:
@ -104,23 +125,14 @@ def _opcode_lookahead(
            ops.append((op,))
        else:
            for res in _opcode_lookahead((frame_offset + 1) % 2048,
-                                        lookahead_cycles - cycle_length(op)):
+                                         lookahead_cycles - cycle_length(op)):
                ops.append((op,) + res)
-    return tuple(ops)  # XXX type
-
-
-@functools.lru_cache(None)
-def opcode_lookahead(
-        frame_offset: int,
-        lookahead_cycles: int) -> Tuple[_Opcodes]:
-    return tuple(Opcodes(ops) for ops in
-                 _opcode_lookahead(frame_offset, lookahead_cycles))
-
-
-_CYCLES_CACHE = {}
+    return tuple(ops)  # TODO: fix return type


 class Cycles:
+    """Container for immutable Tuple[float], to improve hash performance."""
+
    def __init__(self, cycles: Tuple[float]):
        self.cycles = cycles
        self._hash = hash(cycles)
@ -129,22 +141,36 @@ class Cycles:
        return self._hash


+# Guarantees each Tuple[float] has a unique Cycles representation
+_CYCLES_SINGLETON = {}
+
+
@functools.lru_cache(None)
 def cycle_lookahead(
        opcodes: _Opcodes,
        lookahead_cycles: int
 ) -> Cycles:
+    """Computes the applied voltage effects of a sequence of opcodes.
+
+    i.e. produces the sequence of applied voltage changes that will result
+    from executing these opcodes, limited to the next lookahead_cycles.
+    """
    cycles = []
    for op in opcodes.opcodes:
-        cycles.extend(CYCLE_SCHEDULE[op])
+        cycles.extend(VOLTAGE_SCHEDULE[op])
    trunc_cycles = tuple(cycles[:lookahead_cycles])
-    return _CYCLES_CACHE.setdefault(trunc_cycles, Cycles(trunc_cycles))
+    return _CYCLES_SINGLETON.setdefault(trunc_cycles, Cycles(trunc_cycles))


@functools.lru_cache(None)
 def prune_opcodes(
        opcodes: Tuple[_Opcodes], lookahead_cycles: int
 ) -> Tuple[List[_Opcodes], numpy.ndarray]:
+    """Deduplicate a tuple of opcode sequences that are equivalent.
+
+    For each opcode sequence whose effect is the same when truncated to
+    lookahead_cycles, retains the first such opcode sequence.
+    """
    seen_cycles = set()
    pruned_opcodes = []
    pruned_cycles = []
@ -156,11 +182,4 @@ def prune_opcodes(
        pruned_opcodes.append(ops)
        pruned_cycles.append(cycles.cycles)

-    return pruned_opcodes, numpy.array(pruned_cycles, dtype=numpy.float32)
-
-
-if __name__ == "__main__":
-    lah = 50
-    ops = opcode_lookahead(0, lah)
-    pruned = prune_opcodes(ops, lah)
-    print(len(ops), len(pruned[0]))
+    return pruned_opcodes, numpy.array(pruned_cycles, dtype=numpy.float32)
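
Editorial note on the opcode set. The README and player comments state that every tick interval of at least 10 cycles is reachable except 11. A small check of that claim, assuming the interval between two ticks equals the length of the closing TICK opcode plus any NOTICK_6 opcodes chained before it (the constant and function names here are hypothetical, with cycle lengths copied from the `Opcode` enum in this diff):

```python
TICK_LENGTHS = (10, 12, 13, 14, 15, 17)  # TICK_10 ... TICK_17, 65c02 cycle counts
NOTICK_LENGTH = 6  # NOTICK_6


def reachable_tick_intervals(max_interval):
    """Tick-to-tick intervals of the form TICK_n + k * NOTICK_6, up to max_interval."""
    reachable = set()
    for tick in TICK_LENGTHS:
        interval = tick
        while interval <= max_interval:
            reachable.add(interval)
            interval += NOTICK_LENGTH
    return reachable


def unreachable_intervals(max_interval):
    """Intervals in [10, max_interval] that no opcode combination produces."""
    reachable = reachable_tick_intervals(max_interval)
    return [n for n in range(10, max_interval + 1) if n not in reachable]


print(unreachable_intervals(60))  # [11]
```

The six TICK lengths cover every residue modulo 6, which is why a single NOTICK length is enough to fill in all longer intervals.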
--- a/player/player.dsk
+++ b/player/player.dsk
--- a/player/player.s
+++ b/player/player.s
@ -4,19 +4,22 @@
 ;  Created by Kris Kennaway on 27/07/2020.
 ;  Copyright © 2020 Kris Kennaway. All rights reserved.
 ;
-;  Delta modulation audio player for streaming audio over Ethernet (often called "BTC" in the Apple II community, after
-;  https://www.romanblack.com/picsound.htm who described various Apple II-like audio circuits and audio encoding
-;  algorithms).
+;  Delta modulation audio player for streaming audio over Ethernet.
 ;
-;  How this works is by modeling the Apple II speaker as an RC circuit.  When we tick the speaker it inverts the voltage
-;  across it, and the speaker responds by moving asymptotically towards the new level.  With some empirical tuning of
-;  the time constant of this RC circuit, we can precisely model how the speaker will respond to voltage changes, and use
-;  this to make the speaker "trace out" our desired waveform.  We can't do this precisely so there is some left-over
-;  quantization noise that manifests as background static.
+;  How this works is by modeling the Apple II speaker as an RC circuit.  Delta modulation with an RC circuit is often
+;  called "BTC", after https://www.romanblack.com/picsound.htm.
 ;
-;  This player uses a 13-cycle period, i.e. about 78.7KHz sampling rate.  We could go as low as 9 cycles for the period,
-;  but there is an audible 12.6KHz harmonic that I think is due to interference between the 9 cycle period and the
-;  every-65-cycle "long cycle" of the Apple II CPU.  13 cycles evenly divides 65 so this avoids the harmonic.
+;  When we tick the speaker it inverts the applied voltage across it, and the speaker responds by moving asymptotically
+;  towards the new level.  With some empirical tuning of the time constant of this RC circuit (which seems to be about
+;  500 us), we can precisely model how the speaker will respond to voltage changes, and use this to make the speaker
+;  "trace out" our desired waveform.  We can't do this precisely -- the speaker will zig-zag around the target waveform
+;  because we can only move it in finite steps -- so there is some left-over quantization noise that manifests as
+;  background static.
+;
+;  This player is capable of manipulating the speaker with 1-cycle precision, i.e. a 1MHz sampling rate, depending on
+;  how the "player opcodes" are chained together by the ethernet bytestream.  The catch is that once we have toggled
+;  the speaker we can't toggle it again until at least 10 cycles have passed, but we can pick any interval >= 10 cycles
+;  (except for 11 because of 6502 opcode timing limitations).
 ;
 ;  Some other tricks used here:
 ;
@ -27,7 +30,7 @@
 ;    byte stream to contain the low-order byte of the target address we want to jump to next.
 ;  - Since our 13-cycle period gives us 4 "spare" cycles over the minimal 9, that also lets us do a page-flipping trick
 ;    to visualize the audio bitstream while playing.
-;  - As with my II-Vision streaming video+audio player, we schedule a "slow path" dispatch to occur every 2KB in the
+;  - As with my ][-Vision streaming video+audio player, we schedule a "slow path" dispatch to occur every 2KB in the
 ;    byte stream, and use this to manage the socket buffers (ACK the read 2KB and wait until at least 2KB more is
 ;    available, which is usually non-blocking).  While doing this we need to maintain the 13 cycle cadence so the
 ;    speaker is in a known trajectory.  We can compensate for this in the audio encoder.
@ -92,6 +95,7 @@ STESTABLISHED = $17
 PRODOS      = $BF00 ; ProDOS MLI entry point
 RESET_VECTOR = $3F2 ; Reset vector
 COUT        = $FDED
+HOME        = $FC58

 TICK        = $C030 ; where the magic happens
 TEXTOFF     = $C050
@ -152,7 +156,7 @@ reset_w5100:
    STA WDATA ; SET RECEIVE BUFFER
    STA WDATA ; SET TRANSMIT BUFFER

-; CONFIGRE SOCKET 0 FOR TCP
+; CONFIGURE SOCKET 0 FOR TCP

    LDA #>S0MR
    STA WADRH
@ -260,93 +264,65 @@ setup:
    CPX #(end_copy_page1 - begin_copy_page1+1)
    BNE @0

-    ; pretty colours
-    STA TEXTOFF
-    STA FULLSCR
-
-    LDA #$22
-    LDX #$04
-    LDY #$08
-    JSR fill
-
-    LDA #$66
-    LDX #$08
-    LDY #$0c
-    JSR fill
+    ; clear screen
+    JSR HOME

    ; to restore after checkrecv
    LDY #>RXBASE
-    
    LDA #>S0RXRSR
    STA WADRH
    JMP checkrecv

-fill:
-    STX @1+2
-    STY @2+1
+; The actual player code, which will be copied to $3xx for execution
+;
+; Opcode cycle counts are for the 65c02; on the 6502 they are 1 less because JMP (indirect) takes 5 cycles instead of 6.

-    PHA
-@0:
-    PLA
-    LDX #$00
-@1:
-    STA $0400,X
-    INX
-    CPX #$78
-    BNE @1
-
-    PHA
-    CLC
-    LDA @1+1
-    ADC #$80
-    STA @1+1
-    LDA @1+2
-    ADC #$00
-    STA @1+2
-@2:
-    CMP #$08
-    BNE @0
-    PLA
-    RTS
-
-; The actual player code
+; TODO: evaluate whether it's worth adding longer NOTICK variants.  They are less commonly needed than TICK because
+; we typically don't want to leave the speaker alone for a long period of time - it's unlikely that the target waveform
+; exactly tracks what the speaker will do without intervention.

 begin_copy_page1:

-; $300
-tick_12: ; ticks on cycle 7 of 12
-  STA zpdummy
-  STA $C030
-  JMP (WDATA)
+; combinations of the following tick_even and tick_odd opcodes are enough to recover all tick intervals >= 10 cycles,
+; except for 11:
+;
+;   even tick intervals
+;     10 = TICK_10
+;     12 = TICK_12
+;     14 = TICK_14
+;     16 = NOTICK_6 + TICK_10
+;     18 = NOTICK_6 + TICK_12
+;     20 = NOTICK_6 + TICK_14
+;     22 = NOTICK_6 + NOTICK_6 + TICK_10
+;     24 = ...
+;
+;   odd tick intervals
+;     11 = ?
+;     13 = TICK_13
+;     15 = TICK_15
+;     17 = TICK_17
+;     19 = NOTICK_6 + TICK_13
+;     21 = NOTICK_6 + TICK_15
+;     23 = NOTICK_6 + TICK_17
+;     25 = NOTICK_6 + NOTICK_6 + TICK_13
+;     27 = ...

-; $308
-; ticks on cycle count 2n+4 out of 2n+9, minimum 4 out of 9
-; 9, 11, 13, 15, 17
-; only need up to tick_17 because others come from combinations
-tick_n_odd:
-  NOP
-  NOP
-  NOP
-  NOP
-  STA $C030
-  JMP (WDATA)
+; $300
+tick_odd: ; (NOTICK_6), (TICK_10), TICK_13, TICK_15, TICK_17
+    NOP ; 2
+    NOP ; 2
+    STA zpdummy ; 3
+    STA $C030 ; 4
+    JMP (WDATA) ; 6
+
+; $30a
+tick_even: ; NOTICK_6, TICK_10, TICK_12, TICK_14
+    NOP ; 2
+    NOP ; 2
+    STA $C030 ; 4
+    JMP (WDATA) ; 6

 ; $312
-notick_8:
-  STA zpdummy
-  JMP (WDATA)
-
-; $317
-; 2n+5 cycles, minimum 5
-; only need 5,7,9,11
-; then 13 = 8+5
-notick_n_odd:
-  NOP
-  NOP
-  NOP
-  JMP (WDATA)
-
-; $31d
 ; Quit to ProDOS
 exit:
    INC  RESET_VECTOR+2  ; Invalidate power-up byte
@ -363,15 +339,22 @@ exit_parmtable:

 ; Manage W5100 socket buffer and ACK TCP stream.
 ;
-; In order to simplify the buffer management we expect this ACK opcode to consume
-; the last 4 bytes in a 2K "TCP frame".  i.e. we can assume that we need to consume
-; exactly 2K from the W5100 socket buffer.
+; In order to simplify the buffer management we expect this ACK opcode to consume the last 4 bytes in a 2K "TCP frame".
+; i.e. we can assume that we need to consume exactly 2K from the W5100 socket buffer.
 ;
-; While during this we need to keep ticking the speaker every 13 cycles to maintain the same
-; net position of the speaker cone.  It might be possible to compensate for some other cadence in the encoder,
-; but this risks introducing unwanted harmonics.  We end up ticking 12 times assuming we don't stall waiting for
-; the socket buffer to refill.  In that case audio is already going to be disrupted though.
-slowpath: ;$32d
+; While doing this we need to keep ticking the speaker at a regular cadence to maintain the same net position of the
+; speaker cone.  We choose to tick every 14 cycles, which requires adding in minimal NOP padding.
+;
+; We end up ticking 8 times with 10 cycles left over, assuming we don't stall waiting for the socket buffer to refill.
+;
+; From the point of view of speaker voltages this slowpath is equivalent to the following opcode sequence:
+; TICK_6 (TICK_14 * 7) with 4 cycles left over, adding 4 to the effective n of the next TICK_n we jump to (as chosen by
+; the encoder).
+;
+; If we do stall waiting for data then there is no need to worry about maintaining an even cadence, because audio
+; will already be disrupted (the encoder won't have predicted the stall, so its speaker model will be tracking the
+; wrong position).  The speaker will resynchronize within a few hundred microseconds though.
+slowpath: ;$322
    STA TICK ; 4
    
    ; Save the W5100 address pointer so we can come back here later
@ -381,73 +364,49 @@ slowpath: ;$32d
    
    ; Read Received Read pointer
    LDA #>S0RXRD ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
    STA WADRH ; 4
-    
-    LDX #<S0RXRD ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
-    STX WADRL ; 4
-    NOP ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4 [ 13]
+    STA TICK ; 4 [14]

+    LDX #<S0RXRD ; 2
+    STX WADRL ; 4
    LDA WDATA ; 4 Read high byte
+    STA TICK ; 4 [14]
+
    ; No need to read low byte since it's guaranteed to be 0 since we're at the end of a 2K frame.

    ; Update new Received Read pointer
    ; We have received an additional 2KB
    CLC ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
    ADC #$08 ; 2

    STX WADRL ; 4 Reset address pointer, X still has #<S0RXRD
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
+    NOP ; 2
+    STA TICK ; 4 [14]

    STA WDATA ; 4 Store new high byte
    ; No need to store low byte since it's unchanged at 0

    ; Send the Receive command
    LDA #<S0CR ; 2
-
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
    STA WADRL ; 4
+    STA TICK ; 4 [14]

    LDA #SCRECV ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
    STA WDATA ; 4
    
 checkrecv:
    LDA #<S0RXRSR   ; 2 Socket 0 Received Size register
-    STA zpdummy ; 3
+    LDX #$07 ; 2
+    STA TICK ; 4 [14]

    ; we might loop an unknown number of times here waiting for data but the default should be to fall
    ; straight through
@0:
-    STA TICK        ; 4
    STA WADRL       ; 4
-    LDX #$07; 2  could move out of loop but need to pad cycles anyway
-    STA zpdummy ; 3
-    STA TICK ; 4 [13]
-    
    CPX WDATA       ; 4 High byte of received size
-
-    BCC @1          ; 2
-    BCS @0          ; 3
-    
-@1:
    NOP ; 2
-    STA TICK ; 4 [13]
+    STA TICK ; 4 [14]
+    BCS @0          ; 2 in common case when there is already sufficient data waiting.

    ; point W5100 back into the RX buffer where we left off
    ; There is data to read - we don't care exactly how much because it's at least 2K
@ -458,10 +417,10 @@ checkrecv:
    ; Since we're using an 8K socket, that means we don't have to do any work to manage the read pointer!
    STY WADRH  ; 4
    LDX #$00 ; 2
-    STA zpdummy ; 3
-    STA TICK ; 4
+    NOP ; 2
+    STA TICK ; 4 [14]
    
    STX WADRL  ; 4
-    JMP (WDATA) ; 5
+    JMP (WDATA) ; 6 [10/14]
 end_copy_page1:
 .endproc
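
The tick-interval table in the player comments above can be checked mechanically. The following is a minimal Python sketch and is not part of this commit: `decompose` and `total_cycles` are hypothetical helper names, the opcode names are taken from the comment table, and cycle counts assume the 65c02 timings used there. It picks the shortest TICK_n of matching parity and pads with NOTICK_6.

```python
from typing import List

# Assumption: NOTICK_6 is the bare JMP (WDATA), 6 cycles on the 65c02.
NOTICK_CYCLES = 6


def decompose(interval: int) -> List[str]:
    """Split a tick interval (in cycles) into player opcode names.

    Even intervals are built on TICK_10/12/14, odd ones on TICK_13/15/17,
    with NOTICK_6 padding in front, as in the comment table.
    """
    if interval < 10 or interval == 11:
        raise ValueError(f"tick interval {interval} is not reachable")
    base = 10 if interval % 2 == 0 else 13
    # interval - base is even, so the remainder is always 0, 2 or 4.
    pads, extra = divmod(interval - base, NOTICK_CYCLES)
    return ["NOTICK_6"] * pads + [f"TICK_{base + extra}"]


def total_cycles(opcodes: List[str]) -> int:
    """Sum the cycle counts encoded in the opcode names."""
    return sum(int(name.split("_")[1]) for name in opcodes)
```

For example, `decompose(22)` returns `['NOTICK_6', 'NOTICK_6', 'TICK_10']`, which matches the corresponding row of the table.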
--- a/preprocess_audio.py
+++ b/preprocess_audio.py
@ -1,28 +0,0 @@
-import sys
-import librosa
-import numpy
-import soundfile as sf
-
-
-def preprocess(
-        filename: str, target_sample_rate: int,
-        normalize: float = 0.5) -> numpy.ndarray:
-    data, _ = librosa.load(filename, sr=target_sample_rate, mono=True)
-
-    max_value = numpy.percentile(data, 90)
-    data /= max_value
-    data *= normalize
-
-    return data
-
-
-def main(argv):
-    serve_file = argv[1]
-    out = argv[2]
-    sample_rate = int(1024. * 1000)
-
-    sf.write(out, preprocess(serve_file, sample_rate), sample_rate)
-
-
-if __name__ == "__main__":
-    main(sys.argv)