- Change ACK code to perform two dummy stream reads rather than relying
on a preceding NOP to pad the TCP frame to 2K. This fixes the timing
issue that was causing most of the low-frequency ticks.
- Ticks still aren't perfectly aligned during the ACK slow path but
it's almost good enough, i.e. probably no need to actually bother
optimizing the slow path more.
of all possible weights. We encode the two 8-bit inputs into a single
16 bit value instead of dealing with an array of tuples.
Fix an important bug in _compute_delta: old and is_odd were transposed
so we weren't actually subtracting the old deltas! Surprisingly,
when I accidentally fixed this bug in the vectorized version, the video
encoding was much worse! This turned out to be because the edit
distance metric allowed reducing diffs by turning on pixels, which
meant it would tend to do this when "minimizing error" in a way that
was visually unappealing.
To remedy this, introduce a separate notion of substitution cost for
errors, and weight pixel colour changes more highly to discourage them
unless absolutely necessary. This gives very good quality results!
Also vectorize the selection of page offsets and priorities having
a negative error delta, instead of heapifying the entire page.
Also it turns out to be a bit faster to compute (and memoize) the delta
between a proposed content byte and the entire target screen at once,
since we'll end up recomputing the same content diffs multiple times.
(Committing messy version in case I want to revisit some of those
interim versions)
has ticked past the appropriate time.
- optimize the frame encoding a bit
- use int64 consistently to avoid casting
Fix a bug - when retiring an offset, also update our memory map with the
new content, oops
If we run out of changes to index, keep emitting stores for content
at page=32,offset=0 forever
Switch to a weighted D-L implementation so we can weight e.g. different
substitutions differently (e.g. weighting diffs to/from black pixels
differently than color errors)
0x800-0x2000
Place some of the tick opcodes there. This gives enough room for all
but 2 of the op_tick_*_page_n opcodes!
It may be possible to fit the remaining ones into unused RAM in the
language card, but this will require some finesse to get the code in
there. Or maybe I can optimize enough bytes...
0x300 is used by the loader.system, but there is also still 0x400..0x800
if I don't mind messing up the text page, and 0x200 if I can
get away with using the keyboard buffer.
Something is broken with RESET now though, maybe the reset vector is
pointing somewhere orphaned.
- Introduce a new Movie() class that multiplexes audio and video.
- Every N audio frames we grab a new video frame and begin pulling
opcodes from the audio and video streams
- Grab frames from the input video using bmp2dhr if the .BIN file does
not already exist. Run bmp2dhr in a background thread to not block
encoding
- move the output byte streaming from Video to Movie
- For now, manually clip updates to pages > 56 since the client doesn't
support them yet
The way we encode video is now:
- iterate in descending order over update_priority
- begin a new (page, content) opcode
- for all of the other offset bytes in that page, compute the error
between the candidate content byte and the target content byte
- iterate over offsets in order of increasing error and decreasing
update_priority to fill out the remaining opcode
trick of temporarily violating the X=0 invariant (which is only
required in the tick_6 opcode tail path to steal an extra cycle)
to reorder a STA $2000,Y outside of the tick loop.
The cost of this is that we don't have enough pad cycles left to JMP
to the common opcode tail, but I think this still (barely) fits in
main RAM.
- enough to fit in AUX RAM but still room to go, hopefully will be
able to fit in MAIN?
Fix some of the off-by-one cycle counts introduced when switching from
STA tick (which is wrong since it accesses twice) to BIT tick.
Hopefully can fix others by reordering?
scheduling audio + video.
Use a combined "fat" audio + video opcode that combines several
features:
- constant cycle count of 73 cycles/opcode (=14364 Hz)
- page and content are controlled per opcode
- each opcode does 4 offset stores (hence 57456 stores/sec)
- tick speaker twice per opcode, with varying duty cycles 4 .. 70 in
units of 2 cycles
- thus 32 opcodes, or 5-bit audio @ 14364 Hz
The price for this is that we need per-page variants of the opcodes,
and at 53 bytes/opcode they won't (quite) all fit even in AUX RAM.
The good news is that with some further work it should be possible
to reduce this footprint by having opcodes share implementation by
JMPing into a common tail sequence.
Also introduce some ticks in approximately correct places during the
ACK slow path, as a proof of concept that this does mitigate the
clicking.
This works and gives reasonable quality audio!
weight of diffs that persist across multiple frames.
For each frame, zero out update priority of bytes that no longer have
a pending diff, and add the edit distance of the remaining diffs.
Zero these out as opcodes are retired.
Replace hamming distance with Damerau-Levenshtein distance of the
encoded pixel colours in the byte, e.g. 0x2A --> GGG0 (taking into
account the half-pixel)
This has a couple of benefits over hamming distance of the bit patterns:
- transposed pixels are weighted less (edit distance 1, not 2+ for
Hamming)
- coloured pixels are weighted equally as white pixels (not half as
much)
- weighting changes in palette bit that flip multiple pixel colours
While I'm here, the RLE opcode should emit run_length - 1 so that we
can encode runs of 256 bytes.
don't consistently prefer larger numbers.
There still seems to be a bug somewhere causing some screen regions to
be consistently not updated, but perhaps I'll find it when working on
the logic to penalize persistent diffs.
bytemap, (page,offset) memory map)
- add a FlatMemoryMap that is a linear 8K array
- add converter methods and default constructors that allow converting
between them
- use MemoryMap as the central representation used by the video encoder
simulating the W5100). This will hopefully be useful for
troubleshooting and testing player behaviour more precisely, e.g.
- trapping read/write access to unexpected memory areas
- asserting invariants on the processor state across loops
- measuring cycle timing
- tracing program execution
This already gets as far as negotiating the TCP connect. The major
remaining piece seems to be the TCP buffer management on the W5100 side.
even the naive opcode implementation works!
The main issue is that when we ACK, the speaker cone is allowed to tick
fully. Maybe optimizing the ACK codepath to be fast enough will help
with this?
of pages and content, since we may never get around to some of them
across many frames. Instead weight by total xor weight for the page,
(page, content) tuple and offset list
Add some other scheduler variants
- prefer content first, then page. This turns out to introduce a lot
of colour fringing since we may not ever get back to fix up the
hanging bit
Add explicit end_{opcode} labels to mark (1 byte past) end of opcode.
Rename op_done to op_terminate to match opcode name in encoder.
Extract symbol table in encoder and use this to populate the opcode
start/end addresses.
The basic strategy is that we remove as much conditional evaluation as
possible from the inner decode loop.
e.g. rather than doing opcode dispatch by some kind of table lookup
(etc), this is precomputed on the server side. The next opcode in the
stream is encoded as a branch offset to that opcode's first instruction,
and we modify the BRA instruction in place to dispatch there.
TCP buffer management is also offloaded to the server side; we rely on
the server to explicitly schedule an ACK opcode every 2048 bytes to
drop us into a slow path where we move the W5100 read pointer, send
the TCP ACK, and block until the read socket has enough data to
continue with.
This outer loop is overly conservative (e.g. since we're performing
exactly known read sizes we can omit a lot of duplicate bookkeeping),
i.e. there is a lot of room for optimizing this.
Experimental (i.e. not working yet) support for audio delay loop;
we should be able to leverage the way we do offset-based dispatch to
implement variable-delay loops with some level of cycle resolution.
opcodes until the cycle budget for the frame is exhausted.
Output stream is also now aware of TCP framing, and schedules an ACK
opcode every 2048 output bytes to instruct the client to perform
TCP ACK and buffer management.
Fixes several serious bugs in RLE encoding, including:
- we were emitting the RLE opcode with the next content byte after the
run completed!
- we were looking at the wrong field for the start offset!
- handle the case where the entire page is a single run
- stop trying to allow accumulating error when RLE -- this does not
respect the Apple II colour encoding, i.e. may introduce colour
fringing.
- also because of this we're unlikely to actually be able to find
many runs because odd and even columns are encoded differently. In
a followup we should start encoding odd and even columns separately
Optimize after profiling -- encoder is now about 2x faster
Add tests.
(x, y) indexing and (page, offset) indexing. This uses numpy to
construct a new array by indexing into the old one.
In benchmarking this is something like 100x faster.
Add _START and _END addresses that are used by the byte stream to
vector the program counter to the next opcode in the stream.
Support equality testing of opcodes and add tests.
Add an ACK opcode for instructing the client to ACK the TCP stream.
Tick opcode now accepts a cycle argument, for experimenting with
audio support.
Prototype a threaded version of the decoder but this doesn't seem to be
necessary as it's not the bottleneck.
Opcode stream is now aware of frame cycle budget and will keep emitting
until budget runs out -- so no need for fullness estimate.