Add explicit end_{opcode} labels marking the address one byte past the
end of each opcode.
Rename op_done to op_terminate to match opcode name in encoder.
Extract symbol table in encoder and use this to populate the opcode
start/end addresses.
The basic strategy is to remove as much conditional evaluation as
possible from the inner decode loop.
For example, rather than dispatching opcodes through some kind of table
lookup, the dispatch is precomputed on the server side. The next opcode
in the stream is encoded as a branch offset to that opcode's first
instruction, and we modify the BRA instruction in place to dispatch there.
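A minimal sketch of how the server might precompute that branch offset.
The symbol names, their addresses and BRA_ADDR are hypothetical and not
taken from the decoder. The one fact relied on is that a 65C02 BRA takes
a signed 8-bit displacement measured from the address following the
two-byte instruction.

```python
# Hypothetical symbol table: opcode name -> address of its first instruction.
SYMBOLS = {"op_nop": 0x8100, "op_ack": 0x8110, "op_terminate": 0x8140}

# Hypothetical address of the BRA instruction that gets patched in place.
BRA_ADDR = 0x80F0


def branch_offset(target: int, bra_addr: int = BRA_ADDR) -> int:
    """Return the operand byte that makes the BRA at bra_addr jump to target."""
    # Relative branches are measured from the address after the 2-byte BRA.
    delta = target - (bra_addr + 2)
    if not -128 <= delta <= 127:
        raise ValueError("target out of range for a relative branch")
    return delta & 0xFF  # two's complement encoding of the signed offset
```

The range check matters for this design: every opcode entry point has to
sit within -128..+127 bytes of the patched BRA, or dispatch needs another
mechanism.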
TCP buffer management is also offloaded to the server side; we rely on
the server to explicitly schedule an ACK opcode every 2048 bytes to
drop us into a slow path where we move the W5100 read pointer, send
the TCP ACK, and block until the read socket has enough data to
continue with.
This outer loop is overly conservative: since we're performing exactly
known read sizes we can omit a lot of duplicate bookkeeping, so there is
a lot of room for optimization.
Experimental (i.e. not working yet) support for an audio delay loop;
we should be able to leverage the way we do offset-based dispatch to
implement variable-delay loops with some level of cycle resolution.
opcodes until the cycle budget for the frame is exhausted.
Output stream is also now aware of TCP framing, and schedules an ACK
opcode every 2048 output bytes to instruct the client to perform
TCP ACK and buffer management.
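A rough sketch of the ACK scheduling described above. It is deliberately
simplified: it treats the stream as single bytes, so it ignores opcode
boundaries and the real encoded size of the ACK opcode. ACK_BYTE and
with_acks are hypothetical names; only the 2048-byte interval comes from
the text.

```python
from typing import Iterable, Iterator

ACK_INTERVAL = 2048  # bytes between ACK opcodes, as stated above
ACK_BYTE = 0xFF      # hypothetical placeholder for the encoded ACK opcode


def with_acks(stream: Iterable[int]) -> Iterator[int]:
    """Yield bytes from stream, making the last byte of each 2048 an ACK."""
    emitted = 0
    for byte in stream:
        # The final slot of every 2048-byte window is reserved for the ACK.
        if emitted % ACK_INTERVAL == ACK_INTERVAL - 1:
            yield ACK_BYTE
            emitted += 1
        yield byte
        emitted += 1
```

A real encoder would also have to avoid splitting a multi-byte opcode
across the window boundary, which this sketch does not attempt.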
Fixes several serious bugs in RLE encoding, including:
- we were emitting the RLE opcode with the next content byte after the
run completed!
- we were looking at the wrong field for the start offset!
- handle the case where the entire page is a single run
- stop allowing error to accumulate when building RLE runs -- this does
not respect the Apple II colour encoding, i.e. may introduce colour
fringing.
- also because of this we're unlikely to actually find many runs,
because odd and even columns are encoded differently. In a followup
we should start encoding odd and even columns separately.
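For reference, exact-match run detection that avoids the first three
bugs listed above could look like the sketch below. The function name,
the tuple layout and the MIN_RUN threshold are assumptions for
illustration, not the encoder's actual interface.

```python
from typing import Iterator, Sequence, Tuple

MIN_RUN = 4  # hypothetical minimum run length worth encoding


def find_runs(page: Sequence[int]) -> Iterator[Tuple[int, int, int]]:
    """Yield (start_offset, length, content) for each run in page."""
    start = 0
    for i in range(1, len(page) + 1):
        # A run ends at the end of the page or when the content changes.
        if i == len(page) or page[i] != page[start]:
            if i - start >= MIN_RUN:
                # Report the run's own content byte, not the byte at page[i].
                yield start, i - start, page[start]
            start = i
```

Letting the loop index reach len(page) is what handles a page that is a
single run: the final run is flushed by the end-of-page check rather
than by a change in content.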
Optimize after profiling -- encoder is now about 2x faster
Add tests.
(x, y) indexing and (page, offset) indexing. This uses numpy to
construct a new array by indexing into the old one.
In benchmarking this is something like 100x faster.
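A sketch of the index-array approach, assuming the screen is held as a
(32, 256) page/offset array and a (192, 40) row/column array of bytes.
Those shapes and all the names below are assumptions. The row address
formula is the standard Apple II hi-res interleave.

```python
import numpy as np

ROWS, COLS = 192, 40  # hi-res screen: 192 rows of 40 bytes


def row_base(y: np.ndarray) -> np.ndarray:
    """Offset of the first byte of row y within the 8 KiB screen buffer."""
    return (y % 8) * 0x400 + ((y // 8) % 8) * 0x80 + (y // 64) * 0x28


# Flat memory offset for every (y, x), computed once up front.
_y, _x = np.mgrid[0:ROWS, 0:COLS]
XY_TO_OFFSET = row_base(_y) + _x


def memory_to_xy(memory: np.ndarray) -> np.ndarray:
    """Convert a (32, 256) page/offset array to a (192, 40) row/column array."""
    return memory.reshape(-1)[XY_TO_OFFSET]


def xy_to_memory(xy: np.ndarray) -> np.ndarray:
    """Inverse conversion; screen holes are left as zero."""
    flat = np.zeros(32 * 256, dtype=xy.dtype)
    flat[XY_TO_OFFSET] = xy
    return flat.reshape(32, 256)
```

Each conversion is a single fancy-indexing operation over a precomputed
index array instead of a Python loop over 7680 bytes.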
Add _START and _END addresses that are used by the byte stream to
vector the program counter to the next opcode in the stream.
Support equality testing of opcodes and add tests.
Add an ACK opcode for instructing the client to ACK the TCP stream.
Tick opcode now accepts a cycle argument, for experimenting with
audio support.
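One cheap way to get value equality for opcodes is to declare them as
dataclasses, as in the sketch below. The class and field names are
hypothetical, and the real opcodes may well implement __eq__ by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    """Hypothetical ACK opcode with no arguments."""


@dataclass(frozen=True)
class Tick:
    """Hypothetical Tick opcode carrying a cycle argument."""
    cycles: int


@dataclass(frozen=True)
class Store:
    """Hypothetical opcode that writes the current content byte at offset."""
    offset: int
```

The generated __eq__ compares fields only between instances of the same
class, so Tick(4) and Store(4) compare unequal even though their field
values match.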
Prototype a threaded version of the decoder but this doesn't seem to be
necessary as it's not the bottleneck.
Opcode stream is now aware of frame cycle budget and will keep emitting
until budget runs out -- so no need for fullness estimate.
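The budget-driven emission can be sketched as a simple loop. The
(name, cycles) tuple layout and the decision to stop before overshooting
the budget are assumptions here; the real stream may stop on a different
condition.

```python
from typing import Iterable, Iterator, Tuple


def fill_frame(
    opcodes: Iterable[Tuple[str, int]], cycle_budget: int
) -> Iterator[str]:
    """Yield opcode names until the next one would exceed cycle_budget."""
    spent = 0
    for name, cycles in opcodes:
        if spent + cycles > cycle_budget:
            break
        spent += cycles
        yield name
```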
source vs target frame. This allows us to accumulate runs across
unchanged bytes, if they happen to be the same content value.
- introduce an allowable bit error when building runs, i.e. trade
some slight imprecision for much more efficient decoding. This gives
a slight (~2%) reduction in similarity on my test frames at 140 pixels
but improves the 280 pixel similarity significantly (~7%)
- so make 280 pixels the default for now
- once the run is complete, compute the median value of each bit in
the run and use that as the content byte. I also tried the mean, which
gave exactly the same output
- runs will sometimes now span the (0x7x) screen holes so for now just
ignore invalid addresses in _write
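The per-bit median step might look like the sketch below. The function
name is hypothetical, and the tie-breaking rule (an even split rounds to
0) is an assumption.

```python
import numpy as np


def median_content(run: np.ndarray) -> int:
    """Return the byte whose bits are the per-bit median over run."""
    # One row per byte in the run, one column per bit (most significant first).
    bits = np.unpackbits(run.astype(np.uint8).reshape(-1, 1), axis=1)
    # The median of 0/1 values is 0, 0.5 or 1; an even split (0.5) becomes 0.
    median_bits = (np.median(bits, axis=0) > 0.5).astype(np.uint8)
    return int(np.packbits(median_bits)[0])
```

For 0/1 values, thresholding the mean at 0.5 selects exactly the same
bits as thresholding the median, which may be why the two gave identical
output.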
for runs of N >= 4.
Also fix a bug in the decoder that was apparently allowing opcodes to
fall through. Replace BVC with BRA (i.e. assume 65C02) until I can work
out what is going on.
solver to minimize the cycle cost to visit all changes in our estimated
list.
This is fortunately a tractable (though slow) computation, and it
improves throughput by ~6% over the previous heuristic.
This opcode schedule prefers to group by page and vary over content, so
implement a fast heuristic that does that. This scheduler is within 2%
of the TSP solution.
bonus we now maintain much better tracking of our target frame rate.
Maintain a running estimate of the opcode scheduling overhead, i.e.
how many opcodes we end up scheduling for each content byte written.
Use this to select an estimated number of screen changes to fill the
cycle budget, ordered by hamming weight of the delta. Group these
by content byte and then page as before.
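A sketch of that selection step, assuming both frames are (32, 256)
uint8 arrays in page/offset order. The function name, the return layout
and the limit parameter (standing in for the estimated number of screen
changes) are hypothetical.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def prioritized_changes(old: np.ndarray, new: np.ndarray, limit: int):
    """Return up to limit (content, page, offset) tuples, grouped for emission."""
    weight = POPCOUNT[old ^ new]
    pages, offsets = np.nonzero(weight)
    # Highest Hamming weight first; a stable sort keeps memory order on ties.
    keys = -weight[pages, offsets].astype(np.int16)
    order = np.argsort(keys, kind="stable")[:limit]
    chosen = [
        (int(new[p, o]), int(p), int(o))
        for p, o in zip(pages[order], offsets[order])
    ]
    # Group by content byte, then page (then offset within the page).
    return sorted(chosen)
```

The weights are cast to a signed type before negation because negating
uint8 values would wrap around and scramble the ordering.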
weight of the xor of old and new frames, and switch to setting the
new byte directly instead of xor'ing, to improve efficiency of decoder.
Instead of iterating in a fixed order by target byte then page, at
each step compute the next change to make that would maximize pixels
changed per cycle (i.e. minimize cycles/pixel), including switching page
and/or content byte.
This is unfortunately much slower to encode currently but can hopefully
be optimized sufficiently.
bytestream by prioritizing bytes to be XOR'ed that have the highest
hamming weight, i.e. will result in the largest number of pixel
transitions on the screen.
Not especially optimized yet (either runtime or byte stream).