Commit Graph

134 Commits

Author SHA1 Message Date
kris
8c23824aa6 Optimize more. Should now fit in main memory?! 2019-03-05 22:44:39 +00:00
kris
3cd44cd891 Optimize down to about 42k required for full set of page opcodes
- enough to fit in AUX RAM but still room to go, hopefully will be
  able to fit in MAIN?

Fix some of the off-by-one cycle counts introduced when switching from
STA tick (which is wrong since it accesses twice) to BIT tick.
Hopefully can fix others by reordering?
2019-03-05 22:22:35 +00:00
kris
12d48b664a Prototype audio-only player, which uses a different strategy for
scheduling audio + video.

Use a combined "fat" audio + video opcode that combines several
features:
- constant cycle count of 73 cycles/opcode (=14364 Hz)
- page and content are controlled per opcode
- each opcode does 4 offset stores (hence 57456 stores/sec)
- tick speaker twice per opcode, with varying duty cycles 4 .. 70 in
  units of 2 cycles
  - thus 32 opcodes, or 5-bit audio @ 14364 Hz

The price for this is that we need per-page variants of the opcodes,
and at 53 bytes/opcode they won't (quite) all fit even in AUX RAM.

The good news is that with some further work it should be possible
to reduce this footprint by having opcodes share implementation by
JMPing into a common tail sequence.

Also introduce some ticks in approximately correct places during the
ACK slow path, as a proof of concept that this does mitigate the
clicking.

This works and gives reasonable quality audio!
2019-03-05 21:05:41 +00:00
kris
340a3005d8 Support old-style opcodes that use relative branch addressing, and new
cycle-counted tick opcodes that use absolute addressing.

For now switch to the experimental audio-only player.
2019-03-05 20:51:05 +00:00
kris
2f12407d3c Extract audio channel from movie file and emit 5-bit audio opcodes
at 14KHz.
2019-03-05 20:47:34 +00:00
kris
6e2c83c1e5 Introduction more general notion of update priority used to increase
weight of diffs that persist across multiple frames.

For each frame, zero out update priority of bytes that no longer have
a pending diff, and add the edit distance of the remaining diffs.

Zero these out as opcodes are retired.

Replace hamming distance with Damerau-Levenshtein distance of the
encoded pixel colours in the byte, e.g. 0x2A --> GGG0 (taking into
account the half-pixel)

This has a couple of benefits over hamming distance of the bit patterns:
- transposed pixels are weighted less (edit distance 1, not 2+ for
  Hamming)
- coloured pixels are weighted equally as white pixels (not half as
  much)
- weighting changes in palette bit that flip multiple pixel colours

While I'm here, the RLE opcode should emit run_length - 1 so that we
can encode runs of 256 bytes.
2019-03-04 23:09:00 +00:00
kris
d3522c817f Randomize tie-breaker when pages etc have the same weight, so we
don't consistently prefer larger numbers.

There still seems to be a bug somewhere causing some screen regions to
be consistently not updated, but perhaps I'll find it when working on
the logic to penalize persistent diffs.
2019-03-03 23:25:10 +00:00
kris
a6f32886cd Refactor the various representations of screen memory (bitmap, (x,y)
bytemap, (page,offset) memory map)
- add a FlatMemoryMap that is a linear 8K array
- add converter methods and default constructors that allow converting
  between them
- use MemoryMap as the central representation used by the video encoder
2019-03-03 22:21:28 +00:00
kris
80402f25a5 - Allow HGR ROM entry point
- Don't trap unexpected entrypoint when crossing between regions via RTS
- Implement TICK handler
- Improve status printing in CPU loop
2019-02-27 22:46:53 +00:00
kris
90f696b8e4 Bare-bones py65-based simulator for Apple //e with Uthernet (i.e.
simulating the W5100).  This will hopefully be useful for
troubleshooting and testing player behaviour more precisely, e.g.

- trapping read/write access to unexpected memory areas
- asserting invariants on the processor state across loops
- measuring cycle timing
- tracing program execution

This already gets as far as negotiating the TCP connect.  The major
remaining piece seems to be the TCP buffer management on the W5100 side.
2019-02-27 22:26:35 +00:00
kris
2b3343f374 Encode audio file into cycle timings and emit tick opcodes. Amazingly,
even the naive opcode implementation works!

The main issue is that when we ACK, the speaker cone is allowed to tick
fully.  Maybe optimizing the ACK codepath to be fast enough will help
with this?
2019-02-27 14:49:21 +00:00
kris
9d4edc6c4a Compute median frame similarity. This turns out not to be a great
metric though, because it doesn't penalize artifacts like colour
fringing, or diffs that persist across many frames.
2019-02-27 14:10:39 +00:00
kris
4840efc41e In HeuristicPageFirstScheduler, don't use a deterministic ordering
of pages and content, since we may never get around to some of them
across many frames.  Instead weight by total xor weight for the page,
(page, content) tuple and offset list

Add some other scheduler variants
- prefer content first, then page.  This turns out to introduce a lot
  of colour fringing since we may not ever get back to fix up the
  hanging bit
2019-02-27 14:09:42 +00:00
kris
0ac905a7aa Use decoder symbol table to populate start/end addresses for opcodes. 2019-02-27 12:10:56 +00:00
kris
c139e8bf1b Write symbol table to .dbg file when assembling player
Add explicit end_{opcode} labels to mark (1 byte past) end of opcode.

Rename op_done to op_terminate to match opcode name in encoder.

Extract symbol table in encoder and use this to populate the opcode
start/end addresses.
2019-02-27 12:10:14 +00:00
kris
86066fec61 Makefile 2019-02-24 00:04:13 +00:00
kris
9da18f0ecc Initial working version of video player.
The basic strategy is that we remove as much conditional evaluation as
possible from the inner decode loop.

e.g. rather than doing opcode dispatch by some kind of table lookup
(etc), this is precomputed on the server side.  The next opcode in the
stream is encoded as a branch offset to that opcode's first instruction,
and we modify the BRA instruction in place to dispatch there.

TCP buffer management is also offloaded to the server side; we rely on
the server to explicitly schedule an ACK opcode every 2048 bytes to
drop us into a slow path where we move the W5100 read pointer, send
the TCP ACK, and block until the read socket has enough data to
continue with.

This outer loop is overly conservative (e.g. since we're performing
exactly known read sizes we can omit a lot of duplicate bookkeeping),
i.e. there is a lot of room for optimizing this.

Experimental (i.e. not working yet) support for audio delay loop;
we should be able to leverage the way we do offset-based dispatch to
implement variable-delay loops with some level of cycle resolution.
2019-02-24 00:03:36 +00:00
kris
1b54c9c864 Video() is now aware of target frame rate, and will continue to emit
opcodes until the cycle budget for the frame is exhausted.

Output stream is also now aware of TCP framing, and schedules an ACK
opcode every 2048 output bytes to instruct the client to perform
TCP ACK and buffer management.

Fixes several serious bugs in RLE encoding, including:

- we were emitting the RLE opcode with the next content byte after the
  run completed!
- we were looking at the wrong field for the start offset!
- handle the case where the entire page is a single run
- stop trying to allow accumulating error when RLE -- this does not
  respect the Apple II colour encoding, i.e. may introduce colour
  fringing.
- also because of this we're unlikely to actually be able to find
  many runs because odd and even columns are encoded differently.  In
  a followup we should start encoding odd and even columns separately

Optimize after profiling -- encoder is now about 2x faster

Add tests.
2019-02-23 23:52:25 +00:00
kris
cc6c92335d Implement a much more efficient mechanism for mapping an array between
(x, y) indexing and (page, offset) indexing.  This uses numpy to
construct a new array by indexing into the old one.

In benchmarking this is something like 100x faster.
2019-02-23 23:44:29 +00:00
kris
4178c191db Update cycle timing from working ethernet player.
Add _START and _END addresses that are used by the byte stream to
vector the program counter to the next opcode in the stream.

Support equality testing of opcodes and add tests.

Add an ACK opcode for instructing the client to ACK the TCP stream.

Tick opcode now accepts a cycle argument, for experimenting with
audio support.
2019-02-23 23:38:14 +00:00
kris
e0ab30d074 Fix deprecation warning on newer numpy
Similarity metric should be a float
2019-02-23 23:33:18 +00:00
kris
e4174ed10b Extract out input video decoding into separate module.
Prototype a threaded version of the decoder but this doesn't seem to be
necessary as it's not the bottleneck.

Opcode stream is now aware of frame cycle budget and will keep emitting
until budget runs out -- so no need for fullness estimate.
2019-02-23 23:32:07 +00:00
kris
dc671986a3 Send contents of out.bin file 2019-02-23 23:28:33 +00:00
kris
7deed24ac4 Rename opcode 2019-01-05 23:51:21 +00:00
kris
36fc34d26d Refactor the world 2019-01-05 23:31:56 +00:00
kris
84611ad5e3 WIP: modified version of echo server that reads in page-sized chunks 2019-01-03 23:25:58 +00:00
kris
c797852324 - Stop masking out unchanged bytes explicitly and compare the full
source vs target frame.  This allows us to accumulate runs across
unchanged bytes, if they happen to be the same content value.

- introduce an allowable bit error when building runs, i.e. trade
  some slight imprecision for much more efficient decoding.  This gives
  a slight (~2%) reduction in similarity on my test frames at 140 pixels
  but improves the 280 pixel similarity significantly (~7%)

- so make 280 pixels the default for now

- once the run is complete, compute the median value of each bit in
  the run and use that as content byte.  I also tried mean which had
  exactly the same output

- runs will sometimes now span the (0x7x) screen holes so for now just
  ignore invalid addresses in _write
2019-01-03 17:38:47 +00:00
kris
1c13352106 Implement RLE support, which is more efficient than byte-wise stores
for runs of N >= 4.

Also fix a bug in the decoder that was apparently allowing opcodes to
fall through.  Replace BVC with BRA (i.e. assume 65C02) until I can work
out what is going on
2019-01-03 14:51:57 +00:00
kris
ab4b4f22fd Refactor opcode schedulers and implement one based on the ortools TSP
solver to minimize the cycle cost to visit all changes in our estimated
list.

This is fortunately a tractable (though slow) computation that does give
improvements on the previous heuristic at the level of ~6% better
throughput.

This opcode schedule prefers to group by page and vary over content, so
implement a fast heuristic that does that.  This scheduler is within 2%
of the TSP solution.
2019-01-02 23:10:03 +00:00
kris
a8688a6a7e Memoize hamming_weight to optimize runtime 2019-01-02 22:25:16 +00:00
kris
6de5f1797d Reimplement opcode scheduler to one that is ~as fast as before. As a
bonus we now maintain much better tracking of our target frame rate.

Maintain a running estimate of the opcode scheduling overhead, i.e.
how many opcodes we end up scheduling for each content byte written.

Use this to select an estimated number of screen changes to fill the
cycle budget, ordered by hamming weight of the delta.  Group these
by content byte and then page as before.
2019-01-02 22:16:54 +00:00
kris
8e3f8c9f6d Implement a measure of similarity for two bool arrays and use it to measure
how close we are getting to encoding the target image.
2019-01-02 00:24:25 +00:00
kris
7c5e64fb6f Optimize for cycles/pixel by weighting each output byte by the hamming
weight of the xor of old and new frames, and switch to setting the
new byte directly instead of xor'ing, to improve efficiency of decoder.

Instead of iterating in a fixed order by target byte then page, at
each step compute the next change to make that would maximize
cycles/pixel, including switching page and/or content byte.

This is unfortunately much slower to encode currently but can hopefully
be optimized sufficiently.
2019-01-02 00:03:21 +00:00
kris
0b78e2323a Initial working version of image encoder. This tries to optimize the
bytestream by prioritizing bytes to be XOR'ed that have the highest
hamming weight, i.e. will result in the largest number of pixel
transitions on the screen.

Not especially optimized yet (either runtime, or byte stream)
2019-01-01 21:50:01 +00:00