The hope was that this would reduce the amount of copying and bit
shifting required by the frontend to get the data on screen, but
it doesn't seem to offer much advantage, surprisingly. I'll leave
it in though. There are a few other minor tweaks included here to
try to improve the performance a bit
I wanted to make this a bit more modular, so it's easier in theory to
write external crates that can reuse bits, and selectively compile in
bits, such as adding new systems or new cpu implementations