Roadmap for Flexibly Timed Platforms
Thomas Harte edited this page 2020-12-25 10:45:20 -04:00

Definition

A flexibly-timed platform is any on which exact bus timings, CLK's current stock in trade, are not well-defined or not of the essence.

Unambiguous examples include a 486 PC or a PowerPC Mac; on neither of these platforms does software depend on exact bus timings. Pragmatically speaking it would therefore be detrimental to the end user for CLK to implement them given the greater development and substantially greater runtime costs.

Straddling Platforms

Some platforms begin with single, well-defined bus timings and then grow beyond them. On a macro level this is true of machines like the IBM PC and the Acorn Archimedes; in both cases some software depends on the exact timings of the original versions of the machine, despite the platform subsequently growing in scope such that a strong connection with guaranteed timing is lost.

On a micro level, it is true of some processor families — CLK will generally need to provide bus-accurate implementations of early versions of processors, such as the 68000, while being able to provide only flexibly-timed versions of later family members, such as the 68040.

Intended Implementation

  1. an instruction decoder;
  2. which feeds a JIT threaded-code generator;
  3. whose output is constituted of calls to suitably-templated functions, so as pragmatically to minimise repeat decoding.

For processor families like the 68k, the same instruction decoder can be applied both by the threaded JIT generator and by bus-centric implementations of as many of the early members of the family as is justified.

Instruction Decoders

These will necessarily be unique per CPU family. Expectations off the top of my head:

  • the 68000 decoder can be stateless, a simple mapping from a 16-bit word to a decoded instruction plus length;
  • the PowerPC and ARM decoders (ignoring Thumb) can similarly be stateless, mapping 32-bit words to decoded instructions;
  • x86 decoders will need to be stateful, with an interface that accepts a buffer of bytes plus a length and returns either: (i) a fully decoded instruction and its length; or (ii) an indication that more bytes are required, including how many if that is currently knowable.

Fully-decoded instructions will be fixed-size descriptive structs or bit fields, presumably usually larger than the input data.
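As a sketch of the interface described above (all names here are hypothetical; none of this reflects CLK's actual code), an x86-style buffer-consuming decoder for a toy two-instruction set might look something like:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical fixed-size decoded-instruction struct; in practice this will
// usually be larger than the input bytes it was decoded from.
struct Instruction {
    enum class Operation { NOP, MoveImmediate } operation = Operation::NOP;
    uint8_t length = 0;     // Total bytes consumed by this instruction.
    uint16_t operand = 0;
};

// Either a fully-decoded instruction, or a request for more bytes.
struct DecodeResult {
    std::optional<Instruction> instruction;
    size_t bytes_needed = 0;    // If knowable, how many more bytes are required.
};

// Toy decoder: 0x00 is a one-byte NOP; 0x01 is a move-immediate followed by a
// two-byte little-endian operand. A real x86 decoder would additionally retain
// partial-decode state between calls.
class Decoder {
    public:
        DecodeResult decode(const uint8_t *bytes, size_t length) {
            if(!length) return {std::nullopt, 1};
            if(bytes[0] == 0x00) {
                return {Instruction{Instruction::Operation::NOP, 1, 0}, 0};
            }
            if(length < 3) return {std::nullopt, 3 - length};
            return {Instruction{
                Instruction::Operation::MoveImmediate,
                3,
                uint16_t(bytes[1] | (bytes[2] << 8))
            }, 0};
        }
    };
```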

JIT Threaded-Code Generator

This will likely be a simple list of function pointers; each function will be provided with processor state and the fully-decoded instruction. Executing linearly will likely involve just calling each in turn. Something like a std::map from program counter to entry point can be kept per translated page in order to retain those entry points that are actually in use. Some architectures may require multiple JIT streams per page, subject to instruction alignment rules and the possibility of positioning the program counter partway through what is otherwise a single instruction.

It's likely that a suitably generic implementation of the page cache could be used for all relevant architectures.

Templated Instructions

For the purpose of call cleanliness, it's likely that some degree of instruction composition via templating will be appropriate.

E.g. for the 8086 that might mean providing the decoded version of the ModRM byte as a template parameter in addition to being an argument, permitting addressing mode to be selected by constexpr.

What to include in the template signature will be a question of measure and degree, e.g. it would make little sense to include all three registers nominated by some PowerPC instructions in the corresponding function template as overloading the emulator with generated code would not be a path to good performance.
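As an illustrative sketch of that kind of composition (hypothetical names again; a real 8086 core carries far more state), lifting addressing mode into a template parameter might look like:

```cpp
#include <cstdint>

enum class AddressingMode { Register, Memory };

// Minimal stand-in for 8086 state: one register and a flat memory space.
struct State {
    uint16_t ax = 0;
    uint8_t memory[65536] = {};
};

// Addressing mode is resolved at compile time via if constexpr, so the
// register path carries no runtime branch for mode selection.
template<AddressingMode mode>
uint16_t read_operand(State &state, uint16_t address) {
    if constexpr (mode == AddressingMode::Register) {
        (void)address;
        return state.ax;
    } else {
        return uint16_t(state.memory[address] | (state.memory[address + 1] << 8));
    }
}

// E.g. an ADD whose source is selected by addressing mode; the specific
// address or constant still arrives via the decoded instruction at runtime.
template<AddressingMode mode>
void add_to_ax(State &state, uint16_t source) {
    state.ax += read_operand<mode>(state, source);
}
```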

Likely Development Path

  1. factor out the proto-decoder that currently resides within the bus-centric 68000;
  2. extend it also to decode 68020, 68030 and 68040 instruction sets;
  3. build the JIT threaded-code generator and deploy it coupled to the 68k decoder to provide flexibly-timed 68020, 68030 and 68040s;
  4. use those to build out the higher-function pre-PowerPC Macintoshes.

With a workable pattern suitably discovered, apply the same to other architectures.

Z80 and 6502 Notes

The Z80 and 6502 both almost have instruction decoders already, in the form of their disassemblers. It would probably make sense to formalise those, and doing so wouldn't necessarily be unproductive:

  • Apple used embedded 6502 cores for various helper chips — the IOPs in the Quadras, and the ADB controller in the IIgs and elsewhere. I was seriously considering a static recompilation of the latter in any event, so that could be an alternative baby step into the JIT threaded world;
  • also in 6502 world, it would simplify the logic flow for instruction set selection between the 6502, various 65C02s and possibly the 65816; and
  • for the purposes of the Z80, it might aid in unification of CLK and CP/M for OS X (and, from there, maybe adding support for CP/M-86 as a route into the 8086?).

If a JITty implementation of the IIgs's ADB controller goes well, it might also be the smart move to add an HD6301 for the benefit of the ST and a better IKBD emulation.

Practicality of Templating

I tested the following code, an obvious implementation, with GCC:

#include <cstdio>

template<int x> struct MultiplyByX {
        static int apply(int y) {
                return x*y;
        }
};

template<template<int x> typename Operation, int size> class OperationBank {
        public:
                OperationBank() {
                        fill<0, size>();
                }

                int apply(int x, int y) {
                        return funcs[x](y);
                }

        private:
                template<int offset, int depth> void fill() {
                        if constexpr (!depth) {
                                return;
                        }

                        if constexpr (depth & 1) {
                                funcs[offset + depth - 1] = &Operation<offset + depth - 1>::apply;
                        }

                        fill<offset, depth / 2>();
                        fill<offset + (depth / 2), depth / 2>();
                }

                typedef int (*FuncPtr)(int);
                FuncPtr funcs[size];
};


int main() {
        OperationBank<MultiplyByX, 65536> multipliers;

        // This for loop is contrived, but empirically sufficient to prompt
        // GCC not to optimise out the full construction of multipliers above.
        for(int c = 0; c < 3; c++) {
                printf("%d\n", multipliers.apply(673+c, 3));
        }

        return 0;
}

I then measured build times and binary sizes with GCC at -std=c++17 -O3 on a 2.6GHz i7 for different values of size for the OperationBank instance:

size    Compile Time (s)    Binary Size (bytes)
1024    1.317               160,240
16384   21.001              2,517,720
65536   104.598             10,110,344

The close-enough-to-linear change in binary size was taken as confirmation that GCC really was building the full function-pointer sets; the worse-than-linear jump in compile time was taken as a warning.

Confirmed, then:

  • it really makes sense to template on a minimal set of fundamentals — likely addressing mode only (albeit in the slightly broad sense, e.g. to include direction for the 8086);
  • anything that can branchlessly be read from the instruction stream without undue contortion, such as the specific registers or constants involved, should be read from the instruction stream rather than precompiled into overly-specific functions.

These were sensible assumptions, but are even more sensible as empiricals.

[Caveat: speaking extemporaneously, it's possible that one-function-per-instruction should be tested for the 68k since, if memory serves, each instruction is fully identified by its first word.]