mirror of
https://github.com/ctm/executor.git
synced 2024-11-23 05:33:16 +00:00
427 lines
16 KiB
Plaintext
427 lines
16 KiB
Plaintext
Syn68k: ARDI's dynamically compiling 68LC040 emulator
|
|
|
|
Mat Hostetter <mat@ardi.com>
|
|
|
|
|
|
I. Overview
|
|
|
|
This document is meant to give a concise technical summary of how
|
|
syn68k works.
|
|
|
|
"Syn68k", ARDI's 68LC040 emulator, is both highly portable and
|
|
fast. The portable core of syn68k, which works by dynamically
|
|
compiling 680x0 code into an efficient interpreted form, was designed
|
|
to run on all major CPU's. On supported architectures, syn68k can
|
|
also translate 680x0 code into native code that the host processor
|
|
can run directly.
|
|
|
|
|
|
II. Syngen
|
|
|
|
ARDI's "syngen" system analyzes a lisp-like file describing the
|
|
bit patterns and semantics of the 680x0 instruction set and produces
|
|
lookup tables and C code for the runtime system to use. This
|
|
process takes place only when syn68k is built, so we can afford
|
|
extensive analysis here. The code and tables generated by syngen
|
|
depend somewhat on the characteristics of the host processor; for
|
|
example, on a little endian machine it is advantageous to byte swap
|
|
some extracted 680x0 operands at compile time instead of at runtime.
|
|
|
|
The 680x0 description file can describe multiple ways to emulate
|
|
any particular 680x0 opcode. The runtime system looks at what CC
|
|
bits are live after the instruction and chooses the fastest variant
|
|
it can legally use. In the following example, we have two CC
|
|
variants of lsrw; one computes no CC bits, and the other computes
|
|
all of them:
|
|
|
|
|
|
(defopcode lsrw_ea
|
|
(list 68000 amode_alterable_memory () (list "1110001011mmmmmm"))
|
|
(list "-----" "-----" dont_expand
|
|
(assign $1.muw (>> $1.muw 1)))
|
|
(list "CN0XZ" "-----" dont_expand
|
|
(list
|
|
(assign ccx (assign ccc (& $1.muw 1)))
|
|
(ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1))))))
|
|
|
|
|
|
The 680x0 description file can also specify which 680x0 operands
|
|
should be "expanded" to become implicitly known by the corresponding
|
|
synthetic opcode. For example, fully expanding out "addl dx,dy"
|
|
would result in 64 synthetic opcodes, one for each combination of
|
|
data register operands. This results in smaller and faster synthetic
|
|
opcodes at the expense of increasing the total number of synthetic
|
|
opcodes. To conserve space, we only expand out common 680x0 opcodes.
|
|
On host architectures where we can compile to native code, we don't
|
|
waste space by "expanding out" common synthetic opcodes.
|
|
|
|
|
|
III. Test suites
|
|
|
|
ARDI has a large set of test suites that try thousands upon thousands
|
|
of variations of 680x0 opcodes and compare the results to those
|
|
generated by a real 68040. These test suites have proven to be an
|
|
invaluable debugging tool, both as new features are added and as
|
|
we have ported syn68k to other architectures (notably 80x86, 680x0,
|
|
i860 and Alpha). Our native code support is so recent that our
|
|
test suites do not yet adequately test all of the situations that
|
|
arise when generating native code, but we plan to extend them in
|
|
the near future.
|
|
|
|
|
|
IV. Interpreted code
|
|
|
|
Our interpreted code consists of contiguous sequences of "synthetic
|
|
opcodes" and their operands. Syngen can generate ANSI C, but when
|
|
compiled with gcc it uses C language extensions that make synthetic
|
|
opcodes pointers to the C code responsible for interpreting that
|
|
opcode. This "threaded interpreting" entirely eliminates switch
|
|
dispatch and loop overhead.
|
|
|
|
To illustrate the above points, here is the assembly language
|
|
generated for the synthetic opcode that would handle a fully expanded
|
|
"addl d0,d1" when no CC bit values are required. This is what
|
|
gcc's 80x86 output looks like (edited for readability) after we
|
|
run our interpreter through ARDI's Pentium-specific instruction
|
|
scheduling Perl script:
|
|
|
|
movl _d0,%eax ; fetch d0
|
|
movl (%esi),%edi ; fetch next synthetic opcode
|
|
addl %eax,_d1 ; do the add
|
|
addl $4,%esi ; increment synthetic PC
|
|
jmp *%edi ; jump to next synthetic opcode handler
|
|
|
|
We must emphasize that the preceding example is not native code
|
|
generated by our emulator, but merely a snippet of what gcc generates
|
|
for our interpreter. This gives you some idea of the efficiency
|
|
of the portable component of our emulator.
|
|
|
|
|
|
V. Native code
|
|
|
|
Syn68k supports optional architecture-specific native code extensions.
|
|
On systems where they are present, the runtime system tries to
|
|
generate native code whenever possible. In those rare cases when
|
|
it cannot, it reverts to our interpreted code. Since syn68k supports
|
|
both native and synthetic code, the runtime system automatically
|
|
inserts gateways between the two whenever there is a transition.
|
|
This approach allows us to gradually phase in native code handlers
|
|
for most 680x0 instructions while leaving tricky and unimportant
|
|
rare cases alone.
|
|
|
|
Although our native code compilation engine is not architecture-specific,
|
|
to date we have only implemented an 80x86 back end. The 80x86
|
|
architecture has achieved such important status in the industry
|
|
that it makes sense for us to describe how we generate native code
|
|
for it, even though many of these techniques would not be necessary
|
|
on RISC architectures.
|
|
|
|
We are glad that we implemented the most difficult back end first.
|
|
We believe that, were we to have started with a RISC back end, we
|
|
would have quite possibly architected a system where retrofitting
|
|
the exotic mechanisms necessary for efficient 80x86 support was
|
|
difficult.
|
|
|
|
Three major problems make translating 680x0 code to 80x86 code difficult:
|
|
|
|
1) The 80x86 has only 8 registers, while the 680x0 has 16.
|
|
2) The 80x86 is little endian, while the 680x0 is big endian.
|
|
3) The 80x86 does not have general-purpose postincrement and predecrement
|
|
operators, which are used frequently in 680x0 code.
|
|
|
|
On the other hand, several factors make the job easier:
|
|
|
|
1) The 80x86 has all of the CISC addressing modes commonly used in 680x0 code.
|
|
2) The 80x86 has CC bits that map directly to their 680x0 counterparts
|
|
(except for the 680x0's X bit).
|
|
3) The 80x86 supports 8-, 16- and 32-bit operations, (although it can only
|
|
support 8 bit operations on four of its registers).
|
|
4) The 80x86 and 680x0 have analogous conditional branch instructions.
|
|
5) The 80x86 allows unaligned memory accesses without substantial overhead.
|
|
|
|
The toughest problem is the lack of registers. On 32-register RISC
|
|
architectures it's easy to allocate one RISC register for each
|
|
680x0 register, but on the 80x86 a different approach is needed.
|
|
The obvious solution is to perform full-blown inter-block register
|
|
allocation, but we fear that using traditional compiler techniques
|
|
would be unacceptably slow.
|
|
|
|
For now, we have adopted a simple constraint: between basic blocks,
|
|
all registers and live CC bits must reside in their canonical home
|
|
in memory. Within a block, anything goes. So what liberties does
|
|
syn68k take within a block?
|
|
|
|
The 80x86 register set is treated as a cache for recently used
|
|
680x0 registers, and the 80x86 CC bits are used as a cache for the
|
|
680x0 CC bits. At any particular point within a block, each 680x0
|
|
register is either sitting in its memory home or is cached in an
|
|
80x86 register, and each live 680x0 CC bit is either cached in its
|
|
80x86 equivalent or stored in its memory home. Cached registers
|
|
may be in canonical form, may be byte swapped, may have only their
|
|
low two bytes swapped, or may be offset by a known constant from
|
|
their actual value.
|
|
|
|
Each 680x0 instruction can require that 680x0 registers be cached
|
|
in particular ways; the compilation engine generates the minimal
|
|
code needed to satisfy those constraints and then calls a sequence
|
|
of routines to generate the native code. As each 680x0 instruction
|
|
is processed, each 680x0 register's cache status is updated. Dirty
|
|
registers are canonicalized and spilled back to memory at the end
|
|
of each block (or when we run out of 80x86 registers and we need
|
|
to make room).
|
|
|
|
We allow 680x0 registers to be cached with varying byte orders and
|
|
offsets so that we can perform the optimizations of lazy byte
|
|
swapping and lazy constant offsetting. If the 680x0 program loads
|
|
a register from memory and then ends up writing it out later, we
|
|
avoid unnecessary byte swaps by not canonicalizing the value
|
|
immediately. Lazy constant offsetting mitigates the overhead of
|
|
postincrement and predecrement side effects. For example, this
|
|
680x0 code:
|
|
|
|
pea 0x1
|
|
pea 0x2
|
|
pea 0x3
|
|
pea 0x4
|
|
...
|
|
|
|
becomes this 80x86 code:
|
|
|
|
movl _a7,%edi
|
|
movl $0x01000000,-4(%edi) ; "push" big-endian constant
|
|
movl $0x02000000,-8(%edi)
|
|
movl $0x03000000,-12(%edi)
|
|
movl $0x04000000,-16(%edi)
|
|
... <more uses of a7 may follow, and they'll use %edi>
|
|
subl $16,%edi
|
|
movl $edi,_a7
|
|
...
|
|
|
|
|
|
As mentioned above, we use the 80x86 condition code bits as a cache
|
|
for the real 680x0 CC bits. Although live cached CC bits are
|
|
occasionally spilled back to memory because some 80x86 instruction
|
|
is about to clobber them, this trick almost always works. Using
|
|
80x86 CC bits, we can frequently get away with extremely concise
|
|
code sequences; for example, a 680x0 compare and conditional branch
|
|
becomes an 80x86 compare and conditional branch.
|
|
|
|
|
|
VI. Self-modifying code
|
|
|
|
Like most dynamically compiling emulators, syn68k doesn't detect
|
|
self-modifying code; the overhead is too high. Fortunately,
|
|
self-modifying programs don't work on the real 68040 either. We
|
|
rely on the program making explicit system calls to flush the caches
|
|
whenever 680x0 code may have been modified or created. Some programs
|
|
(like HyperCard) flush the caches very often, which can cause real
|
|
performance headaches if code is continuously recompiled. We have
|
|
solved this problem by checksumming 680x0 blocks as they are compiled
|
|
and only decompiling blocks which fail their checksums. This
|
|
optimization alone sped up some HyperCard stacks by a factor of
|
|
three or so.
|
|
|
|
|
|
|
|
VII. Responsiveness
|
|
|
|
Responsiveness is a concern for any dynamic compiler. Fortunately,
|
|
syn68k is largely driven by automatically generated lookup tables
|
|
so compilation speed is good. Like other dynamic compilers, syn68k
|
|
only bothers to compile 680x0 code when it encounters it for the
|
|
first time.
|
|
|
|
When syn68k encounters new code, it compiles other 680x0 code that
|
|
it can reach from there but does not compile through jsr's. Only
|
|
when a jsr is actually executed does syn68k compile the target
|
|
routine. Once that target routine is compiled, syn68k modifies
|
|
the jsr handler to point directly to the target routine so that
|
|
the jsr will be extremely fast the second time it is executed.
|
|
We've found that lazily compiling through jsr's does a good job of
|
|
avoiding compilation lag that might annoy the user.
|
|
|
|
Syn68k does not attempt to generate native code for a basic block
|
|
until that block (or a nearby one) has been executed 50 times.
|
|
This saves memory and some compilation time, although we haven't
|
|
noticed any particular sluggishness when compiling to native code.
|
|
|
|
|
|
VIII. Other optimizations
|
|
|
|
Syn68k maintains an internal "jsr stack" to speed up the common
|
|
case of jsr/rts. We realize that the rts address might have been
|
|
fiddled with, so the rts handler verifies at runtime that the rts
|
|
address matches the tag on top of the jsr stack. If it matches,
|
|
syn68k does a fast jump. If it doesn't, syn68k looks up the code
|
|
corresponding to the rts address in a hash table.
|
|
|
|
|
|
IX. Neat hacks
|
|
|
|
The low-level code generation routines for the 80x86 back end are
|
|
machine generated from assembly language templates. Thousands of
|
|
operand permutations for 80x86 instructions of interest are run
|
|
through the system's assembler and analyzed to derive the rules
|
|
the assembler uses to create binaries. Those rules are encapsulated
|
|
into C code and compiled into syn68k so we can generate binaries
|
|
on the fly. Here is a sample template:
|
|
|
|
{ "i386_leal_indoff", "", "", "", "", "-",
|
|
"leal %0(%1),%2",
|
|
{ "offset", "base", "dst" },
|
|
{ { SIZE_32, CONSTANT, IN }, { SIZE_32, REGISTER, IN },
|
|
{ SIZE_32, REGISTER, OUT } } },
|
|
|
|
This approach has saved us countless hours of debugging and allows
|
|
our system to automatically perform the same optimizations as the
|
|
host system's assembler.
|
|
|
|
We've annotated our 80x86 descriptions with information about
|
|
Pentium pairability so that future versions of syn68k can schedule
|
|
the native code we generate (we already schedule our main interpreter
|
|
when we build syn68k).
|
|
|
|
|
|
X. Future optimizations
|
|
|
|
We are working on a simple inter-block register allocation algorithm.
|
|
|
|
By relocating most 680x0 register->80x86 register moves to the
|
|
beginning of each block, we can improve Pentium pairability and
|
|
reduce 80486 and Pentium address generation pipeline stalls.
|
|
|
|
Now that we compile to native code, A-line trap overhead is becoming
|
|
significant. We may soon compile A-line traps to native code that
|
|
directly calls the appropriate ROMlib routine with the appropriate
|
|
arguments (checking at runtime to make sure that neither the trap
|
|
nor the A-line vector has been patched out, of course).
|
|
|
|
|
|
XI. Code examples
|
|
|
|
Here are two sample 680x0 code sequences from real applications,
|
|
and the 80x86 code that syn68k generates for them. We chose these
|
|
code sequences specifically to showcase several of the techniques
|
|
we use, so you shouldn't use them as a substitute for benchmarks.
|
|
Not all 680x0 code translates as well as these examples do, but
|
|
these examples are far from exotic.
|
|
|
|
|
|
Example 1 (Solarian):
|
|
|
|
680x0 code:
|
|
|
|
addqb #1,a4@(1)
|
|
movel #0,d0
|
|
moveb a4@,d0
|
|
swap d0
|
|
clrw d0
|
|
swap d0
|
|
asll #2,d0
|
|
lea a5@(-13462),a0
|
|
addal d0,a0
|
|
moveal a0@,a0
|
|
movel #0,d0
|
|
moveb a4@(1),d0
|
|
cmpw a0@,d0
|
|
bcs 0x3fffee2
|
|
|
|
|
|
80x86 code:
|
|
|
|
movl _a4,%edi ; addqb #1,a4@(1)
|
|
addb $0x1,0x1(%edi)
|
|
xorl %ebx,%ebx ; movel #0,d0
|
|
movb (%edi),%bl ; moveb a4@,d0
|
|
rorl $0x10,%ebx ; swap d0
|
|
xorw %bx,%bx ; clrw d0
|
|
rorl $0x10,%ebx ; swap d0
|
|
shll $0x2,%ebx ; asll #2,d0
|
|
movl _a5,%esi ; lea a5@(-13462),a0
|
|
leal 0xffffcb6a(%esi),%edx
|
|
addl %ebx,%edx ; addal d0,a0
|
|
movl (%edx),%edx ; moveal a0@,a0
|
|
xorl %ebx,%ebx ; movel #0,d0
|
|
movb 0x1(%edi),%bl ; moveb a4@(1),d0
|
|
bswap %edx ; cmpw a0@,d0
|
|
movw (%edx),%cx
|
|
rorw $0x8,%cx
|
|
cmpw %cx,%bx
|
|
movl %edx,_a0 ; <spill dirty 68k
|
|
movl %ebx,_d0 ; registers back to memory>
|
|
jb 0x6fae0c ; bcs 0x3fffee2
|
|
jmp 0x6faf0c ; <go to "fall through" code>
|
|
|
|
|
|
|
|
|
|
Example 2 (PageMaker):
|
|
|
|
680x0 code:
|
|
|
|
movel #0,d2
|
|
moveb d0,d2
|
|
lslw #8,d0
|
|
orw d0,d2
|
|
movel d2,d0
|
|
swap d2
|
|
orl d2,d0
|
|
movel a0,d2
|
|
lsrb #1,d2
|
|
bcc 0x3fffed4
|
|
|
|
80x86 code:
|
|
|
|
xorl %ebx,%ebx ; movel #0,d2
|
|
movl _d0,%edx ; moveb d0,d2
|
|
movb %dl,%bl
|
|
shlw $0x8,%dx ; lslw #8,d0
|
|
orw %dx,%bx ; orw d0,d2
|
|
movl %ebx,%edx ; movel d2,d0
|
|
rorl $0x10,%ebx ; swap d2
|
|
orl %ebx,%edx ; orl d2,d0
|
|
movl _a0,%ecx ; movel a0,d2
|
|
movl %ecx,%ebx
|
|
shrb %bl ; lsrb #1,d2
|
|
movl %ebx,_d2 ; <spill dirty 68k
|
|
movl %edx,_d0 ; registers back to memory>
|
|
jae 0x3b734c ; bcc 0x3fffed4
|
|
jmp 0x43d48c ; <go to "fall through" 68k code>
|
|
|
|
|
|
XII. Benchmarks
|
|
|
|
These performance numbers were computed with Speedometer 3.23.
|
|
We've removed the floating point tests from the list since they do
|
|
not measure syn68k's speed. Syn68k contains no special provisions
|
|
for Speedometer's benchmarks and we believe that these numbers are
|
|
indicative of syn68k's performance for many other CPU-intensive
|
|
programs.
|
|
|
|
|
|
Quadra Pentium 486DX4 486DX/2
|
|
610 90MHz 75MHz 66MHz
|
|
------ ------ ------ ------
|
|
CPU 16.018 28.833 15.727 13.840
|
|
|
|
Dhrystones 19.586 21.886 12.084 9.424
|
|
Tower 18.909 27.130 12.235 11.556
|
|
Quicksort 17.759 27.105 15.606 13.919
|
|
Bubble sort 18.409 31.154 19.286 16.875
|
|
Queens 19.083 38.167 19.083 18.320
|
|
Puzzle 22.083 44.167 23.661 21.032
|
|
Permutations 21.019 28.564 11.604 12.242
|
|
Int. Matrix 24.200 26.469 19.369 16.608
|
|
Sieve 23.362 60.290 33.982 30.145
|
|
------ ------ ------ ------
|
|
Average 20.490 33.881 18.582 16.680
|
|
|
|
|
|
|
|
Preliminary analysis suggests that we average a roughly 3:1
|
|
instruction count increase when translating to 80x86 code. We have
|
|
not yet taken rigorous measurements, but the 3:1 figure is lent
|
|
some credence by the fact that our 75MHz 486DX4 gets nearly the
|
|
same performance as our Quadra 610. We believe that inter-block
|
|
register allocation will noticeably improve this ratio.
|