mirror of
https://github.com/ctm/executor.git
synced 2026-01-23 11:16:17 +00:00
All the Executor source as-is from the subversion tree it was last worked on.
This commit is contained in:
426
docs/SynPaper
Normal file
426
docs/SynPaper
Normal file
@@ -0,0 +1,426 @@
|
||||
Syn68k: ARDI's dynamically compiling 68LC040 emulator
|
||||
|
||||
Mat Hostetter <mat@ardi.com>
|
||||
|
||||
|
||||
I. Overview
|
||||
|
||||
This document is meant to give a concise technical summary of how
|
||||
syn68k works.
|
||||
|
||||
"Syn68k", ARDI's 68LC040 emulator, is both highly portable and
|
||||
fast. The portable core of syn68k, which works by dynamically
|
||||
compiling 680x0 code into an efficient interpreted form, was designed
|
||||
to run on all major CPU's. On supported architectures, syn68k can
|
||||
also translate 680x0 code into native code that the host processor
|
||||
can run directly.
|
||||
|
||||
|
||||
II. Syngen
|
||||
|
||||
ARDI's "syngen" system analyzes a lisp-like file describing the
|
||||
bit patterns and semantics of the 680x0 instruction set and produces
|
||||
lookup tables and C code for the runtime system to use. This
|
||||
process takes place only when syn68k is built, so we can afford
|
||||
extensive analysis here. The code and tables generated by syngen
|
||||
depend somewhat on the characteristics of the host processor; for
|
||||
example, on a little endian machine it is advantageous to byte swap
|
||||
some extracted 680x0 operands at compile time instead of at runtime.
|
||||
|
||||
The 680x0 description file can describe multiple ways to emulate
|
||||
any particular 680x0 opcode. The runtime system looks at what CC
|
||||
bits are live after the instruction and chooses the fastest variant
|
||||
it can legally use. In the following example, we have two CC
|
||||
variants of lsrw; one computes no CC bits, and the other computes
|
||||
all of them:
|
||||
|
||||
|
||||
(defopcode lsrw_ea
|
||||
(list 68000 amode_alterable_memory () (list "1110001011mmmmmm"))
|
||||
(list "-----" "-----" dont_expand
|
||||
(assign $1.muw (>> $1.muw 1)))
|
||||
(list "CN0XZ" "-----" dont_expand
|
||||
(list
|
||||
(assign ccx (assign ccc (& $1.muw 1)))
|
||||
(ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1))))))
|
||||
|
||||
|
||||
The 680x0 description file can also specify which 680x0 operands
|
||||
should be "expanded" to become implicitly known by the corresponding
|
||||
synthetic opcode. For example, fully expanding out "addl dx,dy"
|
||||
would result in 64 synthetic opcodes, one for each combination of
|
||||
data register operands. This results in smaller and faster synthetic
|
||||
opcodes at the expense of increasing the total number of synthetic
|
||||
opcodes. To conserve space, we only expand out common 680x0 opcodes.
|
||||
On host architectures where we can compile to native code, we don't
|
||||
waste space by "expanding out" common synthetic opcodes.
|
||||
|
||||
|
||||
III. Test suites
|
||||
|
||||
ARDI has a large set of test suites that try thousands upon thousands
|
||||
of variations of 680x0 opcodes and compare the results to those
|
||||
generated by a real 68040. These test suites have proven to be an
|
||||
invaluable debugging tool, both as new features are added and as
|
||||
we have ported syn68k to other architectures (notably 80x86, 680x0,
|
||||
i860 and Alpha). Our native code support is so recent that our
|
||||
test suites do not yet adequately test all of the situations that
|
||||
arise when generating native code, but we plan to extend them in
|
||||
the near future.
|
||||
|
||||
|
||||
IV. Interpreted code
|
||||
|
||||
Our interpreted code consists of contiguous sequences of "synthetic
|
||||
opcodes" and their operands. Syngen can generate ANSI C, but when
|
||||
compiled with gcc it uses C language extensions that make synthetic
|
||||
opcodes pointers to the C code responsible for interpreting that
|
||||
opcode. This "threaded interpreting" entirely eliminates switch
|
||||
dispatch and loop overhead.
|
||||
|
||||
To illustrate the above points, here is the assembly language
|
||||
generated for the synthetic opcode that would handle a fully expanded
|
||||
"addl d0,d1" when no CC bit values are required. This is what
|
||||
gcc's 80x86 output looks like (edited for readability) after we
|
||||
run our interpreter through ARDI's Pentium-specific instruction
|
||||
scheduling Perl script:
|
||||
|
||||
movl _d0,%eax ; fetch d0
|
||||
movl (%esi),%edi ; fetch next synthetic opcode
|
||||
addl %eax,_d1 ; do the add
|
||||
addl $4,%esi ; increment synthetic PC
|
||||
jmp *%edi ; jump to next synthetic opcode handler
|
||||
|
||||
We must emphasize that the preceding example is not native code
|
||||
generated by our emulator, but merely a snippet of what gcc generates
|
||||
for our interpreter. This gives you some idea of the efficiency
|
||||
of the portable component of our emulator.
|
||||
|
||||
|
||||
V. Native code
|
||||
|
||||
Syn68k supports optional architecture-specific native code extensions.
|
||||
On systems where they are present, the runtime system tries to
|
||||
generate native code whenever possible. In those rare cases when
|
||||
it cannot, it reverts to our interpreted code. Since syn68k supports
|
||||
both native and synthetic code, the runtime system automatically
|
||||
inserts gateways between the two whenever there is a transition.
|
||||
This approach allows us to gradually phase in native code handlers
|
||||
for most 680x0 instructions while leaving tricky and unimportant
|
||||
rare cases alone.
|
||||
|
||||
Although our native code compilation engine is not architecture-specific,
|
||||
to date we have only implemented an 80x86 back end. The 80x86
|
||||
architecture has achieved such important status in the industry
|
||||
that it makes sense for us to describe how we generate native code
|
||||
for it, even though many of these techniques would not be necessary
|
||||
on RISC architectures.
|
||||
|
||||
We are glad that we implemented the most difficult back end first.
|
||||
We believe that, were we to have started with a RISC back end, we
|
||||
would have quite possibly architected a system where retrofitting
|
||||
the exotic mechanisms necessary for efficient 80x86 support was
|
||||
difficult.
|
||||
|
||||
Three major problems make translating 680x0 code to 80x86 code difficult:
|
||||
|
||||
1) The 80x86 has only 8 registers, while the 680x0 has 16.
|
||||
2) The 80x86 is little endian, while the 680x0 is big endian.
|
||||
3) The 80x86 does not have general-purpose postincrement and predecrement
|
||||
operators, which are used frequently in 680x0 code.
|
||||
|
||||
On the other hand, several factors make the job easier:
|
||||
|
||||
1) The 80x86 has all of the CISC addressing modes commonly used in 680x0 code.
|
||||
2) The 80x86 has CC bits that map directly to their 680x0 counterparts
|
||||
(except for the 680x0's X bit).
|
||||
3) The 80x86 supports 8-, 16- and 32-bit operations, (although it can only
|
||||
support 8 bit operations on four of its registers).
|
||||
4) The 80x86 and 680x0 have analogous conditional branch instructions.
|
||||
5) The 80x86 allows unaligned memory accesses without substantial overhead.
|
||||
|
||||
The toughest problem is the lack of registers. On 32-register RISC
|
||||
architectures it's easy to allocate one RISC register for each
|
||||
680x0 register, but on the 80x86 a different approach is needed.
|
||||
The obvious solution is to perform full-blown inter-block register
|
||||
allocation, but we fear that using traditional compiler techniques
|
||||
would be unacceptably slow.
|
||||
|
||||
For now, we have adopted a simple constraint: between basic blocks,
|
||||
all registers and live CC bits must reside in their canonical home
|
||||
in memory. Within a block, anything goes. So what liberties does
|
||||
syn68k take within a block?
|
||||
|
||||
The 80x86 register set is treated as a cache for recently used
|
||||
680x0 registers, and the 80x86 CC bits are used as a cache for the
|
||||
680x0 CC bits. At any particular point within a block, each 680x0
|
||||
register is either sitting in its memory home or is cached in an
|
||||
80x86 register, and each live 680x0 CC bit is either cached in its
|
||||
80x86 equivalent or stored in its memory home. Cached registers
|
||||
may be in canonical form, may be byte swapped, may have only their
|
||||
low two bytes swapped, or may be offset by a known constant from
|
||||
their actual value.
|
||||
|
||||
Each 680x0 instruction can require that 680x0 registers be cached
|
||||
in particular ways; the compilation engine generates the minimal
|
||||
code needed to satisfy those constraints and then calls a sequence
|
||||
of routines to generate the native code. As each 680x0 instruction
|
||||
is processed, each 680x0 register's cache status is updated. Dirty
|
||||
registers are canonicalized and spilled back to memory at the end
|
||||
of each block (or when we run out of 80x86 registers and we need
|
||||
to make room).
|
||||
|
||||
We allow 680x0 registers to be cached with varying byte orders and
|
||||
offsets so that we can perform the optimizations of lazy byte
|
||||
swapping and lazy constant offsetting. If the 680x0 program loads
|
||||
a register from memory and then ends up writing it out later, we
|
||||
avoid unnecessary byte swaps by not canonicalizing the value
|
||||
immediately. Lazy constant offsetting mitigates the overhead of
|
||||
postincrement and predecrement side effects. For example, this
|
||||
680x0 code:
|
||||
|
||||
pea 0x1
|
||||
pea 0x2
|
||||
pea 0x3
|
||||
pea 0x4
|
||||
...
|
||||
|
||||
becomes this 80x86 code:
|
||||
|
||||
movl _a7,%edi
|
||||
movl $0x01000000,-4(%edi) ; "push" big-endian constant
|
||||
movl $0x02000000,-8(%edi)
|
||||
movl $0x03000000,-12(%edi)
|
||||
movl $0x04000000,-16(%edi)
|
||||
... <more uses of a7 may follow, and they'll use %edi>
|
||||
subl $16,%edi
|
||||
movl $edi,_a7
|
||||
...
|
||||
|
||||
|
||||
As mentioned above, we use the 80x86 condition code bits as a cache
|
||||
for the real 680x0 CC bits. Although live cached CC bits are
|
||||
occasionally spilled back to memory because some 80x86 instruction
|
||||
is about to clobber them, this trick almost always works. Using
|
||||
80x86 CC bits, we can frequently get away with extremely concise
|
||||
code sequences; for example, a 680x0 compare and conditional branch
|
||||
becomes an 80x86 compare and conditional branch.
|
||||
|
||||
|
||||
VI. Self-modifying code
|
||||
|
||||
Like most dynamically compiling emulators, syn68k doesn't detect
|
||||
self-modifying code; the overhead is too high. Fortunately,
|
||||
self-modifying programs don't work on the real 68040 either. We
|
||||
rely on the program making explicit system calls to flush the caches
|
||||
whenever 680x0 code may have been modified or created. Some programs
|
||||
(like HyperCard) flush the caches very often, which can cause real
|
||||
performance headaches if code is continuously recompiled. We have
|
||||
solved this problem by checksumming 680x0 blocks as they are compiled
|
||||
and only decompiling blocks which fail their checksums. This
|
||||
optimization alone sped up some HyperCard stacks by a factor of
|
||||
three or so.
|
||||
|
||||
|
||||
|
||||
VII. Responsiveness
|
||||
|
||||
Responsiveness is a concern for any dynamic compiler. Fortunately,
|
||||
syn68k is largely driven by automatically generated lookup tables
|
||||
so compilation speed is good. Like other dynamic compilers, syn68k
|
||||
only bothers to compile 680x0 code when it encounters it for the
|
||||
first time.
|
||||
|
||||
When syn68k encounters new code, it compiles other 680x0 code that
|
||||
it can reach from there but does not compile through jsr's. Only
|
||||
when a jsr is actually executed does syn68k compile the target
|
||||
routine. Once that target routine is compiled, syn68k modifies
|
||||
the jsr handler to point directly to the target routine so that
|
||||
the jsr will be extremely fast the second time it is executed.
|
||||
We've found that lazily compiling through jsr's does a good job of
|
||||
avoiding compilation lag that might annoy the user.
|
||||
|
||||
Syn68k does not attempt to generate native code for a basic block
|
||||
until that block (or a nearby one) has been executed 50 times.
|
||||
This saves memory and some compilation time, although we haven't
|
||||
noticed any particular sluggishness when compiling to native code.
|
||||
|
||||
|
||||
VIII. Other optimizations
|
||||
|
||||
Syn68k maintains an internal "jsr stack" to speed up the common
|
||||
case of jsr/rts. We realize that the rts address might have been
|
||||
fiddled with, so the rts handler verifies at runtime that the rts
|
||||
address matches the tag on top of the jsr stack. If it matches,
|
||||
syn68k does a fast jump. If it doesn't, syn68k looks up the code
|
||||
corresponding to the rts address in a hash table.
|
||||
|
||||
|
||||
IX. Neat hacks
|
||||
|
||||
The low-level code generation routines for the 80x86 back end are
|
||||
machine generated from assembly language templates. Thousands of
|
||||
operand permutations for 80x86 instructions of interest are run
|
||||
through the system's assembler and analyzed to derive the rules
|
||||
the assembler uses to create binaries. Those rules are encapsulated
|
||||
into C code and compiled into syn68k so we can generate binaries
|
||||
on the fly. Here is a sample template:
|
||||
|
||||
{ "i386_leal_indoff", "", "", "", "", "-",
|
||||
"leal %0(%1),%2",
|
||||
{ "offset", "base", "dst" },
|
||||
{ { SIZE_32, CONSTANT, IN }, { SIZE_32, REGISTER, IN },
|
||||
{ SIZE_32, REGISTER, OUT } } },
|
||||
|
||||
This approach has saved us countless hours of debugging and allows
|
||||
our system to automatically perform the same optimizations as the
|
||||
host system's assembler.
|
||||
|
||||
We've annotated our 80x86 descriptions with information about
|
||||
Pentium pairability so that future versions of syn68k can schedule
|
||||
the native code we generate (we already schedule our main interpreter
|
||||
when we build syn68k).
|
||||
|
||||
|
||||
X. Future optimizations
|
||||
|
||||
We are working on a simple inter-block register allocation algorithm.
|
||||
|
||||
By relocating most 680x0 register->80x86 register moves to the
|
||||
beginning of each block, we can improve Pentium pairability and
|
||||
reduce 80486 and Pentium address generation pipeline stalls.
|
||||
|
||||
Now that we compile to native code, A-line trap overhead is becoming
|
||||
significant. We may soon compile A-line traps to native code that
|
||||
directly calls the appropriate ROMlib routine with the appropriate
|
||||
arguments (checking at runtime to make sure that neither the trap
|
||||
nor the A-line vector has been patched out, of course).
|
||||
|
||||
|
||||
XI. Code examples
|
||||
|
||||
Here are two sample 680x0 code sequences from real applications,
|
||||
and the 80x86 code that syn68k generates for them. We chose these
|
||||
code sequences specifically to showcase several of the techniques
|
||||
we use, so you shouldn't use them as a substitute for benchmarks.
|
||||
Not all 680x0 code translates as well as these examples do, but
|
||||
these examples are far from exotic.
|
||||
|
||||
|
||||
Example 1 (Solarian):
|
||||
|
||||
680x0 code:
|
||||
|
||||
addqb #1,a4@(1)
|
||||
movel #0,d0
|
||||
moveb a4@,d0
|
||||
swap d0
|
||||
clrw d0
|
||||
swap d0
|
||||
asll #2,d0
|
||||
lea a5@(-13462),a0
|
||||
addal d0,a0
|
||||
moveal a0@,a0
|
||||
movel #0,d0
|
||||
moveb a4@(1),d0
|
||||
cmpw a0@,d0
|
||||
bcs 0x3fffee2
|
||||
|
||||
|
||||
80x86 code:
|
||||
|
||||
movl _a4,%edi ; addqb #1,a4@(1)
|
||||
addb $0x1,0x1(%edi)
|
||||
xorl %ebx,%ebx ; movel #0,d0
|
||||
movb (%edi),%bl ; moveb a4@,d0
|
||||
rorl $0x10,%ebx ; swap d0
|
||||
xorw %bx,%bx ; clrw d0
|
||||
rorl $0x10,%ebx ; swap d0
|
||||
shll $0x2,%ebx ; asll #2,d0
|
||||
movl _a5,%esi ; lea a5@(-13462),a0
|
||||
leal 0xffffcb6a(%esi),%edx
|
||||
addl %ebx,%edx ; addal d0,a0
|
||||
movl (%edx),%edx ; moveal a0@,a0
|
||||
xorl %ebx,%ebx ; movel #0,d0
|
||||
movb 0x1(%edi),%bl ; moveb a4@(1),d0
|
||||
bswap %edx ; cmpw a0@,d0
|
||||
movw (%edx),%cx
|
||||
rorw $0x8,%cx
|
||||
cmpw %cx,%bx
|
||||
movl %edx,_a0 ; <spill dirty 68k
|
||||
movl %ebx,_d0 ; registers back to memory>
|
||||
jb 0x6fae0c ; bcs 0x3fffee2
|
||||
jmp 0x6faf0c ; <go to "fall through" code>
|
||||
|
||||
|
||||
|
||||
|
||||
Example 2 (PageMaker):
|
||||
|
||||
680x0 code:
|
||||
|
||||
movel #0,d2
|
||||
moveb d0,d2
|
||||
lslw #8,d0
|
||||
orw d0,d2
|
||||
movel d2,d0
|
||||
swap d2
|
||||
orl d2,d0
|
||||
movel a0,d2
|
||||
lsrb #1,d2
|
||||
bcc 0x3fffed4
|
||||
|
||||
80x86 code:
|
||||
|
||||
xorl %ebx,%ebx ; movel #0,d2
|
||||
movl _d0,%edx ; moveb d0,d2
|
||||
movb %dl,%bl
|
||||
shlw $0x8,%dx ; lslw #8,d0
|
||||
orw %dx,%bx ; orw d0,d2
|
||||
movl %ebx,%edx ; movel d2,d0
|
||||
rorl $0x10,%ebx ; swap d2
|
||||
orl %ebx,%edx ; orl d2,d0
|
||||
movl _a0,%ecx ; movel a0,d2
|
||||
movl %ecx,%ebx
|
||||
shrb %bl ; lsrb #1,d2
|
||||
movl %ebx,_d2 ; <spill dirty 68k
|
||||
movl %edx,_d0 ; registers back to memory>
|
||||
jae 0x3b734c ; bcc 0x3fffed4
|
||||
jmp 0x43d48c ; <go to "fall through" 68k code>
|
||||
|
||||
|
||||
XII. Benchmarks
|
||||
|
||||
These performance numbers were computed with Speedometer 3.23.
|
||||
We've removed the floating point tests from the list since they do
|
||||
not measure syn68k's speed. Syn68k contains no special provisions
|
||||
for Speedometer's benchmarks and we believe that these numbers are
|
||||
indicative of syn68k's performance for many other CPU-intensive
|
||||
programs.
|
||||
|
||||
|
||||
Quadra Pentium 486DX4 486DX/2
|
||||
610 90MHz 75MHz 66MHz
|
||||
------ ------ ------ ------
|
||||
CPU 16.018 28.833 15.727 13.840
|
||||
|
||||
Dhrystones 19.586 21.886 12.084 9.424
|
||||
Tower 18.909 27.130 12.235 11.556
|
||||
Quicksort 17.759 27.105 15.606 13.919
|
||||
Bubble sort 18.409 31.154 19.286 16.875
|
||||
Queens 19.083 38.167 19.083 18.320
|
||||
Puzzle 22.083 44.167 23.661 21.032
|
||||
Permutations 21.019 28.564 11.604 12.242
|
||||
Int. Matrix 24.200 26.469 19.369 16.608
|
||||
Sieve 23.362 60.290 33.982 30.145
|
||||
------ ------ ------ ------
|
||||
Average 20.490 33.881 18.582 16.680
|
||||
|
||||
|
||||
|
||||
Preliminary analysis suggests that we average a roughly 3:1
|
||||
instruction count increase when translating to 80x86 code. We have
|
||||
not yet taken rigorous measurements, but the 3:1 figure is lent
|
||||
some credence by the fact that our 75MHz 486DX4 gets nearly the
|
||||
same performance as our Quadra 610. We believe that inter-block
|
||||
register allocation will noticeably improve this ratio.
|
||||
Reference in New Issue
Block a user