All the Executor source as-is from the subversion tree it was last worked on.

2026-01-23 11:16:17 +00:00 · 2008-10-05 21:48:37 -06:00
commit 3183eb6ea0
2517 changed files with 441747 additions and 0 deletions
--- a/docs/SynPaper
+++ b/docs/SynPaper
@@ -0,0 +1,426 @@
+         Syn68k:  ARDI's dynamically compiling 68LC040 emulator
+
+                      Mat Hostetter <mat@ardi.com>
+
+
+I.  Overview
+
+This document is meant to give a concise technical summary of how
+syn68k works.
+
+"Syn68k", ARDI's 68LC040 emulator, is both highly portable and
+fast.  The portable core of syn68k, which works by dynamically
+compiling 680x0 code into an efficient interpreted form, was designed
+to run on all major CPUs.  On supported architectures, syn68k can
+also translate 680x0 code into native code that the host processor
+can run directly.
+
+
+II.  Syngen
+
+ARDI's "syngen" system analyzes a Lisp-like file describing the
+bit patterns and semantics of the 680x0 instruction set and produces
+lookup tables and C code for the runtime system to use.  This
+process takes place only when syn68k is built, so we can afford
+extensive analysis here.  The code and tables generated by syngen
+depend somewhat on the characteristics of the host processor; for
+example, on a little endian machine it is advantageous to byte swap
+some extracted 680x0 operands at compile time instead of at runtime.
+
+The 680x0 description file can describe multiple ways to emulate
+any particular 680x0 opcode.  The runtime system looks at what CC
+bits are live after the instruction and chooses the fastest variant
+it can legally use.  In the following example, we have two CC
+variants of lsrw; one computes no CC bits, and the other computes
+all of them:
+
+
+(defopcode lsrw_ea
+  (list 68000 amode_alterable_memory () (list "1110001011mmmmmm"))
+  (list "-----" "-----" dont_expand
+	(assign $1.muw (>> $1.muw 1)))
+  (list "CN0XZ" "-----" dont_expand
+	(list
+	 (assign ccx (assign ccc (& $1.muw 1)))
+	 (ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1))))))
+
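The variant selection described above can be sketched roughly as follows.
This is a minimal illustration, not syn68k's actual code: the `cc_variant`
struct, the bit-mask encoding of the CC bits, the cost numbers and
`pick_variant` are all hypothetical.  The only thing taken from the text is
the rule that the runtime picks the fastest variant that still computes every
CC bit that is live after the instruction.

```c
#include <stddef.h>

/* Hypothetical encoding: one mask bit per 680x0 condition code. */
enum { CC_C = 1, CC_N = 2, CC_V = 4, CC_X = 8, CC_Z = 16 };

/* One way of emulating an opcode (illustrative, not syngen's real layout). */
struct cc_variant {
    unsigned computes;   /* CC bits this variant leaves correct */
    int cost;            /* invented relative cost; lower is faster */
};

/* Return the index of the cheapest variant whose computed CC bits cover
 * every bit in `live`, or -1 if no variant can legally be used. */
static int pick_variant(const struct cc_variant *v, size_t n, unsigned live)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((live & ~v[i].computes) != 0)
            continue;                    /* misses a live bit: not legal */
        if (best < 0 || v[i].cost < v[best].cost)
            best = (int)i;
    }
    return best;
}

/* The two lsrw variants from the example: "-----" and "CN0XZ". */
static const struct cc_variant lsrw_variants[2] = {
    { 0, 1 },
    { CC_C | CC_N | CC_V | CC_X | CC_Z, 3 },
};
```

With no CC bits live, `pick_variant(lsrw_variants, 2, 0)` selects the cheap
variant; as soon as any bit is live (for example `CC_Z`), only the full
variant qualifies.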
+
+The 680x0 description file can also specify which 680x0 operands
+should be "expanded" to become implicitly known by the corresponding
+synthetic opcode.  For example, fully expanding out "addl dx,dy"
+would result in 64 synthetic opcodes, one for each combination of
+data register operands.  This results in smaller and faster synthetic
+opcodes at the expense of increasing the total number of synthetic
+opcodes.  To conserve space, we only expand out common 680x0 opcodes.
+On host architectures where we can compile to native code, we don't
+waste space by "expanding out" common 680x0 opcodes into synthetic opcodes.
+
+
+III.  Test suites
+
+ARDI has a large set of test suites that try thousands upon thousands
+of variations of 680x0 opcodes and compare the results to those
+generated by a real 68040.  These test suites have proven to be an
+invaluable debugging tool, both as new features are added and as
+we have ported syn68k to other architectures (notably 80x86, 680x0,
+i860 and Alpha).  Our native code support is so recent that our
+test suites do not yet adequately test all of the situations that
+arise when generating native code, but we plan to extend them in
+the near future.
+
+
+IV.  Interpreted code
+
+Our interpreted code consists of contiguous sequences of "synthetic
+opcodes" and their operands.  Syngen can generate ANSI C, but when
+compiled with gcc it uses C language extensions that make synthetic
+opcodes pointers to the C code responsible for interpreting that
+opcode.  This "threaded interpreting" entirely eliminates switch
+dispatch and loop overhead.
+
+To illustrate the above points, here is the assembly language
+generated for the synthetic opcode that would handle a fully expanded
+"addl d0,d1" when no CC bit values are required.  This is what
+gcc's 80x86 output looks like (edited for readability) after we
+run our interpreter through ARDI's Pentium-specific instruction
+scheduling Perl script:
+
+	movl	_d0,%eax	; fetch d0
+	movl	(%esi),%edi	; fetch next synthetic opcode
+	addl	%eax,_d1	; do the add
+	addl	$4,%esi		; increment synthetic PC
+	jmp	*%edi		; jump to next synthetic opcode handler
+
+We must emphasize that the preceding example is not native code
+generated by our emulator, but merely a snippet of what gcc generates
+for our interpreter.  This gives you some idea of the efficiency
+of the portable component of our emulator.
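
The threaded dispatch described in this section can be sketched in C as
follows.  This is only an illustration under stated assumptions: it needs gcc
or clang (it uses the "labels as values" extension), and the three-opcode
instruction set, `run_threaded`, and the layout of the code stream are
invented for the example rather than taken from syn68k.

```c
#include <stdint.h>

/* Invented three-opcode "synthetic" instruction set for this sketch. */
enum { OP_ADD_D0_D1, OP_ADD_IMM_D0, OP_HALT, OP_COUNT };

#define MAX_WORDS 64

/* Translates `prog` (opcode numbers, with one inline operand after each
 * OP_ADD_IMM_D0) into threaded code, runs it, and returns the final d1.
 * Malformed or oversized programs return d1 unchanged. */
static uint32_t run_threaded(const int32_t *prog, int n,
                             uint32_t d0, uint32_t d1)
{
    static void *const handler[OP_COUNT] = {
        &&add_d0_d1, &&add_imm_d0, &&halt
    };
    intptr_t code[MAX_WORDS];
    const intptr_t *pc = code;
    int i;

    if (n < 0 || n >= MAX_WORDS)
        return d1;

    /* "Compile": replace each opcode number with its handler's address. */
    for (i = 0; i < n; i++) {
        if (prog[i] < 0 || prog[i] >= OP_COUNT)
            return d1;
        code[i] = (intptr_t)handler[prog[i]];
        if (prog[i] == OP_ADD_IMM_D0) {
            if (++i >= n)
                return d1;               /* operand missing */
            code[i] = prog[i];           /* operand word copied verbatim */
        }
    }
    code[n] = (intptr_t)handler[OP_HALT];    /* guarantee termination */

    /* Each handler ends by jumping straight to the next one: there is no
     * central loop and no switch statement. */
#define NEXT() goto *(void *)(*pc++)
    NEXT();

add_d0_d1:
    d1 += d0;
    NEXT();
add_imm_d0:
    d0 += (uint32_t)*pc++;               /* fetch the inline operand */
    NEXT();
halt:
    return d1;
#undef NEXT
}
```

Each handler in this sketch has the same shape as the assembly above: do the
work, fetch the next word of synthetic code, and jump through it.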
+
+
+V.  Native code
+
+Syn68k supports optional architecture-specific native code extensions.
+On systems where they are present, the runtime system tries to
+generate native code whenever possible.  In those rare cases when
+it cannot, it reverts to our interpreted code.  Since syn68k supports
+both native and synthetic code, the runtime system automatically
+inserts gateways between the two whenever there is a transition.
+This approach allows us to gradually phase in native code handlers
+for most 680x0 instructions while leaving tricky and unimportant
+rare cases alone.
+
+Although our native code compilation engine is not architecture-specific,
+to date we have only implemented an 80x86 back end.  The 80x86
+architecture has achieved such important status in the industry
+that it makes sense for us to describe how we generate native code
+for it, even though many of these techniques would not be necessary
+on RISC architectures.
+
+We are glad that we implemented the most difficult back end first.
+Had we started with a RISC back end, we might well have designed a
+system into which the exotic mechanisms needed for efficient 80x86
+support would have been difficult to retrofit.
+
+Three major problems make translating 680x0 code to 80x86 code difficult:
+
+1) The 80x86 has only 8 registers, while the 680x0 has 16.
+2) The 80x86 is little endian, while the 680x0 is big endian.
+3) The 80x86 does not have general-purpose postincrement and predecrement
+   operators, which are used frequently in 680x0 code.
+
+On the other hand, several factors make the job easier:
+
+1) The 80x86 has all of the CISC addressing modes commonly used in 680x0 code.
+2) The 80x86 has CC bits that map directly to their 680x0 counterparts
+   (except for the 680x0's X bit).
+3) The 80x86 supports 8-, 16- and 32-bit operations (although it can only
+   support 8-bit operations on four of its registers).
+4) The 80x86 and 680x0 have analogous conditional branch instructions.
+5) The 80x86 allows unaligned memory accesses without substantial overhead.
+
+The toughest problem is the lack of registers.  On 32-register RISC
+architectures it's easy to allocate one RISC register for each
+680x0 register, but on the 80x86 a different approach is needed.
+The obvious solution is to perform full-blown inter-block register
+allocation, but we fear that using traditional compiler techniques
+would be unacceptably slow.
+
+For now, we have adopted a simple constraint: between basic blocks,
+all registers and live CC bits must reside in their canonical home
+in memory.  Within a block, anything goes.  So what liberties does
+syn68k take within a block?
+
+The 80x86 register set is treated as a cache for recently used
+680x0 registers, and the 80x86 CC bits are used as a cache for the
+680x0 CC bits.  At any particular point within a block, each 680x0
+register is either sitting in its memory home or is cached in an
+80x86 register, and each live 680x0 CC bit is either cached in its
+80x86 equivalent or stored in its memory home.  Cached registers
+may be in canonical form, may be byte swapped, may have only their
+low two bytes swapped, or may be offset by a known constant from
+their actual value.
+
+Each 680x0 instruction can require that 680x0 registers be cached
+in particular ways; the compilation engine generates the minimal
+code needed to satisfy those constraints and then calls a sequence
+of routines to generate the native code.  As each 680x0 instruction
+is processed, each 680x0 register's cache status is updated.  Dirty
+registers are canonicalized and spilled back to memory at the end
+of each block (or when we run out of 80x86 registers and we need
+to make room).
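
One way to picture the per-register cache status is the sketch below.  It is
a guess at the shape of the bookkeeping, not syn68k's real data structure:
the names `cache_form`, `cached_reg`, `canonical_value` and `spill` are
hypothetical, and the convention that a cached value equals the actual value
plus `offset` is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical names for the four cached forms listed above. */
enum cache_form { FORM_CANONICAL, FORM_SWAP32, FORM_SWAP16, FORM_OFFSET };

struct cached_reg {
    uint32_t host_value;    /* what the 80x86 register currently holds */
    enum cache_form form;
    int32_t offset;         /* FORM_OFFSET: host_value == actual + offset */
    int dirty;              /* nonzero if the memory home is stale */
};

/* Reverse all four bytes. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Swap only the low two bytes, leaving the high half alone. */
static uint32_t swap16_low(uint32_t v)
{
    return (v & 0xffff0000u) | ((v & 0xffu) << 8) | ((v >> 8) & 0xffu);
}

/* Recover the register's actual value from whatever form it is cached in. */
static uint32_t canonical_value(const struct cached_reg *r)
{
    switch (r->form) {
    case FORM_SWAP32: return swap32(r->host_value);
    case FORM_SWAP16: return swap16_low(r->host_value);
    case FORM_OFFSET: return r->host_value - (uint32_t)r->offset;
    default:          return r->host_value;
    }
}

/* End of block (or register pressure): canonicalize, then write the value
 * back to its memory home only if it is dirty. */
static void spill(struct cached_reg *r, uint32_t *home)
{
    r->host_value = canonical_value(r);
    r->form = FORM_CANONICAL;
    r->offset = 0;
    if (r->dirty) {
        *home = r->host_value;
        r->dirty = 0;
    }
}
```

A code generator built this way only pays for a byte swap or an add when an
instruction actually demands the canonical form, or when the register is
spilled.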
+
+We allow 680x0 registers to be cached with varying byte orders and
+offsets so that we can perform the optimizations of lazy byte
+swapping and lazy constant offsetting.  If the 680x0 program loads
+a register from memory and then ends up writing it out later, we
+avoid unnecessary byte swaps by not canonicalizing the value
+immediately.  Lazy constant offsetting mitigates the overhead of
+postincrement and predecrement side effects.  For example, this
+680x0 code:
+
+	pea		0x1
+	pea		0x2
+	pea		0x3
+	pea		0x4
+	...
+
+becomes this 80x86 code:
+
+	movl	_a7,%edi
+	movl	$0x01000000,-4(%edi)	; "push" big-endian constant
+	movl	$0x02000000,-8(%edi)
+	movl	$0x03000000,-12(%edi)
+	movl	$0x04000000,-16(%edi)
+	... <more uses of a7 may follow, and they'll use %edi>
+	subl	$16,%edi
+	movl	%edi,_a7
+	...
+
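The pea sequence above can be mimicked in C to show what lazy constant
offsetting buys.  This is a sketch under assumptions: `lazy_sp`, `lazy_pea`
and `flush_sp` are hypothetical names, and guest memory is modelled as a
plain byte array indexed by address.

```c
#include <stdint.h>

/* a7 as cached during a block: the value originally loaded from its memory
 * home plus an adjustment that has not been applied yet. */
struct lazy_sp {
    uint32_t base;      /* value loaded from a7's memory home */
    int32_t pending;    /* accumulated, not-yet-applied adjustment */
};

/* Emulate "pea imm": the predecrement is folded into the pending offset
 * instead of updating the register, and the constant is stored in
 * big-endian byte order. */
static void lazy_pea(struct lazy_sp *sp, uint8_t *mem, uint32_t imm)
{
    uint32_t addr;

    sp->pending -= 4;
    addr = sp->base + (uint32_t)sp->pending;
    mem[addr + 0] = (uint8_t)(imm >> 24);
    mem[addr + 1] = (uint8_t)(imm >> 16);
    mem[addr + 2] = (uint8_t)(imm >> 8);
    mem[addr + 3] = (uint8_t)imm;
}

/* Apply the accumulated adjustment once (the single "subl $16,%edi" in the
 * listing above) and return the canonical a7. */
static uint32_t flush_sp(struct lazy_sp *sp)
{
    sp->base += (uint32_t)sp->pending;
    sp->pending = 0;
    return sp->base;
}
```

Four calls to `lazy_pea` leave `base` untouched and `pending` at -16; the
stack pointer is adjusted only once, when `flush_sp` runs.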
+
+As mentioned above, we use the 80x86 condition code bits as a cache
+for the real 680x0 CC bits.  Although live cached CC bits are
+occasionally spilled back to memory because some 80x86 instruction
+is about to clobber them, this trick almost always works.  Using
+80x86 CC bits, we can frequently get away with extremely concise
+code sequences; for example, a 680x0 compare and conditional branch
+becomes an 80x86 compare and conditional branch.
+
+
+VI.  Self-modifying code
+
+Like most dynamically compiling emulators, syn68k doesn't detect
+self-modifying code; the overhead is too high.  Fortunately, programs
+that modify code without flushing the caches don't work reliably on
+a real 68040 either.  We rely on the program making explicit system
+calls to flush the caches whenever 680x0 code may have been modified
+or created.  Some programs
+(like HyperCard) flush the caches very often, which can cause real
+performance headaches if code is continuously recompiled.  We have
+solved this problem by checksumming 680x0 blocks as they are compiled
+and, when the caches are flushed, discarding the compiled code only
+for blocks which fail their checksums.  This optimization alone sped
+up some HyperCard stacks by a factor of three or so.
+
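The checksum scheme can be sketched as follows.  This is an illustration
only: the paper does not say which checksum function syn68k uses, so the
rotate-and-add sum here is a placeholder, and `compiled_block`,
`note_compiled` and `flush_caches` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct compiled_block {
    const uint8_t *m68k_code;   /* start of the 680x0 bytes for this block */
    size_t length;              /* number of 680x0 bytes covered */
    uint32_t checksum;          /* taken when the block was compiled */
    int valid;                  /* nonzero while the compiled code is usable */
};

/* Placeholder checksum; any cheap function of the bytes would do here. */
static uint32_t block_checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    while (n--)
        sum = ((sum << 1) | (sum >> 31)) + *p++;
    return sum;
}

/* Record the checksum at compile time. */
static void note_compiled(struct compiled_block *b)
{
    b->checksum = block_checksum(b->m68k_code, b->length);
    b->valid = 1;
}

/* Called when the program asks for a cache flush.  Only blocks whose 680x0
 * bytes changed are thrown away; returns how many were discarded. */
static size_t flush_caches(struct compiled_block *blocks, size_t n)
{
    size_t i, discarded = 0;

    for (i = 0; i < n; i++) {
        struct compiled_block *b = &blocks[i];

        if (b->valid &&
            block_checksum(b->m68k_code, b->length) != b->checksum) {
            b->valid = 0;       /* recompiled the next time it is reached */
            discarded++;
        }
    }
    return discarded;
}
```

A program that flushes constantly without changing any code then costs one
checksum pass per flush instead of a full recompile.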
+
+
+VII.  Responsiveness
+
+Responsiveness is a concern for any dynamic compiler.  Fortunately,
+syn68k is largely driven by automatically generated lookup tables
+so compilation speed is good.  Like other dynamic compilers, syn68k
+only bothers to compile 680x0 code when it encounters it for the
+first time.
+
+When syn68k encounters new code, it compiles other 680x0 code that
+it can reach from there but does not compile through jsr's.  Only
+when a jsr is actually executed does syn68k compile the target
+routine.  Once that target routine is compiled, syn68k modifies
+the jsr handler to point directly to the target routine so that
+the jsr will be extremely fast the second time it is executed.
+We've found that lazily compiling through jsr's does a good job of
+avoiding compilation lag that might annoy the user.
+
+Syn68k does not attempt to generate native code for a basic block
+until that block (or a nearby one) has been executed 50 times.
+This saves memory and some compilation time, although we haven't
+noticed any particular sluggishness when compiling to native code.
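
The execution-count trigger might look something like this in C.  It is a
deliberately simplified sketch: `block_stats` and `should_compile_native`
are hypothetical names, and the "or a nearby one" part of the rule is not
modelled, so each block is counted on its own.

```c
/* Threshold taken from the text above. */
enum { NATIVE_THRESHOLD = 50 };

struct block_stats {
    unsigned executions;   /* times the interpreted block has been entered */
    int is_native;         /* nonzero once native code has been requested */
};

/* Call on every entry to an interpreted block.  Returns 1 exactly once,
 * on the entry that reaches the threshold, and 0 otherwise. */
static int should_compile_native(struct block_stats *b)
{
    if (b->is_native)
        return 0;
    if (++b->executions < NATIVE_THRESHOLD)
        return 0;
    b->is_native = 1;
    return 1;
}
```

Code that runs only a handful of times never reaches the threshold and so
never costs any native compilation time or memory.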
+
+
+VIII.  Other optimizations
+
+Syn68k maintains an internal "jsr stack" to speed up the common
+case of jsr/rts.  We realize that the rts address might have been
+fiddled with, so the rts handler verifies at runtime that the rts
+address matches the tag on top of the jsr stack.  If it matches,
+syn68k does a fast jump.  If it doesn't, syn68k looks up the code
+corresponding to the rts address in a hash table.
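
A minimal sketch of the jsr stack idea follows.  The fixed-size circular
layout, the names `jsr_stack`, `jsr_push` and `rts_target`, and the
`slow_lookup` stub standing in for the hash-table lookup are all assumptions
made for illustration; the text only specifies the tag check and the
fallback.

```c
#include <stddef.h>
#include <stdint.h>

#define JSR_STACK_SIZE 16   /* arbitrary size for this sketch */

struct jsr_entry {
    uint32_t tag;    /* 680x0 return address pushed by the jsr */
    void *code;      /* compiled code for that return address */
};

struct jsr_stack {
    struct jsr_entry e[JSR_STACK_SIZE];
    unsigned top;
};

/* Stand-in for the hash-table lookup; it only counts how often the slow
 * path is taken and finds nothing. */
static unsigned slow_lookups;

static void *slow_lookup(uint32_t m68k_addr)
{
    (void)m68k_addr;
    slow_lookups++;
    return NULL;
}

/* On jsr: remember where the matching rts is expected to return.  The
 * stack is circular, so very deep recursion overwrites old entries. */
static void jsr_push(struct jsr_stack *s, uint32_t ret_addr, void *code)
{
    s->top = (s->top + 1) % JSR_STACK_SIZE;
    s->e[s->top].tag = ret_addr;
    s->e[s->top].code = code;
}

/* On rts: if the address really being returned to matches the tag on top
 * of the stack, use the remembered code; otherwise take the slow path. */
static void *rts_target(struct jsr_stack *s, uint32_t ret_addr)
{
    struct jsr_entry *top = &s->e[s->top];

    if (top->code != NULL && top->tag == ret_addr) {
        void *code = top->code;

        top->code = NULL;
        s->top = (s->top + JSR_STACK_SIZE - 1) % JSR_STACK_SIZE;
        return code;                  /* fast path */
    }
    return slow_lookup(ret_addr);     /* return address was changed */
}
```

Because the tag is checked on every rts, a program that rewrites its return
address still gets correct behavior; it just loses the fast path for that
return.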
+
+
+IX.  Neat hacks
+
+The low-level code generation routines for the 80x86 back end are
+machine generated from assembly language templates.  Thousands of
+operand permutations for 80x86 instructions of interest are run
+through the system's assembler and analyzed to derive the rules
+the assembler uses to create binaries.  Those rules are encapsulated
+into C code and compiled into syn68k so we can generate binaries
+on the fly.  Here is a sample template:
+
+  { "i386_leal_indoff", "", "", "", "", "-",
+      "leal %0(%1),%2",
+      { "offset", "base", "dst" },
+      { { SIZE_32, CONSTANT, IN }, { SIZE_32, REGISTER, IN },
+	    { SIZE_32, REGISTER, OUT } } },
+
+This approach has saved us countless hours of debugging and allows
+our system to automatically perform the same optimizations as the
+host system's assembler.
+
+We've annotated our 80x86 descriptions with information about
+Pentium pairability so that future versions of syn68k can schedule
+the native code we generate (we already schedule our main interpreter
+when we build syn68k).
+
+
+X.  Future optimizations
+
+We are working on a simple inter-block register allocation algorithm.
+
+By relocating most 680x0 register->80x86 register moves to the
+beginning of each block, we can improve Pentium pairability and
+reduce 80486 and Pentium address generation pipeline stalls.
+
+Now that we compile to native code, A-line trap overhead is becoming
+significant.  We may soon compile A-line traps to native code that
+directly calls the appropriate ROMlib routine with the appropriate
+arguments (checking at runtime to make sure that neither the trap
+nor the A-line vector has been patched out, of course).
+
+
+XI.  Code examples
+
+Here are two sample 680x0 code sequences from real applications,
+and the 80x86 code that syn68k generates for them.  We chose these
+code sequences specifically to showcase several of the techniques
+we use, so you shouldn't use them as a substitute for benchmarks.
+Not all 680x0 code translates as well as these examples do, but
+these examples are far from exotic.
+
+
+Example 1 (Solarian):
+
+680x0 code:
+
+	addqb	#1,a4@(1)
+	movel	#0,d0
+	moveb	a4@,d0
+	swap	d0
+	clrw	d0
+	swap	d0
+	asll	#2,d0
+	lea	a5@(-13462),a0
+	addal	d0,a0
+	moveal	a0@,a0
+	movel	#0,d0
+	moveb	a4@(1),d0
+	cmpw	a0@,d0
+	bcs	0x3fffee2
+
+
+80x86 code:
+
+	movl	_a4,%edi		; addqb #1,a4@(1)
+	addb	$0x1,0x1(%edi)
+	xorl	%ebx,%ebx		; movel #0,d0
+	movb	(%edi),%bl		; moveb a4@,d0
+	rorl	$0x10,%ebx		; swap d0
+	xorw	%bx,%bx			; clrw d0
+	rorl	$0x10,%ebx		; swap d0
+	shll	$0x2,%ebx		; asll #2,d0
+	movl	_a5,%esi		; lea a5@(-13462),a0
+	leal	0xffffcb6a(%esi),%edx
+	addl	%ebx,%edx		; addal d0,a0
+	movl	(%edx),%edx		; moveal a0@,a0
+	xorl	%ebx,%ebx		; movel #0,d0
+	movb	0x1(%edi),%bl		; moveb a4@(1),d0
+	bswap	%edx			; cmpw a0@,d0
+	movw	(%edx),%cx
+	rorw	$0x8,%cx
+	cmpw	%cx,%bx
+	movl	%edx,_a0		; <spill dirty 68k
+	movl	%ebx,_d0		;  registers back to memory>
+	jb	0x6fae0c		; bcs 0x3fffee2
+	jmp	0x6faf0c		; <go to "fall through" code>
+
+
+
+
+Example 2 (PageMaker):
+
+680x0 code:
+
+	movel	#0,d2
+	moveb	d0,d2
+	lslw	#8,d0
+	orw	d0,d2
+	movel	d2,d0
+	swap	d2
+	orl	d2,d0
+	movel	a0,d2
+	lsrb	#1,d2
+	bcc	0x3fffed4
+
+80x86 code:
+
+	xorl	%ebx,%ebx		; movel #0,d2
+	movl	_d0,%edx		; moveb d0,d2
+	movb	%dl,%bl
+	shlw	$0x8,%dx		; lslw #8,d0
+	orw	%dx,%bx			; orw d0,d2
+	movl	%ebx,%edx		; movel d2,d0
+	rorl	$0x10,%ebx		; swap d2
+	orl	%ebx,%edx		; orl d2,d0
+	movl	_a0,%ecx		; movel a0,d2
+	movl	%ecx,%ebx
+	shrb	%bl			; lsrb #1,d2
+	movl	%ebx,_d2		; <spill dirty 68k
+	movl	%edx,_d0		;  registers back to memory>
+	jae	0x3b734c		; bcc 0x3fffed4
+	jmp	0x43d48c		; <go to "fall through" 68k code>
+
+
+XII.  Benchmarks
+
+These performance numbers were computed with Speedometer 3.23.
+We've removed the floating point tests from the list since they do
+not measure syn68k's speed.  Syn68k contains no special provisions
+for Speedometer's benchmarks and we believe that these numbers are
+indicative of syn68k's performance for many other CPU-intensive
+programs.
+
+
+		Quadra	Pentium	486DX4	486DX/2
+		   610	 90MHz	 75MHz	 66MHz
+		------	------	------	------
+CPU		16.018	28.833	15.727	13.840
+
+Dhrystones	19.586	21.886	12.084	 9.424
+Tower		18.909	27.130	12.235	11.556
+Quicksort	17.759	27.105	15.606	13.919
+Bubble sort	18.409	31.154	19.286	16.875
+Queens		19.083	38.167	19.083	18.320
+Puzzle		22.083	44.167	23.661	21.032
+Permutations	21.019	28.564	11.604	12.242
+Int. Matrix	24.200	26.469	19.369	16.608
+Sieve		23.362	60.290	33.982	30.145
+		------	------	------	------
+Average		20.490	33.881	18.582	16.680
+
+
+
+Preliminary analysis suggests that we average a roughly 3:1
+instruction count increase when translating to 80x86 code.  We have
+not yet taken rigorous measurements, but the 3:1 figure is lent
+some credence by the fact that our 75MHz 486DX4 gets nearly the
+same performance as our Quadra 610, which is clocked at 25MHz, one
+third the 486DX4's rate.  We believe that inter-block register
+allocation will noticeably improve this ratio.