
{\rtf1\mac\deff2 {\fonttbl{\f0\fswiss Chicago;}{\f2\froman New York;}{\f3\fswiss Geneva;}{\f4\fmodern Monaco;}{\f16\fnil Palatino;}{\f20\froman Times;}{\f21\fswiss Helvetica;}{\f22\fmodern Courier;}{\f23\ftech Symbol;}}{\colortbl\red0\green0\blue0;
\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;}{\stylesheet{\s243\tqc\tx4320\tqr\tx8640 \f16\fs20 \sbasedon0\snext243 footer;}{\s245
\f16\fs18\up6 \sbasedon0\snext0 footnote reference;}{\s246 \f16\fs20 \sbasedon0\snext246 footnote text;}{\s250\li720 \f16\fs20\ul \sbasedon0\snext0 heading 6;}{\s251\li720 \b\f16\fs20 \sbasedon0\snext0 heading 5;}{\s252\li360 \f16\ul \sbasedon0\snext0
heading 4;}{\s253\li360\sb120 \b\f16 \sbasedon0\snext0 heading 3;}{\s254\sb120 \b\f21 \sbasedon0\snext0 heading 2;}{\s255\sb240 \b\f21\ul \sbasedon0\snext0 heading 1;}{\f16\fs20 \sbasedon222\snext0 Normal;}}{\info{\title Executor Internals}{\subject
How to efficiently run Mac Programs on PCs}{\author ctm + mat}}\margl720\margr720\margt-1440\margb-1440\gutter1080\widowctrl\ftnbj\fracwidth\margmirror \sectd \sbknone\linemod0\linex0\cols1\endnhere {\footer \pard\plain \s243\tx4320\tqr\tx8640 \f16\fs20
\tab Executor Internals\tab \chpgn \par
}\pard\plain \qc \f16\fs20 {\fs36 Executor Internals:\par
How to Efficiently Run Mac Programs on PCs\par
\par
}{\plain \f16 Mathew J. Hostetter }{\plain \f22 <mat@ardi.com>}{\plain \f16 \par
Clifford T. Matthews }{\plain \f22 <ctm@ardi.com>\par
}{\fs18 After MacHack '96, this paper will be available from}{\f22\fs18  http://www.ardi.com}{\fs18 \par
}{\plain \f16 \par
}\pard Executor is a commercial Macintosh emulator that uses no software from Apple, but is still able to run much 680x0 based Macintosh software faster on Pentiums than the same software runs on 680x0
 based Macs.  This paper contains some implementation details, including descriptions of Executor's synthetic CPU, graphics subsystem and debugging environment.  Portability issues, current limitations and future plans are also presented.{\fs36 \par
}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \s255\sb240 \b\f21\ul Executor Overview\par
\pard\plain \s254\sb120 \b\f21 What Executor is\par
\pard\plain \f16\fs20 Executor is a commercial emulator that allows PCs to run many Macintosh applications.  Executor does not require Macintosh ROMs or a Macintosh System file and contains no Apple code itself.  Executor was written entirely by engineers without Macintosh backgrounds who have not disassembled any of Apple's ROMs or System file.\par
\pard\plain \s254\sb120 \b\f21 Limitations\par
\pard\plain \f16\fs20
Because Executor was written strictly from publicly available documentation (Inside Macintosh, Tech. Notes, etc.), programs which make use of undocumented features of MacOS may fail under Executor.  Furthermore,  there are some portions of MacOS that we ha
ven't implemented yet.  Executor is sufficiently large that there are probably bugs in some of o
ur code as well.  We realize these are major limitations, but this paper is primarily concerned with implementation details that are interesting to our fellow programmers as opposed to feature sets and limitations which are of more concern to end users and
 our marketing department.\par
\pard\plain \s255\sb240 \b\f21\ul Design Goals\par
\pard\plain \f16\fs20 Our goal is for Executor to be accurate, fast and portable.  Beyond that, completeness is a secondary issue.\par
\par
Accuracy means that each subsystem that we implement should behave exactly according to the functional specs for the subsystem that we've derived from a combination of reading documentation, writing test cases and running programs under Executor.\par
\par

Speed is harder to quantify.  As programmers we like to use advanced techniques that will result in programs running under Executor as quickly as possible.  Unfortunately, we have a limited number of engineer hours in a week and most engineering time is spent implementing new subsystems or finding and fixing subtle incompatibilities.  We're proud of the speed that we've obtained so far, but we know that we can do better in the future.\par
\par
Portability is the ability to support multiple platforms from the same source base.  A platform is a combination of CPU, operating system and graphics device or windowing system.  Executor currently supports Intel 80[3456]86 and compatible CPUs and Motorola m680[34]0 CPUs, runs under DOS, Linux and NEXTSTEP, and can interact with VGA, SVGA, Display PostScript and the X Window System.  To get the best performance on some architectures we do use architecture-specific code, but we also write portable versions to be used where the platform-specific versions can't be.  Although not supported as a product, Executor was ported to DEC's Alpha, but since ARDI has no Alpha and DEC lost interest, the Alpha port is no longer current.  ROMlib, ARDI's rewrite of the MacOS OS and Toolbox routines, has also been ported (although not recently) to a wide variety of platforms, including MIPS, m88k, Clipper, IBM RT, SPARC and even VAX based systems.\par
\par

Those three design goals have led us in the direction of dynamic code generation for both the 680x0 emulation and for our blitter.  In both cases we use high level descriptions of what we want accomplished and then use special purpose tools at compile time
 to translate these high level descriptions into constructs that we can then use at run time.\par
\par
High level descriptions are less error p
rone, allowing us to document the semantics that we wish to see in our synthetic CPU or blitter using a special purpose language that is directly suited to the task at hand, rather than a general purpose language like C or the traditional language of speed
 freaks -- assembler.\par
\par
High level descriptions also lend themselves to portability.  We have our tools generate portable constructs for the general case and, with a little more programming effort, faster architecture specific constructs for the architec
tures that we consider most important.\par
\par
Since the conversion from high level description to useful construct takes place at compile time, there is no need to worry about the CPU cycles spent doing the mapping.  This allows us to design our code by thinking:  "At {\i runtime}
, what would be the optimal instruction sequence to perform a specific task?"  Once we know the answer to that question we can ask:  "How can we represent, at a high level, the task that is being accomplished by that optimal set of instructions?".
 Then, the final question is "Given what we want to generate and how we want to represent it, what does the compile time mapping look like?".   The entire time we're pondering those three questions, we're keeping accuracy, portability and efficiency in min
d.\par
\pard\plain \s255\sb240 \b\f21\ul Executor Subsystems\par
\pard\plain \s254\sb120 \b\f21 Synthetic CPU\par
\pard\plain \s253\li360\sb120 \b\f16 Overview\par
\pard\plain \f16\fs20 Syn68k is the name of the synthetic CPU that Executor 2 uses.  Syn68k is both highly portable and fast.  The portable core of Syn68k, which works by dynamically compiling 680x0 code into an efficient interpreted form, was designed to run on all major CPUs.  On supported architectures, Syn68k can also translate 680x0 code into native code that the host processor can run directly.\par
\pard\plain \s253\li360\sb120 \b\f16 Syngen\par
\pard\plain \f16\fs20 Syngen analyzes a Lisp-like file describing the bit patterns and semantics of the 680x0 instruction set and produces lookup tables and C code for the runtime system to use
.  The code and tables generated by syngen depend somewhat on the characteristics of the host processor; for example, on a little endian machine it is advantageous to byte swap some extracted 680x0 operands at translation time instead of at runtime.\par
\par
The 680x0 description file can describe multiple ways to emulate any particular 680x0 opcode.  The runtime system looks at what CC bits are live after the instruction and chooses the fastest variant it can legally use.  In Figure 1
, we have two CC variants of lsrw; one computes no CC bits, and the other computes all of them.\par
\par
The 680x0 description file can also specify which 680x0 operands should be "expanded" to become implicitly known by the corresponding synthetic opcode.  For example, fully expanding out "addl dx,dy" would result in 64 synthetic opcodes, one for each combination of data register operands.  This results in smaller and faster synthetic opcodes at the expense of increasing the total number of synthetic opcodes.  To conserve space, we only "expand out" common 680x0 opcodes. On host architectures where we can compile to native code, we don't waste space by "expanding out" common synthetic opcodes.\par
\pard\plain \s253\li360\sb120 \b\f16 Interpreted Code\par
\pard\plain \f16\fs20 Our interpreted code consists of contiguous sequences of "synthetic opcodes" and their operands.  Syngen can generate ANSI C, but when \sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs
\phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 (defopcode lsrw_ea\par
  (list 68000 amode_alterable_memory () (list "1110001011mmmmmm"))\par
  (list "-----" "-----" dont_expand\par
\tab (assign $1.muw (>> $1.muw 1)))\par
  (list "CN0XZ" "-----" dont_expand\par
\tab (list\par
\tab  (assign ccx (assign ccc (& $1.muw 1)))\par
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\f22 \tab  (ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1))))))\par
\par
}{\plain \f16 Figure 1.  Syn68k description of lsrw}{\f22 \par
}\pard \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 compiled with GCC it uses C language extensions that make synthetic opcodes be
pointers to the C code responsible for interpreting that opcode.  This "threaded interpreting" entirely eliminates switch dispatch and loop overhead.\par
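The threaded dispatch described above can be sketched in a few lines of GCC C.  This is only an illustration under stated assumptions: the three-opcode machine, the names run_threaded and threaded_sum, and the stack layout are all hypothetical and are not taken from Syn68k.  The only real feature it relies on is GCC's "labels as values" extension.\par

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-opcode machine, used only for illustration. */
enum { OP_PUSH, OP_ADD, OP_HALT, OP_COUNT };

/*
 * Executes threaded code.  Each opcode slot holds the address of its
 * handler, so dispatch is one indirect jump with no switch and no loop.
 * When code is NULL the handler addresses are copied to table_out instead,
 * because label addresses are only visible inside this function.
 * Requires GCC's "labels as values" extension.
 */
static long run_threaded(void **code, void **table_out)
{
    static void *handlers[OP_COUNT] = { &&do_push, &&do_add, &&do_halt };
    long stack[16];
    int sp = 0;
    void **pc = code;

    if (code == NULL) {
        for (int i = 0; i < OP_COUNT; i++)
            table_out[i] = handlers[i];
        return 0;
    }

    goto *(*pc++);                          /* dispatch the first opcode */

do_push:
    stack[sp++] = (long)(intptr_t)*pc++;    /* operand follows the opcode */
    goto *(*pc++);
do_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto *(*pc++);
do_halt:
    return stack[sp - 1];
}

/* Builds and runs the program "push a; push b; add; halt". */
static long threaded_sum(long a, long b)
{
    void *t[OP_COUNT];
    run_threaded(NULL, t);

    void *code[] = {
        t[OP_PUSH], (void *)(intptr_t)a,
        t[OP_PUSH], (void *)(intptr_t)b,
        t[OP_ADD],
        t[OP_HALT],
    };
    return run_threaded(code, NULL);
}
```

Each handler ends by jumping straight to the next handler, which is why no central dispatch loop appears anywhere in the sketch.\par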
\pard\plain \s253\li360\sb120 \b\f16 Native Code\par
\pard\plain \f16\fs20 For the 80x86 architecture, Syn68k supports an optional architecture-specific native code extension that
 tries to generate native code whenever possible.  In those rare cases when it cannot, it reverts to our interpreted code.  Since Syn68k
 supports both native and synthetic code, the runtime system automatically inserts gateways between the two whenever there is a transition. \par
\par
Three major problems make translating 680x0 code to 80x86 code difficult:\par
\par
\pard \fi-180\li540 \bullet  The 80x86 has only 8 registers, while the 680x0 has 16.\par
\par
\bullet  The 80x86 is little endian, while the 680x0 is big endian.\par
\par
\bullet  The 80x86 does not have general-purpose postincrement and predecrement addressing modes, which are used frequently in 680x0 code.\par
\pard \par
On the other hand, several factors make the job easier than it would be for RISC machines:\par
\par
\pard \fi-180\li540 \bullet  The 80x86 has all of the CISC addressing modes commonly used in 680x0 code.\par
\par
\bullet  The 80x86 has CC bits that map directly to their 680x0 counterparts (except for the 680x0's X bit).\par
\par
\bullet  The 80x86 supports 8-, 16- and 32-bit operations (although it can only support 8-bit operations on four of its registers).\par
\par
\bullet  The 80x86 and 680x0 have analogous conditional branch instructions.\par
\par
\bullet  The 80x86 allows unaligned memory accesses without substantial overhead.\par
\pard \par

The toughest problem is the lack of registers.  On 32-register RISC architectures it's easy to allocate one RISC register for each 680x0 register, but on the 80x86 a different approach is needed. The obvious solution is to perform full-blown inter-block re
gister allocation, but we fear that using traditional compiler techniques would be unacceptably slow.\par
\par
For now, we have adopted a simple constraint: between basic blocks, all registers and live CC bits must reside in their canonical home in memory.  Within a block, anything goes.  So what liberties does Syn68k take within a block?\par
\par

The 80x86 register set is treated as a cache for recently used 680x0 registers, and the 80x86 CC bits are used as a cache for the 680x0 CC bits.  At any particular point within a block, each 680x0 register is either sitting in its memory home or is cached
in an 80x86 register, and each live 680x0 CC bit
 is either cached in its 80x86 equivalent or stored in its memory home.  Cached registers may be in canonical form, may be byte swapped, may have only their low two bytes swapped, or may be offset by a known constant from their actual value.\par
\par
Each 680x0 instruction can require that 680x0 registers be cached in particular ways.  For example, {\f22 movel d0, mem} requires d0 to be cached in big endian byte order.  T
he compilation engine generates the minimal code needed to satisfy those constraints and then calls a sequence of routines to generate the native code.  As each 680x0 instruction is
processed, each 680x0 register's cache status is updated.  Dirty registers are canonicalized and spilled back to memory at the end of each block (or when we run out of 80x86 registers and we need to make room).\par
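One way to picture the per-register cache status is a small state record that tracks which byte order the host register currently holds.  The sketch below is an assumption-laden model, not Syn68k's data structure: the names cache_form, cached_reg and require_form are hypothetical, and the conversion always goes through canonical form for simplicity rather than emitting a provably minimal sequence.\par

```c
#include <stdint.h>

/* Hypothetical names for the cached forms listed in the text. */
enum cache_form {
    FORM_CANONICAL,   /* host byte order, usable for arithmetic */
    FORM_SWAP32,      /* all four bytes reversed */
    FORM_SWAP16       /* only the low two bytes exchanged */
};

struct cached_reg {
    uint32_t value;          /* bits currently held in the host register */
    enum cache_form form;    /* how those bits relate to the 68k value */
};

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint32_t swap_low16(uint32_t v)
{
    return (v & 0xffff0000u) | ((v & 0xffu) << 8) | ((v >> 8) & 0xffu);
}

/*
 * Brings a cached register into the form an instruction requires and
 * returns how many swap instructions a code generator would have emitted.
 * Zero is returned when the register is already in the wanted form, which
 * is the case lazy byte swapping is meant to exploit.
 */
static int require_form(struct cached_reg *r, enum cache_form want)
{
    int swaps = 0;

    if (r->form == want)
        return 0;

    /* Step 1: undo the current form to reach canonical order. */
    if (r->form == FORM_SWAP32) {
        r->value = swap32(r->value);
        swaps++;
    } else if (r->form == FORM_SWAP16) {
        r->value = swap_low16(r->value);
        swaps++;
    }

    /* Step 2: apply the wanted form, if it is not canonical. */
    if (want == FORM_SWAP32) {
        r->value = swap32(r->value);
        swaps++;
    } else if (want == FORM_SWAP16) {
        r->value = swap_low16(r->value);
        swaps++;
    }

    r->form = want;
    return swaps;
}
```

A register loaded from big endian memory and then stored straight back never leaves FORM_SWAP32 in this model, so no swap is counted for it.\par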
\par

We allow 680x0 registers to be cached with varying byte orders and offsets so that we can perform the optimizations of lazy byte swapping and lazy constant offsetting.  If the 680x0 program loads a register from memory and then ends up writing it out later
, we avoid unnecessary byte swaps by not canonicalizing the value immediately.  Lazy constant offsetting mitigates \par
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 \f16\fs20 {\f22 \tab pea\tab \tab 0x1\par
\tab pea\tab \tab 0x2\par
\tab pea\tab \tab 0x3\par
\tab pea\tab \tab 0x4\par
}\tab ...\par
\par
becomes this 80x86 code:\par
\par
{\f22 \tab movl\tab _a7,%edi\par
\tab movl\tab $0x01000000,-4(%edi)\tab ; "push" big-endian constant\par
\tab movl\tab $0x02000000,-8(%edi)\par
\tab movl\tab $0x03000000,-12(%edi)\par
\tab movl\tab $0x04000000,-16(%edi)\par
\tab ... <more uses of a7 may follow, and they'll use %edi>\par
\tab subl\tab $16,%edi\par
\tab movl\tab %edi,_a7\par
}\tab ...\par
\pard \qc\box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 {\plain \f16 Figure 2.  Lazy Constant Offsetting\par
}\pard \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 the overhead of postincrement and predecrement side effects.  Figure 2 is an example of lazy constant offsetting.\par
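The bookkeeping behind Figure 2 can be modelled in C as a register value plus a pending offset that is only folded in when the register is canonicalized.  This is a simplified sketch: the structure offset_reg and its functions are hypothetical names, and the adjusts counter merely stands in for the add or subtract instruction a code generator would emit.\par

```c
#include <stdint.h>

/* Hypothetical model of one 68k address register cached with an offset. */
struct offset_reg {
    uint32_t held;      /* value sitting in the host register */
    int32_t  pending;   /* true 68k value is held + pending */
    int      adjusts;   /* add/sub instructions that would have been emitted */
};

/* 68k predecrement: returns the effective address, defers the update. */
static uint32_t predecrement(struct offset_reg *r, int32_t size)
{
    r->pending -= size;
    return r->held + (uint32_t)r->pending;
}

/* 68k postincrement: uses the current address, then defers the update. */
static uint32_t postincrement(struct offset_reg *r, int32_t size)
{
    uint32_t ea = r->held + (uint32_t)r->pending;
    r->pending += size;
    return ea;
}

/* Folds the pending offset into the register with a single adjustment. */
static void canonicalize(struct offset_reg *r)
{
    if (r->pending != 0) {
        r->held += (uint32_t)r->pending;
        r->pending = 0;
        r->adjusts++;
    }
}
```

Four predecrements of 4 bytes followed by one canonicalize correspond to the four stores at -4, -8, -12 and -16 and the single subtraction of 16 in Figure 2.\par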
\par

As mentioned above, we use the 80x86 condition code bits as a cache for the real 680x0 CC bits.  Although live cached CC bits are occasionally spilled back to memory because some 80x86 instruction is about to clobber them, this trick almost always works.
Using 80x86 CC bits, we can frequently get away with extremely concise code sequences; for example, a 680x0 compare and conditional branch becomes an 80x86 compare and conditional branch.\par
\pard\plain \s253\li360\sb120 \b\f16 Self-modifying Code\par
\pard\plain \f16\fs20 Like most dynamically compiling emulators, Syn68k
 doesn't detect self-modifying code; the overhead is too high.  Fortunately, self-modifying programs don't work on the real 68040 either.  We rely on the program making explicit system calls to flush the caches whenever 680x0 code may have been modified or
 created.  Some programs (like HyperCard) flush the caches very often, which can cause real performance headaches if code is continuously recompiled.  We have solved this problem by checksumming 680x0 blocks as they are compiled and, on a flush, discarding only the compiled blocks whose checksums no longer match.  This optimization alone sped up some HyperCard stacks by a factor of three or so.\par
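A minimal sketch of the checksum idea follows.  Everything in it is an assumption made for illustration: the paper does not say which checksum Syn68k uses, so a simple rotate-and-xor is swapped in, and the names compiled_block, record_block and flush_blocks are hypothetical.\par

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kept for each compiled 680x0 block. */
struct compiled_block {
    const uint8_t *src;   /* 680x0 code the block was compiled from */
    size_t len;           /* length of that code in bytes */
    uint32_t checksum;    /* checksum recorded at compile time */
    int valid;            /* nonzero while the compiled code may be used */
};

/* Stand-in checksum (rotate left by one, then xor in the next byte). */
static uint32_t block_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ p[i];
    return sum;
}

/* Called when a block is compiled. */
static void record_block(struct compiled_block *b, const uint8_t *src,
                         size_t len)
{
    b->src = src;
    b->len = len;
    b->checksum = block_checksum(src, len);
    b->valid = 1;
}

/*
 * Called when the program asks for a cache flush.  Only blocks whose
 * source bytes actually changed are invalidated; the rest are kept.
 * Returns the number of blocks invalidated.
 */
static size_t flush_blocks(struct compiled_block *blocks, size_t n)
{
    size_t dropped = 0;
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].valid &&
            block_checksum(blocks[i].src, blocks[i].len) != blocks[i].checksum) {
            blocks[i].valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

A program that flushes often but rarely changes code pays only for re-reading the source bytes, not for recompiling them.\par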
\pard\plain \s253\li360\sb120 \b\f16 Examples\par
\pard\plain \f16\fs20 Figure 3 contains two sample 680x0 code sequences from real applications, and the 80x86 code that Syn68k generates for them.  We chose these code sequences specifically to showcase several of the techniques we use, so
 you shouldn't use them as a substitute for benchmarks. Not all 680x0 code translates as well as these examples do, but these examples are far from exotic.\par
\pard\plain \s254\sb120 \b\f21 \sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 \f16\fs20 Example 1 (Solarian):\par
\par
680x0 code:\par
\par
{\f22 \tab addqb\tab #1,a4@(1)\par
\tab movel\tab #0,d0\par
\tab moveb\tab a4@,d0\par
\tab swap\tab d0\par
\tab clrw\tab d0\par
\tab swap\tab d0\par
\tab asll\tab #2,d0\par
\tab lea\tab a5@(-13462),a0\par
\tab addal\tab d0,a0\par
\tab moveal\tab a0@,a0\par
\tab movel\tab #0,d0\par
\tab moveb\tab a4@(1),d0\par
\tab cmpw\tab a0@,d0\par
\tab bcs\tab 0x3fffee2}\par
\par
\par
80x86 code:\par
\par
{\f22 \tab movl\tab _a4,%edi\tab \tab ; addqb #1,a4@(1)\par
\tab addb\tab $0x1,0x1(%edi)\par
\tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d0\par
\tab movb\tab (%edi),%bl\tab \tab ; moveb a4@,d0\par
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d0\par
\tab xorw\tab %bx,%bx\tab \tab ; clrw d0\par
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d0\par
\tab shll\tab $0x2,%ebx\tab \tab ; asll #2,d0\par
\tab movl\tab _a5,%esi\tab \tab ; lea a5@(-13462),a0\par
\tab leal\tab 0xffffcb6a(%esi),%edx\par
\tab addl\tab %ebx,%edx\tab \tab ; addal d0,a0\par
\tab movl\tab (%edx),%edx\tab \tab ; moveal a0@,a0\par
\tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d0\par
\tab movb\tab 0x1(%edi),%bl\tab ; moveb a4@(1),d0\par
\tab bswap\tab %edx\tab \tab \tab ; cmpw a0@,d0\par
\tab movw\tab (%edx),%cx\par
\tab rorw\tab $0x8,%cx\par
\tab cmpw\tab %cx,%bx\par
\tab movl\tab %edx,_a0\tab \tab ; <spill dirty 68k\par
\tab movl\tab %ebx,_d0\tab \tab ;  registers back to memory>\par
\tab jb\tab 0x6fae0c\tab \tab ; bcs 0x3fffee2\par
\tab jmp\tab 0x6faf0c\tab \tab ; <go to "fall through" code>}\par
\pard \box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 \page Example 2 (PageMaker):\par
\par
680x0 code:\par
\par
{\f22 \tab movel\tab #0,d2\par
\tab moveb\tab d0,d2\par
\tab lslw\tab #8,d0\par
\tab orw\tab d0,d2\par
\tab movel\tab d2,d0\par
\tab swap\tab d2\par
\tab orl\tab d2,d0\par
\tab movel\tab a0,d2\par
\tab lsrb\tab #1,d2\par
\tab bcc\tab 0x3fffed4}\par
\par
80x86 code:\par
\par
{\f22 \tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d2\par
\tab movl\tab _d0,%edx\tab \tab ; moveb d0,d2\par
\tab movb\tab %dl,%bl\par
\tab shlw\tab $0x8,%dx\tab \tab ; lslw #8,d0\par
\tab orw\tab %dx,%bx\tab \tab ; orw d0,d2\par
\tab movl\tab %ebx,%edx\tab \tab ; movel d2,d0\par
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d2\par
\tab orl\tab %ebx,%edx\tab \tab ; orl d2,d0\par
\tab movl\tab _a0,%ecx\tab \tab ; movel a0,d2\par
\tab movl\tab %ecx,%ebx\par
\tab shrb\tab %bl\tab \tab \tab ; lsrb #1,d2\par
\tab movl\tab %ebx,_d2\tab \tab ; <spill dirty 68k\par
\tab movl\tab %edx,_d0\tab \tab ;  registers back to memory>\par
\tab jae\tab 0x3b734c\tab \tab ; bcc 0x3fffed4\par
\tab jmp\tab 0x43d48c\tab \tab ; <go to "fall through" 68k code>\par
\par
}\pard \qc\box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 {\plain \f16 Figure 3.  680x0 -> 80x86 examples\par
}\pard\plain \s254\sb120 \b\f21 \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \s254\sb120 \b\f21 Graphics\par
\pard\plain \s253\li360\sb120 \b\f16 SVGA Graphics\par
\pard\plain \f16\fs20 The DOS world is one of standards.  {\i Many}
 standards.  Standards made by engineers who were even more short-sighted than the folks who brought you ROM85, only to be replaced by SysEnvirons which was then replaced by Gestalt.  The first color graphics adapter for the PC (CGA) was replaced with EGA,
 which was then replaced by VGA, which eventually gave way to several different Super Video Graphics Array (SVGA) cards.\par
\par

SVGA cards have a couple of properties that make them less than perfect targets for the output of Macintosh emulators.  First, the default is for SVGA's video memory to only be mapped into the PC address space through a 64k window (or bank).  If you want t
o display 640x480x8 bits you need to write 64k of information to the 64k screen address range, then tell the video card that you want that same address to
represent a different 64k chunk of the screen, then you write to that address range again, then you switch banks again, and so forth.\par
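The arithmetic behind bank switching is easy to sketch.  The code below assumes 8 bits per pixel and a 64k window, as in the example above; the names bank_addr, locate_pixel and banks_needed are hypothetical, and the actual bank-select call to the video card is deliberately left out.\par

```c
#include <stdint.h>

#define BANK_SIZE 65536u   /* size of the window into video memory */

/* Hypothetical result type: which bank, and where inside the window. */
struct bank_addr {
    uint32_t bank;
    uint32_t offset;
};

/* Maps an 8-bit-per-pixel coordinate to a bank and an offset within it. */
static struct bank_addr locate_pixel(uint32_t x, uint32_t y,
                                     uint32_t row_bytes)
{
    uint32_t linear = y * row_bytes + x;   /* one byte per pixel */
    struct bank_addr a = { linear / BANK_SIZE, linear % BANK_SIZE };
    return a;
}

/* Number of 64k banks needed to cover a width x height 8 bpp screen. */
static uint32_t banks_needed(uint32_t width, uint32_t height)
{
    uint32_t bytes = width * height;
    return (bytes + BANK_SIZE - 1) / BANK_SIZE;
}
```

At 640x480x8 this gives five banks, and the first bank boundary falls in the middle of row 102 (between x = 255 and x = 256), so a blitter cannot assume that a scanline lies within a single bank.\par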
\par
The second major complication is that under DPMI, the address space that contains the SVGA video memory is not in the same address space
 that a 32-bit application uses.  For those of you used to programming in a flat address space, it might be hard to believe that you need special machine language address space overriding prefixes to access screen memory, but
under DPMI 0.9 (which is the version of DPMI that Microsoft supports; we wouldn't have to do this under 1.0) "selector" overrides really are necessary.\par
\pard\plain \s253\li360\sb120 \b\f16 \page Blitter Overview\par
\pard\plain \f16\fs20
A Region is a data structure that describes a set of pixels.  Regions can be created by the application by calling various MacOS toolbox routines.  In addition, the toolbox routines themselves sometimes create Regions for their own purposes.\par
\par
A blitter is a set of software or hardware which takes sets of bits, representing pixels,
 and combines them with other sets of bits in a variety of different ways.  A Region blitter is a blitter that processes pixels by Regions (rather than by rectangles or rectangle lists).\par
\pard\plain \s253\li360\sb120 \b\f16 A Simple Blitter \par
\pard\plain \f16\fs20 One way to write a simple Region blitter is to start with a s
ubroutine that parses the start/stop pairs of a Region scanline and draws the corresponding pixels.  This subroutine is then called once for each row of pixels to be displayed.\par
\par

Unfortunately, this approach is slow since each scanline gets re-parsed every time it is drawn.  The Region for a 300 pixel tall rectangle consists of a single scanline with a repeat count of "300"; this "simple Region blitter" will parse that scanline 300
 times! That's a lot of redundant work.\par
\par
There are many possible ways to get a
way with parsing each scanline only once.  One approach is to convert the start/stop pairs into a bit mask where the bits in the mask correspond to the bits in the target bitmap that are to be changed.  The inner blitting loop then becomes an exercise in b
itwise arithmetic.  In C, such a loop might look something like this:{\f22 \par
\par
for (x = left; x < right; x++)\par
    dst[x] = (dst[x] & ~mask[x]) \par
\tab | (pattern_value & mask[x]);\par
\par
}That's not bad, but we can do better.\par
\pard\plain \s253\li360\sb120 \b\f16 A Dynamically Recompiling Blitter \par
\pard\plain \f16\fs20 Using an explicit bit mask array is unnecessarily slow in the comm
on case of filling a rectangle.  For a rectangular Region, mask[x] is usually all one bits, making the bit munging a waste of time.  And even when the masks are never solid (e.g. when drawing a thin vertical line), this technique is still unnecessarily slo
w.  As it turns out, even the cycles the CPU spends loading mask bits from memory are unnecessary.  Furthermore, even if we were satisfied with the level of performance that C code like the above provides, we couldn't use it on a stock SVGA system because
it wouldn't know how to access the SVGA portion of memory.\par
\par

Executor's blitter uses the techniques of partial evaluation and dynamic code generation to eliminate redundant work and also give us access to SVGA memory.  On the 80x86 each scanline is quickly translated into executable code, and that code gets executed
 once each time the scanline needs to be drawn.  On non-80x86 platforms, each scanline is compiled into threaded code which is executed by a machine-generated interpreter to draw the scanlines.\par
\par

Before describing how the dynamic compilation process works, let's take a look at an example.  Consider the case where a 401x300 rectangle is to be filled with white pixels (pixel value zero on the Macintosh).  This might happen, for example, when erasing
 a window. Furthermore, let's assume that the target bitmap has four bits per pixel, since that's somewhat trickier to handle than 8 bits per pixel.  Figure 4 shows the subroutine that Executor dynamically generates to draw this rectangle on a Pentium.
\par
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 loop:\tab andl\tab $0xff,0x50(%edi)\tab \tab ; clear leftmost 6 boundary pixels\par
\tab addl\tab $0x54,%edi\tab \tab \tab ; set up pointer for loop\par
\tab movl\tab $0x31,%ecx\tab \tab \tab ; set up loop counter\par
\tab rep\par
\tab stosl\tab \tab \tab \tab \tab ; slam out 49 aligned longs\par
\tab andl\tab $0xffff0f00,0x0(%edi)\tab ; clear 3 right boundary pixels\par
\tab addl\tab $0x28,%edi\tab \tab \tab ; move to next row\par
\tab decl\tab %edx\tab \tab \tab \tab ; decrement # of rows left\par
\tab jne\tab loop\tab \tab \tab \tab ; continue looping if appropriate\par
\tab ret\tab \tab \tab \tab \tab ; we're done!\par
\par
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\plain \f16 Figure 4.  Dynamically generated blitting code}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 \par
\page \page This code, when called with the proper values in its input registers, will draw the entire rectangle.  Note how the inner loop is merely a 
"{\f22 rep ; stosl}
"...it doesn't get much more concise than that!  The astute reader will know that on certain 80x86 processors "rep ; stosl" is not the fastest possible way to set a range of memory.  This is true, but because our code generation is dynamic, in the future w
e can tailor the specific code sequence generated to the processor on which Executor is currently running.  The blitter already does this when it needs to emit a byte swap; on the 80486 and up we use the "bswap" instruction, and on th
e 80386 (which doesn't support "bswap") we use a sequence of rotates.\par
\par
One thing you may notice in this example is that the bit masks used to clear the boundary pixels look strange.  They are actually correct, since 80x86 processors are little endian.\par
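The two boundary masks in Figure 4 can be reproduced with a short calculation.  The sketch assumes 4 bits per pixel with the leftmost pixel in the most significant nibble of the first byte in memory; the function name le_keep_mask_4bpp is hypothetical.\par

```c
#include <stdint.h>

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/*
 * Returns the AND mask that clears n_px four-bit pixels, starting at pixel
 * first_px (0 is the leftmost pixel of the 8-pixel long), as seen by a
 * little endian host that loads the long with a 32-bit access.
 */
static uint32_t le_keep_mask_4bpp(int first_px, int n_px)
{
    uint32_t clear = 0;

    /* Build the mask in big endian (memory) order first. */
    for (int i = first_px; i < first_px + n_px; i++)
        clear |= 0xfu << (28 - 4 * i);

    /* Keep everything else, then convert to the host's view of the long. */
    return swap32(~clear);
}
```

Clearing the last six pixels of a long yields 0x000000ff and clearing the first three yields 0xffff0f00, which are the two constants in Figure 4; together with 49 full longs of 8 pixels each, that accounts for all 6 + 392 + 3 = 401 pixels of the row.\par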
\par
Unlike some processors, such as the 68040, the 80x86 instruction and data caches are always coherent.  Consequently, no cache flushes need to be performed before the dynamically created code can be executed.\par
\par
Figure 5 contains another example, this time drawn from a real application.  The program "Globe", by Paul Mercer, draws a spinning globe on the screen as fast as it can.  Each "globe frame" is a 128x128 Pixmap.  Figure 5 shows the code that Executor generates and runs when Globe uses CopyBits to transfer one frame to the screen at 8 bits per pixel.\par
\par
Again the inner loop is very tight, just a "rep ; movsl" this time.\par
\pard\plain \s253\li360\sb120 \b\f16 \page Meta-Assembler \par
\pard\plain \f16\fs20
No matter how fast the generated code, if Executor spends too much time generating that code then any speedup will be negated by the increased time required for dynamic compilation.  Consequently, the dynamic compilation from Region to 80x86 code needs to
be fast. We solved this problem with a "meta-assembler" written in Perl.\par
\par
Whereas an assembler tells a computer how to translate assembly instructions into machine code, our meta-assembler tells the computer how to generate tiny translators.  These translators will then be used to translate pixel manipulation requests into
machine code.  Another way of looking at it is that the meta-assembler generates code that generates code.  This meta-assembly process is done only once:  when Executor is compiled.\par
\par
The blitter operates on aligned longs in the destination bitmap.  As the compilation engine strides through the scanline's start/stop pairs from left to right, it identifies which bits in each long are part of the Region and determines which of several pi
xel manipulation requests to issue to the tiny translators that were created by the meta-assembler.\par
\par
\pard \fi-180\li540 \bullet  Some but not all bits in the current long are in the Region.\par
\par
\bullet  All bits in the current long are in the Region.\par
\par
\bullet  All bits in this long and the next long are in the Region.\par
\par
\bullet  All bits in this long and the next two longs are in the Region.\par
\par
\bullet  All bits in this long and the next three longs are in the Region.\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 loop:\tab movl   $0x20,%ecx\tab \tab
; set up loop counter for 32 longs\par
\tab rep\par
\tab movsl\tab \tab \tab \tab ; copy one row (128 bytes)\par
\tab addl   $0xffffff00,%esi\tab ; advance to previous src row\par
\tab addl   $0xfffffd00,%edi\tab ; advance to previous dst row\par
\tab decl   %edx\tab \tab \tab ; decrement # of rows remaining\par
\tab jne    loop\par
\tab ret\par
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\plain \f16 Figure 5.  Blitting code from Globe}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \fi-180\li360 \f16\fs20 \page \page \bullet
 More than four contiguous longs are completely in the Region, and the number of longs equals 0 mod 4.\par
\pard \fi-180\li540 \par
\bullet  More than four contiguous longs are completely in the Region, and the number of longs equals 1 mod 4.\par
\par
\bullet  More than four contiguous longs are completely in the Region, and the number of longs equals 2 mod 4.\par
\par
\bullet  More than four contiguous longs are completely in the Region, and the number of longs equals 3 mod 4.\par
\pard \par
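The nine cases listed above amount to a small classification of each run of longs.  The sketch below is one plausible encoding, not Executor's: the enumeration blit_case and the function classify_full_run are hypothetical names, and the partially covered long is represented only by a placeholder value.\par

```c
/* Hypothetical encoding of the nine pixel manipulation requests. */
enum blit_case {
    CASE_PARTIAL,       /* some but not all bits of one long */
    CASE_FULL_1,        /* exactly one full long */
    CASE_FULL_2,        /* exactly two full longs */
    CASE_FULL_3,        /* exactly three full longs */
    CASE_FULL_4,        /* exactly four full longs */
    CASE_MANY_MOD_0,    /* more than four longs, count equals 0 mod 4 */
    CASE_MANY_MOD_1,    /* more than four longs, count equals 1 mod 4 */
    CASE_MANY_MOD_2,    /* more than four longs, count equals 2 mod 4 */
    CASE_MANY_MOD_3     /* more than four longs, count equals 3 mod 4 */
};

/*
 * Classifies a run of n_longs contiguous longs that lie entirely inside
 * the Region.  A count of zero means the caller is looking at a partially
 * covered long, which is handled with a mask instead.
 */
static enum blit_case classify_full_run(unsigned n_longs)
{
    if (n_longs == 0)
        return CASE_PARTIAL;
    if (n_longs <= 4)
        return (enum blit_case)(CASE_FULL_1 + (n_longs - 1));
    return (enum blit_case)(CASE_MANY_MOD_0 + n_longs % 4);
}
```

The resulting value can then index a table of code generation functions, one table per drawing mode.\par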
The particular case encountered determines which function pointer to load from a lookup table corresponding to the current drawing mode. For example, the "patCopy" drawing mode has on
e table of function pointers, "patXor" another.  There are also some special case tables for drawing patterns that are either all zero bits or all one bits.\par
\par
The main blitter doesn't care what drawing mode is being used, since it does all mode-specific work through the supplied function pointer table.\par
\par
Each function pointer points to a function that generates 80x86 code for the appropriate case.  For example, one function generates code for a "patCopy" to three contiguous longs, one generates code for "patX
or" only to certain specified bits within one long, etc.\par
\par

The blitter compilation engine marches through the Region scanline from left to right, calling code generation functions as it goes.  The generated code is accrued into a 32-byte aligned buffer on the stack. In this way, the blitter constructs a subroutine
 to draw the Region.\par
\par
The compilation engine isn't very complicated.  The tricky part is the numerous generation subroutines, which need to be fast since they are called so often and need to be easy to write since there are so many of them. For each drawing mode there's one for each case the compilation engine cares about.  For pattern drawing modes, there are separate specialized subroutines for patterns that can be entirely expressed in one 32-bit value ("short/narrow" patterns), patterns which can be expressed as one 32-bit value for each row, but which vary per row ("tall/narrow"), as well as "wide" variants of both. Beyond that, there are some versions specialized for 80486 and higher processors (which have the "bswap" instruction).\par
\par
Generating fast and robust code generators is where the Perl meta-assembler comes into play.\par
\par
The meta-assembler takes an assembly language template as input and produces Pentium-scheduled assembly code as output.  That output code, when it is later run, writes out the 80x86 binary for the input template.  The meta-assembler itself runs only when Executor is compiled.  Got it?  This can be a little confusing, so a few examples are in order.\par
\par
Here is perhaps the simplest template:\par
\par
{\f22   @meta copy_short_narrow_1\par
\tab movl\tab %eax,@param_offset@(%edi)\par
  @endmeta\par
}\par
This template describes what should be done when the blitter wants to write one long to memory.  The meta-assembler turns it into the following 80x86 assembly code, which is called by the blitter compilation engine:\par
\par
{\f22 \tab .align\tab 4,0x90\par
_xdblt_copy_short_narrow_1:\par
\tab movw\tab $0x8789,(%edi)\par
\tab movl\tab %eax,2(%edi)\par
\tab addl\tab $6,%edi\par
\tab ret\par
}\par
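The subroutine that the meta-assembler has produced above, when executed, generates the movl instruction from the template: two fixed bytes followed by the 32-bit offset operand, which is in %eax when the subroutine is called.  The meta-assembler has deduced that the "movl" in the example template is encoded as the bytes 0x89 0x87, which the generated code stores as the single little-endian 16-bit value 0x8789.\par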
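\par
For readers who prefer C to assembly, here is a rough C equivalent of what the generated subroutine above does.  It is a sketch only: the function name and signature are made up for illustration, and Executor's real generators are assembly, not C.\par
\par
```c
#include <assert.h>
#include <stdint.h>

/* Append the 6-byte encoding of "movl %eax,offset(%edi)" to the code
   buffer and return the new end of the buffer. */
static uint8_t *gen_copy_short_narrow_1(uint8_t *out, uint32_t offset)
{
    assert(out != 0);
    out[0] = 0x89;                      /* opcode: MOV r/m32, r32 */
    out[1] = 0x87;                      /* ModRM: %eax to disp32(%edi) */
    out[2] = (uint8_t)(offset);         /* 32-bit displacement, low byte first */
    out[3] = (uint8_t)(offset >> 8);
    out[4] = (uint8_t)(offset >> 16);
    out[5] = (uint8_t)(offset >> 24);
    return out + 6;                     /* matches "addl $6,%edi" */
}
```
\par
The assembly version does the same work in three instructions because it can store the two fixed bytes with one 16-bit write and the offset with one 32-bit write.\par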
\par
\page Let's take a look at a more complicated template.  This template handles the case where we want to bitwise OR a pattern to t
he destination bitmap, and the number of longs to transfer equals zero mod 4 (e.g. if the blitter wants to OR 36 longs to memory):\par
\par
{\f22 @meta or_short_narrow_many_mod_0\par
\tab addl\tab $@param_offset@,%edi\par
      movl\tab $@param_l_cnt_div_4@,%ecx\par
1:\tab orl\tab %eax,(%edi)\par
\tab orl\tab %eax,4(%edi)\par
\tab orl\tab %eax,8(%edi)\par
\tab orl\tab %eax,12(%edi)\par
\tab addl\tab $16,%edi\par
\tab decl\tab %ecx\par
\tab jnz\tab 1b\par
@lit\tab leal\tab (%eax,%edx,4),%ecx\par
@lit\tab addl\tab %ecx,edi_offset\par
@endmeta\par
}\par
The meta-assembler compiles that to this:\par
\par
{\f22 \tab .align\tab 4,0x90\par
_xdblt_or_short_narrow_many_mod_0:\par
\tab movw\tab $0xC781,(%edi)\par
\tab movl\tab %eax,2(%edi)\par
\tab movl\tab $0x47090709,11(%edi)\par
\tab movb\tab $0xB9,6(%edi)\par
\tab movl\tab $0x8470904,15(%edi)\par
\tab movl\tab $0x754910C7,23(%edi)\par
\tab movl\tab $0x830C4709,19(%edi)\par
\tab movb\tab $0xEF,27(%edi)\par
\tab movl\tab %edx,%ecx\par
\tab shrl\tab $2,%ecx\par
\tab movl\tab %ecx,7(%edi)\par
\tab addl\tab $28,%edi\par
\tab leal\tab (%eax,%edx,4),%ecx\par
\tab addl\tab %ecx,edi_offset\par
\tab ret\par
}\par
This mechanically generated subroutine generates the executable 80x86 binary for the "or_short_narrow_many_mod_0" template.  It gets called by the blitter compilation engine when it needs code to OR a bunch of longs to memory.  \par
\par
The output of the meta-assembler isn't meant for human consumption.  As such, the output contains a hodge-podge of magic numbers ({\f22 0x47090709, 0xB9, 0x8470904,} etc.).  These numbers are fixed machine code values corresponding to opcodes,
constant operands, and other values.\par
\par

Even though this subroutine is longer than the previous example, it still doesn't take very long to execute.  Furthermore, it only gets called when the blitter has determined that many longs are to be ORed to memory, so the time taken actually blitting to
memory will typically dwarf the time taken to execute these 15 code generation instructions.\par
\par
The meta-assembler is a Perl script that works by running numerous syntactically modified versions of the assembly template t
hrough "gas", the GNU assembler, and examining the output bytes to discover which bits are fixed opcode bits and which bits correspond to operands.  Once it has figured out what goes where, it generates 80x86 assembly code which writes out the constant byt
es and computes and writes out the operand bytes.  That code is run through a simple Pentium instruction scheduler and the meta-assembler is done.  This entire process is, of course, done only once, when Executor is compiled.\par
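\par
The "which bits are fixed and which are operands" step can be illustrated with a small C sketch.  It assumes the same template has been assembled twice with two different values substituted for one parameter, and that both assemblies came out the same length; the function name is hypothetical, and the real Perl script certainly does more than this.\par
\par
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two assemblies of one template that differ only in the value
   substituted for a single parameter.  mask[i] receives the bits of
   byte i that changed; bytes whose mask is zero are fixed.  Returns the
   number of bytes that vary with the parameter. */
static size_t find_operand_bits(const uint8_t *a, const uint8_t *b,
                                size_t len, uint8_t *mask)
{
    size_t i, varying = 0;
    assert(a != NULL && b != NULL && mask != NULL);
    for (i = 0; i < len; i++) {
        mask[i] = (uint8_t)(a[i] ^ b[i]);
        if (mask[i] != 0)
            varying++;
    }
    return varying;
}
```
\par
Probing with complementary values such as 0x11111111 and 0xEEEEEEEE makes every operand bit differ between the two assemblies, so no operand bit is mistaken for a fixed one.\par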
\pard\plain \s253\li360\sb120 \b\f16 A Portable Dynamically Recompiling Blitter\par
\pard\plain \f16\fs20 Although the meta-assembler-based blitter works only on 80x86 processors, Executor itself can run on non-Intel processors.  On other CPUs (such as the 68040 used in the NeXTstation) Executor's blitter works somewhat differently.
\par
\par

The basic idea is still the same: translate Region scanlines into an efficient form once and then use that efficient form each time the scanline gets drawn.  This time, however, the "efficient form" is processor independent, and the blitter is written enti
rely in C.\par
\par
As is the case with the 80x86-specific blitter, the portable blitter compilation
 engine examines scanline start/stop pairs and identifies which of several cases is appropriate.  One case is "output three longs", another is "output only certain pixels within the current long", and so on.\par
\par
Like the 80x86-specific blitter, the particular case encountered determines which entry in a lookup table will be used.  But there the similarity ends.  The lookup tables contain pointers to C code labels{\fs18\up6 \chftn {\footnote \pard\plain \s246
\f16\fs20 {\fs18\up6 \chftn }"What the heck is a pointer to a C code label?" you ask.  gcc (the GNU C compiler) has a "pointer to label" extension to the C language which makes the expression "{\f22 &&my_label}" evaluate to a "{\f22 void *}" that points to the compiled code for "{\f22 my_label:}" within a C function.  This, combined with gcc's "{\f22 goto void *}" extension, allows C programs to execute goto statements whose destinations are not known at compile time.}}
 rather than to routines that generate 80x86 code on the fly.\par
\par
Each scanline gets translated into an array of opcodes for the "blitter opcode interpreter" (which will be described below).  Each opcode is stored in one of these C structs:\par
\par
{\f22 struct\par
\{\par
  /* Pointer to C code to handle\par
     this opcode. */\par
  const void *label;\par
\par
  /* Offset into scanline */\par
  int32 offset;\tab \tab \par
\par
  /* Extra operand with\par
     different uses. */\par
  int32 arg;\tab \tab \par
\};\par
}\par
For example, consider the case where the blitter wants to write out five contiguous longs from a "simple" pattern starting 64 bytes into the current row.
In this case, "label" would equal "&&copy_short_narrow_many_5", "offset" would equal 64, and "arg" would equal 5.\par
\pard\plain \s253\li360\sb120 \b\f16 The Blitter Opcode Interpreter\par
\pard\plain \f16\fs20
The blitter opcode interpreter is machine generated C code created by a Perl script when Executor is compiled.  That Perl script takes as input C code snippets that tell it how to handle particular drawing modes, and produces as output C code for an interp
reter.\par
\par
Here is the template taken as input by the Perl script for the "copy_short_narrow" case.  This is the simple case where the pixels for the pattern being displayed can be stored entirely within one 32-bit long (for example, solid white or solid black).
\par
\par
{\f22 begin_mode copy_short_narrow max_unwrap\par
 repeat\tab @dst@ = v;\par
 mask\tab       @dst@ = (@dst@ & ~arg)\par
\tab \tab \tab | (v & arg);\par
end_mode\par
}\par
The "{\f22 repeat}" field tells the Perl script what C code to generate for the simple case where all pixels in the destination long are to be affected.  The "{\f22 mask}" field tells it what to do when it must modify only certain bits in the target long and leave the others alone.  "{\f22 max_unwrap}" tells the Perl script to unroll the blitting loop.\par
\par
The generated interpreter takes as input an array of blitter opcode structs, which it then proceeds to interpret once for each row to be drawn.\par
\par

Here is the section of the (machine-generated) interpreter that handles the copy_short_narrow cases.  Remember that each "blitter opcode" is really just a pointer to one of these C labels.  This code would get used when filling a rectangle with a solid col
or.\par
\par
{\f22 copy_short_narrow_mask:\par
  *dst = (*dst & ~arg) | (v & arg);\par
  JUMP_TO_NEXT;\par
copy_short_narrow_many_loop:\par
  dst += 8;\par
copy_short_narrow_many_8:\par
  dst[0] = v;\par
copy_short_narrow_many_7:\par
  dst[1] = v;\par
copy_short_narrow_many_6:\par
  dst[2] = v;\par
copy_short_narrow_many_5:\par
  dst[3] = v;\par
copy_short_narrow_many_4:\par
  dst[4] = v;\par
copy_short_narrow_many_3:\par
  dst[5] = v;\par
copy_short_narrow_many_2:\par
  dst[6] = v;\par
copy_short_narrow_many_1:\par
  dst[7] = v;\par
  if ((arg -= 8) > 0)\par
    goto copy_short_narrow_many_loop;\par
  JUMP_TO_NEXT;\par
}\par
Note how the inner blitting loop is "unwrapped" for speed.  A blitter opcode would specify that 39 longs are to be output by making its "arg" field be 39 and the "label" field point to "copy_short_narrow_many_7", in the middle of the unwrapped loop
 (39 mod 8 equals 7).  The interpreter would jump there and loop until all of the pixels had been written out, at 32 bytes per loop iteration.  This is very fast, especially for portable code.\par
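\par
The entry-point arithmetic can be checked with a portable C sketch.  It uses a switch statement that jumps into the middle of an unrolled loop as a stand-in for the label pointers; the function names are hypothetical and the index bookkeeping is an assumption about how the starting position could be adjusted, not a description of Executor's code.\par
\par
```c
#include <assert.h>
#include <stdint.h>

/* Which of the eight entry points a run of n longs (n >= 1) starts at. */
static int entry_point(long n)
{
    assert(n >= 1);
    return (int)((n - 1) % 8) + 1;
}

/* Store v into n consecutive longs starting at row[first]. */
static void fill_run(uint32_t *row, long first, long n, uint32_t v)
{
    int entry = entry_point(n);
    long i = first - (8 - entry);  /* the first store lands on row[first] */
    long arg = n;

    switch (entry) {               /* stands in for jumping to a label */
    case 8: do { row[i + 0] = v;   /* each case falls through to the next */
    case 7:      row[i + 1] = v;
    case 6:      row[i + 2] = v;
    case 5:      row[i + 3] = v;
    case 4:      row[i + 4] = v;
    case 3:      row[i + 5] = v;
    case 2:      row[i + 6] = v;
    case 1:      row[i + 7] = v;
                 i += 8;
            } while ((arg -= 8) > 0);
    }
}
```
\par
For 39 longs the first pass stores 7 values and the remaining four passes store 8 each, which accounts for all 39.\par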
\par
Of course, if any other pixels needed to be drawn, there would be additional blitter opcode structs telling the interpreter what to do. The interpreter dispatches to the next opcode by executing the "JUMP_TO_NEXT" macro, which automatically uses GCC
's "goto void *" extension to "goto" the C label that handles the next opcode.\par
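\par
A stripped-down interpreter in the same style might look like the sketch below.  It relies on the GCC label-address and computed-goto extensions, so it needs GCC or a compatible compiler.  The struct mirrors the one shown earlier, but the opcode set, the "make_op" helper and the trick of calling the interpreter with a null pointer to fetch the label table are all assumptions made for this example.\par
\par
```c
#include <assert.h>
#include <stdint.h>

struct blt_op {
    const void *label;  /* address of the C label that handles this opcode */
    int32_t offset;     /* index of the first long to touch */
    int32_t arg;        /* a bit mask or a count, depending on the opcode */
};

enum { OP_MASK, OP_FILL, OP_DONE, NUM_OPS };

/* Advance to the next opcode and jump straight to its handler. */
#define JUMP_TO_NEXT do { ++op; goto *(void *)op->label; } while (0)

/* Interpret an opcode array against one row.  Called with op == 0, it
   only returns the table of label addresses so callers can build opcodes. */
static const void *const *blit_row(const struct blt_op *op,
                                   uint32_t *row, uint32_t v)
{
    static const void *const labels[NUM_OPS] = {
        &&do_mask, &&do_fill, &&do_done
    };
    int32_t i;

    if (op == 0)
        return labels;
    goto *(void *)op->label;            /* dispatch to the first opcode */

do_mask:
    row[op->offset] = (row[op->offset] & ~(uint32_t)op->arg)
                      | (v & (uint32_t)op->arg);
    JUMP_TO_NEXT;
do_fill:
    for (i = 0; i < op->arg; i++)
        row[op->offset + i] = v;
    JUMP_TO_NEXT;
do_done:
    return labels;
}

/* Build one opcode by looking up the handler's label address. */
static struct blt_op make_op(int kind, int32_t offset, int32_t arg)
{
    struct blt_op op;
    assert(kind >= 0 && kind < NUM_OPS);
    op.label = blit_row(0, 0, 0)[kind];
    op.offset = offset;
    op.arg = arg;
    return op;
}
```
\par
There is no central dispatch loop here: each handler ends by jumping directly to the next handler, which is what makes this style of interpreter cheap.\par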
\pard\plain \s255\sb240 \b\f21\ul Development Tools\par
\pard\plain \s254\sb120 \b\f21 Free Software\par
\pard\plain \f16\fs20 It is true that ARDI has a very tight R&D budget, but we really don't skimp on the tools that we use to build Executor.  We use free software to develop Executor because we like to push our tools very hard, and the only way we can do that and still sleep at night is to know that any bugs we find in those tools can be fixed quickly.  With free software the worst case is that we fix the bugs ourselves, and that worst case is actually much better than the average case with non-free software, where you report a bug and pray for a patch.  In reality it's rare that we even have to resort to the worst case, since the bugs we report are often fixed in less than a day.\par
\pard\plain \s253\li360\sb120 \b\f16 GCC\par
\pard\plain \f16\fs20 GCC is the Free Software Foundation's C compiler.  It produces good code and has a powerful inline assembly syntax that allows optimization to be done on the expressions in
the inline assembly without the optimization ruining the assembly you've written.\par
\par
Another handy GCC extension is "{\f22 typeof}", which can be used in macros to cast a value to the type of a different value.  The combination of powerful inline assembly and typeof allows us to have efficient macros that swap bytes in a 16-bit or 32-bit quantity.  Since the Mac and PC are of different endianness, quick byte swapping routines are very important.\par
\par
As mentioned above in our synthetic CPU and portable blitter descriptions, we also use GCC's ability to take the address of a label and store it in a variable so that we can produce our own threaded code on the fly.\par
\pard\plain \s253\li360\sb120 \b\f16 Hacked GCC\par
\pard\plain \f16\fs20 Because the source to GCC is available, it is possible, although not necessarily advisable, to hack in custom extensions.  At ARDI we've done this twice in the past.  At one time we used a set of locally written modificat
ions to support the pascal keyword so that we could automatically call functions using Pascal calling conventions.  At the same time we also supported '{\f22 1234'} (i.e.
 the ability to construct a 32-bit quantity out of four character constants inside apostrophes).    Eventually we decided that we didn't get enough benefit from these extensions to make it worth patching new versions of GCC as they came out.\par
\par
The other time we modified GCC was when we were porting Executor to DEC's Alpha processor.  We were doing this under OSF/1, which uses 64-bit pointers.  Since Executor needs to use the same internal representation that Macs use, we wanted a way to easily write 32-bit pointers to memory in such a way that they would be extended to 64 bits when they were read into a register for use.  To do this we made GCC support "pointer bit-fields", a logical extension that allowed bit-field notation to be used when specifying pointers.  At that time we didn't have a resident GCC expert, so we were lucky that such modifications basically consisted of taking out a few checks that disallowed such constructs.  Once those checks were removed, pointer bit-fields "just worked".\par
\pard\plain \s253\li360\sb120 \b\f16 DJGPP\par
\pard\plain \f16\fs20 DJGPP is DJ Delorie's (see {\f22 http://www.delorie.com}) port of GCC to MS-DOS.  It allows DOS users to compile UNIX programs under DOS and to run them with little or no modification.  DJGPP is GCC and associated development tools with a special UNIX-like C library and a "DOS Extender".  DOS extenders are used to combat OS inferiority.  DOS is a 16-bit OS, whereas most relatively modern OSes are 32-bit.  DOS extenders allow 32-bit programs to run under DOS.  Executor is one such program.  In fact, we use the djgpp libraries and DOS extender but we don't actually use the DOS port of GCC, because we don't like DOS.  We like Linux, and GCC is well structured, so we can do cross-compilation and cross-linking with the djgpp libraries and build our DOS product entirely under Linux.  We then copy the new Executor binary to a DOS partition, reboot to DOS, test Executor and then get the heck out of DOS.  Time spent using Executor feels more like using a Mac than like using DOS.\par
\pard\plain \s255\sb240 \b\f21\ul \page Debugging Tools\par
\pard\plain \f16\fs20 Internally we have many debugging tools to help us figure out why an application may die or misbehave under Executor.\par
\pard\plain \s254\sb120 \b\f21 More Free Software\par
\pard\plain \s253\li360\sb120 \b\f16 GDB in General\par
\pard\plain \f16\fs20 Almost all of our debugging is done under the GDB debugger.  As with GCC, we're not using GDB because it's the free debugger; we're using the free debugger because it's GDB.  GDB is quite powerful.\par
\par
Whenever we find that a given application fails under Executor, we try to reproduce the failure under Linux.  Debugging on a system that has complete memory protection and pre-emptive multi-tasking means that your system stays up even when your application crashes.  There's also no need to worry that a misbehaving program is subtly corrupting other programs on the system.\par
\pard\plain \s252\li360 \f16\ul Hardware Watchpoints\par
\pard\plain \f16\fs20 Beyond the features that are handed to us due to the underlying robustness of the OS, GDB also supports hardware watchpoints, at least on 80x86-based PCs.  80x86 processors can use hardware to watch a small set of memory locations to see when they change.  Since the checking is done by hardware, the program runs at full speed until a watched memory location is modified, at which point the debugger stops and tells us which instruction modified which memory address and what the old and new values are for that address.\par
\par
As an example, assume we want to know when the low-memory global {\f22 TheMenu} changes.  Here is how it might look under GDB:\par
\par
{\f22 (gdb) watch TheMenu\par
Hardware watchpoint 1: TheMenu\par
(gdb) c\par
Continuing.\par
Hardware watchpoint 1: TheMenu\par
\par
Old value = 0\par
New value = 768\par
C_HiliteMenu (mid=3) at menu.c:877\par
(gdb) swap16 768\par
$2 = 0x3\par
(gdb) c\par
Continuing.\par
Hardware watchpoint 1: TheMenu\par
\par
Old value = 768\par
New value = 0\par
C_HiliteMenu (mid=0) at menu.c:877\par
(gdb) delete 3\par
(gdb) c\par
Continuing.\par
}\par
At the first {\f22 (gdb) }prompt above, we tell GDB that we want to be alerted whenever the expression "TheMenu" changes.  GDB is clever enough to realize that it can watch that expression
with a hardware watchpoint, so it assigns watchpoint 1 to the task.  We then continue, which allows Executor to continue running whatever program it was already running.{\fs18\up6 \chftn {\footnote \pard\plain \s246 \f16\fs20 {\fs18\up6 \chftn }
I actually set this watchpoint in the session of Executor that I am using to run Word 5.1 for the Macintosh to compose this document (Executor/Linux on a 90 MHz Pentium).}}\par
\par
Eventually, when the menu bar was accessed, GDB told us that TheMenu had changed from 0 to 768.  768 may sound like a weird value for TheMenu to take, but this is on a byte swapped machine, so we need to swap that 16-bit value to see what TheMenu would look like to a Mac program.  768 is 0x0300, which swaps to 0x0003, so we find that it's 3, a sane value for TheMenu after all.  We let the program continue and later TheMenu is changed back to zero.\par
\par
You can't see it, but in another window the source to Executor is displayed, so we are automatically shown the 877th line of menu.c when GDB's watchpoint triggers there.\par
\par
The argument to the watch command is an arbitrary expression, so it is possible to watch for much more complex changes than our example demonstrated.  However, only relatively simple expressions can be handled by hardware watchpoints; the others are handled by software watchpoints, which are much slower.\par
\pard\plain \s253\li360\sb120 \b\f16 \page Hacked GDB\par
\pard\plain \f16\fs20 Unlike GCC, where we made local modifications and then, upon reflection, threw them out, we have made a slight change to GDB that is a big win for debugging Executor (and Mac programs running under Executor) on PCs.  GDB
 always knows how to disassemble the object code that it's running, and GDB is available for many architectures, so we modified GDB so that on the 80x86 we can do both 80x86 disassembly and 680x0
 disassembly.  That allows us to look at sections of memory within our emulator and see what 680x0 code is there.\par
\par
In the example below, Executor is running the game Risk when we interrupt it and tell GDB to break in the routine alinehandler.  We continue until alinehandler is hit, then disassemble, in 680x0 format, the first nine instructions at the location from which alinehandler was dispatched.  After that we disassemble, in 80x86 format, the first nine instructions of alinehandler itself.\par
\par
{\f22 (gdb) b alinehandler\par
Breakpoint 6 at 0x17ce2d: file executor.c, line 369.\par
(gdb) c\par
Continuing.\par
\par
Breakpoint 6, alinehandler (pc=3652006, ignored=0x0) at executor.c:369\par
(gdb) set m68k\par
(gdb) x/9i pc\par
0x37b9a6 :\tab _SystemTask\par
0x37b9a8 :\tab clrw sp@-\par
0x37b9aa :\tab movew #-1,sp@-\par
0x37b9ae :\tab pea a5@(-27598)\par
0x37b9b2 :\tab _GetNextEvent\par
0x37b9b4 :\tab moveb sp@+,d0\par
0x37b9b6 :\tab tstb d0\par
0x37b9b8 :\tab beqw 0x37ba0e <end+667542>\par
0x37b9bc :\tab movew a5@(-27598),d0\par
(gdb) set m68k off\par
(gdb) x/9i alinehandler\par
<alinehandler>:         pushl  %ebp\par
<alinehandler+1>:\tab movl   %esp,%ebp\par
<alinehandler+3>:\tab subl   $0x28,%esp\par
<alinehandler+6>:\tab pushl  %esi\par
<alinehandler+7>:\tab pushl  %ebx\par
<alinehandler+8>:\tab jmp    0x17ce10 <alinehandler+48>\par
<alinehandler+10>:\tab nop    \par
<alinehandler+11>:\tab nop    \par
<alinehandler+12>:\tab nop}    \par
\par
Being able to disassemble 680x0 code on the 80x86 required us to change approximately 50 source lines of GDB (remember, the 680x0 disassembly code was already present for use in GDB on 680x0 machines).  We also added a set of tables so that A-line traps and low-memory globals are displayed by name, rather than by number.\par
\par
Although our special circumstances led us to modify the GDB source code, GDB is customizable out of the box.  We've defined a handful of macros that automate debugging tasks.  Figure 6 is a macro that crawls through the stack in mac space.\par
\par
For comparison, Figure 7 is what GDB produces when backtracing through code that is compiled with debugging symbols.\par
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 \f16\fs20 {\f22 define macktrace\par
  set $_fp = cpu_state.regs[14].ul.n + 0\par
  silentswap32 (((uint32*)$_fp)[1]+0)\par
  set $_pc = $_val + 0\par
  silentswap32 (((uint32*)$_fp)[0]+0)\par
  set $_fp = $_val + 0\par
  while $_fp > 100 && $_fp < 30000000\par
    set $_start = (long) $_pc + 0\par
    while $_start > (long)&end && *(uint16 *)$_start != 0x564E\par
\tab set $_start = $_start - 2\par
    end\par
    printf "func=0x%lX, ret=0x%lX, fp=0x%lX, args=0x%02X%02X%02X%02X 0x%02X%02X%02X%02X 0x%02X%02X%02X%02X\\n",\\\par
\tab $_start, $_pc, $_fp,\\\par
\tab ((uint8 *)$_fp)[8], ((uint8 *)$_fp)[9], ((uint8 *)$_fp)[10],\\\par
\tab ((uint8 *)$_fp)[11], ((uint8 *)$_fp)[12], ((uint8 *)$_fp)[13],\\\par
\tab ((uint8 *)$_fp)[14], ((uint8 *)$_fp)[15], ((uint8 *)$_fp)[16],\\\par
\tab ((uint8 *)$_fp)[17], ((uint8 *)$_fp)[18], ((uint8 *)$_fp)[19]\par
    silentswap32 ((uint32*)$_fp)[1]+0\par
    set $_pc = $_val + 0\par
    silentswap32 ((uint32*)$_fp)[0]+0\par
    set $_fp = $_val + 0\par
  end\par
end\par
(gdb) macktrace\par
func=0x3824F8, ret=0x38250E, fp=0xB28E3C, args=0x00B2E852 0x000300B2 0x8E580037\par
func=0x37A2BE, ret=0x37A3A6, fp=0xB28E4A, args=0x0037BA12 0x000000B2 0x8F840037\par
func=0x37AECE, ret=0x37AFF0, fp=0xB28E58, args=0x0001002E 0xE0BC0000 0x00010035\par
func=0x379D58, ret=0x379E0C, fp=0xB28F84, args=0x000100B2 0x8F9200B2 0x8F9A0000\par
}\pard \qc\box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\plain \f16 \par
Figure 6.  Macktrace Definition and Example\par
}{\f22 \par
}\pard \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\f22 (gdb) backtrace\par
#0  C_SysBeep (i=10) at osutil.c:837\par
#1  0x18934d in PascalToCCall (ignoreme=2271560241, infop=0x29faa4)\par
    at emutrap.c:94\par
#2  0x17d0c9 in alinehandler (pc=3661160, ignored=0x0)\par
    at executor.c:399\par
#3  0x1c1b85 in trap_direct (trap_number=10, exception_pc=3661160, \par
    exception_address=0) at trap.c:201\par
#4  0x197cfc in S68K_HANDLE_0x00B5 () at syn68k.c:1038\par
#5  0x196067 in interpret_code (start_code=0x2df6c4) at syn68k.c:587\par
#6  0x12d476 in beginexecutingat (startpc=11730018)\par
    at launch.c:328\par
#7  0x12e1ce in launchchain (fName=0x2b53f8 "\\004Risk", vRefNum=-32717, \par
    resetmemory=1 '\\001') at launch.c:575\par
#8  0x12f6e0 in Launch (\par
    fName_arg=0x910 "\\004Riskutor", '\'ff' <repeats 27 times>, vRefNum_arg=-32717)\par
    at launch.c:1142\par
#9  0x17e1f7 in executor_main ()\par
    at executor.c:589\par
#10 0x13371a in main (argc=2, argv=0xbffffa04)\par
    at main.c:2112}\par
\pard \qc\box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\plain \f16 Figure 7.  GDB backtrace}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 \page \page
As you might guess, this disparity of information makes it much easier for us to track down bugs in our own code than to find bizarre incompatibilities in the code that is being run under the emulator.\par
\pard\plain \s254\sb120 \b\f21 Disassembler\par
\pard\plain \f16\fs20 Since GDB already knows how to disassemble 680x0 code it was possible to write a driver for GDB
 so that it can disassemble Mac programs.  The driver is about 1,000 lines of C code, with another 500 lines describing the low-memory globals.  Basically the driver knows about CODE resources and how intersegment jumps work.  GDB
 normally doesn't produce labels for jump targets or the beginning of subroutines, so the driver adds those too, to make the output that much easier to read.\par
\pard\plain \s254\sb120 \b\f21 Run-time Aids\par
\pard\plain \f16\fs20 Because we're using our own set of OS and Toolbox routines, we can add code that is conditionally compiled into debug versions of Executor that can provide still more information than GDB or GDB macros can.\par
\pard\plain \s253\li360\sb120 \b\f16 Debugtable, Debugnumber\par
\pard\plain \f16\fs20 Our A-line trap handler has a table, known as debugtable, of 4096 32-bit ints that it updates each time a trap is taken.  Each time alinehandler is called, a variable known as "debugnumber" is incremented and then the value of debugnumber is stored in the slot in debugtable corresponding to the A-line trap that was called.  This lets us see both which traps were executed most recently and the complete set of traps that an application has called, no matter how long the application has run.\par
\par
\pard This scheme has its drawbacks.  Traps that are dispatched via selectors are all lumped together.  Traps whose addresses are taken and then are called by jumps through the address don't show up in debugtable.
Although debugtable and debugnumber are perhaps the least
sophisticated portion of Executor, they're still quite handy, since a visual inspection of the last 100 traps made before an application died often gives a good idea of where to start looking for the source of the incompatibility.\par
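\par
The bookkeeping is simple enough to sketch in a few lines of C.  The table size and the two variable names come from the description above; indexing by the low 12 bits of the trap word, and the helper "most_recent_trap", are assumptions made for this example.\par
\par
```c
#include <assert.h>
#include <stdint.h>

#define N_ALINE_TRAPS 4096   /* one slot per 12-bit A-line trap number */

static uint32_t debugtable[N_ALINE_TRAPS];
static uint32_t debugnumber;

/* Record that the A-line trap with the given 16-bit opcode was taken. */
static void note_trap(uint16_t trap_word)
{
    assert((trap_word & 0xF000) == 0xA000);
    debugtable[trap_word & 0x0FFF] = ++debugnumber;
}

/* Return the slot stamped most recently, or -1 if no trap was taken. */
static int most_recent_trap(void)
{
    int i, best = -1;
    for (i = 0; i < N_ALINE_TRAPS; i++)
        if (debugtable[i] != 0
            && (best < 0 || debugtable[i] > debugtable[best]))
            best = i;
    return best;
}
```
\par
Sorting the nonzero slots by their stamps gives the order in which each trap was last called, and any slot that is still zero belongs to a trap the application never made.\par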
\pard\plain \s253\li360\sb120 \b\f16 XX_slam\par
\pard\plain \f16\fs20 In the course of developing Executor, we did a major rewrite of our memory manager and our TextEdit replacement.  In both cases it's not enough to just implement the APIs that are defined in Inside Macintosh; we also have to duplicate the in-memory data structures so that programs which count on them will run properly.  To help us verify that we weren't adding new bugs when we rewrote those subsystems, we added routines that consistency check the data structures that each of those subsystems supports.\par
\par
\pard Because these consistency checks are thorough but time consuming, we call them "slams", and by default they are not enabled, even in debugging versions of Executor.  When they are enabled, the data structures for each subsystem are slammed at the entry to a call that might modify one of the data structures, and slammed once again on exit from the routine.  We can turn them on at run-time either by using a command line option when Executor is started or by using GDB to enable the slamming.  This is something we should have done for all of Executor's subsystems from day one, since it's ever so helpful to be told that going into routine XXX the heap was fine, but coming out the heap was corrupted.\par
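\par
The entry and exit pattern can be sketched in C as below.  The toy "heap" struct, its invariant and every name in the block are invented for illustration; the real slams check the actual Memory Manager and TextEdit data structures, which are far more involved.\par
\par
```c
#include <assert.h>
#include <stddef.h>

/* A toy heap summary; the invariant is used + free_bytes == total. */
struct heap { size_t total; size_t used; size_t free_bytes; };

static int slam_enabled;                  /* off by default */
static int entry_failures, exit_failures;

/* The consistency check itself. */
static int heap_consistent(const struct heap *h)
{
    return h->used + h->free_bytes == h->total;
}

/* Allocate n bytes, slamming the heap on entry and again on exit. */
static int heap_alloc(struct heap *h, size_t n)
{
    int ok;
    assert(h != NULL);
    if (slam_enabled && !heap_consistent(h))
        entry_failures++;                 /* already corrupt on the way in */
    ok = n <= h->free_bytes;
    if (ok) {
        h->used += n;
        h->free_bytes -= n;
    }
    if (slam_enabled && !heap_consistent(h))
        exit_failures++;                  /* corrupt on the way out */
    return ok;
}
```
\par
A failure on exit with no failure on entry points at the routine itself; a failure on entry means the damage was done somewhere between the previous slammed call and this one.\par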
\pard\plain \s253\li360\sb120 \b\f16 Image Viewer\par
\pard\plain \f16\fs20
Reading disassembled code is much easier than staring at hex numbers.  Similarly, being able to view a portion of memory as some sort of PixMap (assuming that the memory really is a bit image) is also better than staring at a bunch of hex numbers.  When we build Executor for the X Window System, we also build an image server that uses UNIX interprocess communication to communicate with the process being debugged under GDB.  This allows us to monitor offscreen graphics, which can be very important when an application makes many graphics calls and eventually an abomination is drawn on the screen instead of what should have been drawn.\par
\par
\pard Our debugging arsenal includes other, more prosaic, tools.  In fact, our debugging environment encourages the development of new tools, because it's so easy to leverage existing tools into new tools and even write new tools from scratch.\par
\pard\plain \s255\sb240 \b\f21\ul \page Future Plans\par
\pard\plain \f16\fs20 Much of VCPU, a successor to Syn68k, has already been written.  VCPU performs many optimizations that
 Syn68k does not, including improved register allocation, dead subregister elimination, opcode "widening", and moving work outside of loops.  VCPU has a clean high-level syntax for specifying both front ends and back ends, allowing it to dynamically compil
e both PowerPC and m68k binaries on any architecture we decide to support.\par
\par
Although we haven't described it in this paper, the graphics subsystem one layer above the blitter already has hooks in it to allow use of graphics accelerators, where present.  We plan native ports to Win32 and OS/2, and those ports should be able to use fancier graphics subsystems and also make use of the underlying network APIs.\par
\par
Currently INITs and CDEVs do not run under Executor, but the same mechanisms that allow applications to run can also allow INITs and CDEVs to run.  QuickTime and ATM will both be high priorities after Executor 2 ships.\par
\par
We will also be developing compiler tools that will allow ISVs to natively compile CPU specific routines to be used when their applications are run under Executor.  Executor already uses such gateways internally.\par
\par
Already, multiple simultaneous instances of Executor can be run under NEXTSTEP and Linux (and to a lesser extent under Windows 95).  Currently only Executor/NEXTSTEP handles PICT pasteboard cutting and pasting from one instance of Executor to another, and no version of Executor does enough file locking to allow concurrent access to the same HFS volumes.  This needs to be fixed, since either through shared text segments under UNIX and UNIX-like operating systems or through DLLs under Microsoft operating systems, it can be made fairly efficient to run multiple instances of Executor simultaneously.  When that is done, each instance of Executor has its own address space and is automatically scheduled by the underlying operating system scheduler.  That means that Executor "inherits" memory protection and pre-emptive multi-tasking from the underlying core operating system.\par
\par
By properly exploiting this inheritance it should be possible to provide an environment that allows well-behaved Mac applications to run efficiently under a variety of PC operating systems with automatic protection from non-well-behaved applications.
\par
\par
One interesting variant on this theme would be to use Linux as the core OS, but to hide it from the end-user, for a net result of an 80x86 box that boots an efficient, robust MacOS-like environment.  \sect \sectd \sbknone\linemod0\linex0\cols1\endnhere
\pard\plain \f16\fs20 {\plain \f16 \par
}}