mirror of
https://github.com/ctm/executor.git
synced 2025-01-11 23:29:54 +00:00
{\rtf1\mac\deff2 {\fonttbl{\f0\fswiss Chicago;}{\f2\froman New York;}{\f3\fswiss Geneva;}{\f4\fmodern Monaco;}{\f16\fnil Palatino;}{\f20\froman Times;}{\f21\fswiss Helvetica;}{\f22\fmodern Courier;}{\f23\ftech Symbol;}}{\colortbl\red0\green0\blue0;
|
|
\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;}{\stylesheet{\s243\tqc\tx4320\tqr\tx8640 \f16\fs20 \sbasedon0\snext243 footer;}{\s245
|
|
\f16\fs18\up6 \sbasedon0\snext0 footnote reference;}{\s246 \f16\fs20 \sbasedon0\snext246 footnote text;}{\s250\li720 \f16\fs20\ul \sbasedon0\snext0 heading 6;}{\s251\li720 \b\f16\fs20 \sbasedon0\snext0 heading 5;}{\s252\li360 \f16\ul \sbasedon0\snext0
|
|
heading 4;}{\s253\li360\sb120 \b\f16 \sbasedon0\snext0 heading 3;}{\s254\sb120 \b\f21 \sbasedon0\snext0 heading 2;}{\s255\sb240 \b\f21\ul \sbasedon0\snext0 heading 1;}{\f16\fs20 \sbasedon222\snext0 Normal;}}{\info{\title Executor Internals}{\subject
|
|
How to efficiently run Mac Programs on PCs}{\author ctm + mat}}\margl720\margr720\margt-1440\margb-1440\gutter1080\widowctrl\ftnbj\fracwidth\margmirror \sectd \sbknone\linemod0\linex0\cols1\endnhere {\footer \pard\plain \s243\tx4320\tqr\tx8640 \f16\fs20
|
|
\tab Executor Internals\tab \chpgn \par
|
|
}\pard\plain \qc \f16\fs20 {\fs36 Executor Internals:\par
|
|
How to Efficiently Run Mac Programs on PCs\par
|
|
\par
|
|
}{\plain \f16 Mathew J. Hostetter }{\plain \f22 <mat@ardi.com>}{\plain \f16 \par
|
|
Clifford T. Matthews }{\plain \f22 <ctm@ardi.com>\par
|
|
}{\fs18 After MacHack '96, this paper will be available from}{\f22\fs18 http://www.ardi.com}{\fs18 \par
|
|
}{\plain \f16 \par
|
|
}\pard Executor is a commercial Macintosh emulator that uses no software from Apple, but is still able to run much 680x0 based Macintosh software faster on Pentiums than the same software runs on 680x0
|
|
based Macs. This paper contains some implementation details, including descriptions of Executor's synthetic CPU, graphics subsystem and debugging environment. Portability issues, current limitations and future plans are also presented.{\fs36 \par
|
|
}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \s255\sb240 \b\f21\ul Executor Overview\par
|
|
\pard\plain \s254\sb120 \b\f21 What Executor is\par
|
|
\pard\plain \f16\fs20 Executor is a commercial emulator that allows PCs to run many Macintosh applications. Executor does not require Macintosh ROMs or a Macintosh System file and contains no Apple code itself. Executor was written entirely by engineers without Macintosh backgrounds who have not disassembled any of Apple's ROMs or System file.\par
|
|
\pard\plain \s254\sb120 \b\f21 Limitations\par
|
|
\pard\plain \f16\fs20
|
|
Because Executor was written strictly from publicly available documentation (Inside Macintosh, Tech. Notes, etc.), programs which make use of undocumented features of MacOS may fail under Executor. Furthermore, there are some portions of MacOS that we haven't implemented yet. Executor is sufficiently large that there are probably bugs in some of our code as well. We realize these are major limitations, but this paper is primarily concerned with implementation details that are interesting to our fellow programmers as opposed to feature sets and limitations which are of more concern to end users and our marketing department.\par
|
|
\pard\plain \s255\sb240 \b\f21\ul Design Goals\par
|
|
\pard\plain \f16\fs20 Our goal is for Executor to be accurate, fast and portable. Beyond that, completeness is a secondary issue.\par
|
|
\par
|
|
Accuracy means that each subsystem that we implement should behave exactly according to the functional specs for the subsystem that we've derived from a combination of reading documentation, writing test cases and running programs under Executor.\par
|
|
\par
|
|
|
|
Fast is harder to quantify. As programmers we like to use advanced techniques that will result in programs running under Executor as quickly as possible. Unfortunately, we have a limited number of engineer hours in a week and most engineering time is spent implementing new subsystems or finding and fixing subtle incompatibilities. We're proud of the speed that we've obtained so far, but we know that we can do better in the future.\par
|
|
\par
|
|
Portability is the ability to support multiple platforms from the same source base. A platform is a combination of CPU, operating system and graphics device or windowing system. Executor currently supports Intel 80[3456]86 and compatible CPUs, Motorola m680[34]0 CPUs, the operating systems DOS, Linux and NEXTSTEP and can interact with VGA, SVGA, Display PostScript and X-Windows. To get the best performance on some architectures we do use architecture specific code, but we also write portable versions to be used where the platform specific versions can't be. Although not supported as a product, Executor was ported to DEC's Alpha, but since ARDI has no Alpha and DEC lost interest, the Alpha port is no longer current. ROMlib, ARDI's rewrite of the MacOS OS and Toolbox routines, has also been ported (although not recently) to a wide variety of platforms, including MIPS, m88k, Clipper, IBM RT, SPARC and even VAX based systems.\par
|
|
\par
|
|
|
|
Those three design goals have led us in the direction of dynamic code generation for both the 680x0 emulation and for our blitter. In both cases we use high level descriptions of what we want accomplished and then use special purpose tools at compile time
|
|
to translate these high level descriptions into constructs that we can then use at run time.\par
|
|
\par
|
|
High level descriptions are less error prone, allowing us to document the semantics that we wish to see in our synthetic CPU or blitter using a special purpose language that is directly suited to the task at hand, rather than a general purpose language like C or the traditional language of speed freaks -- assembler.\par
|
|
\par
|
|
High level descriptions also lend themselves to portability. We have our tools generate portable constructs for the general case and, with a little more programming effort, faster architecture specific constructs for the architectures that we consider most important.\par
|
|
\par
|
|
Since the conversion from high level description to useful construct takes place at compile time, there is no need to worry about the CPU cycles spent doing the mapping. This allows us to design our code by thinking: "At {\i runtime}, what would be the optimal instruction sequence to perform a specific task?" Once we know the answer to that question we can ask: "How can we represent, at a high level, the task being accomplished by that optimal set of instructions?" Then, the final question is "Given what we want to generate and how we want to represent it, what does the compile time mapping look like?" The entire time we're pondering those three questions, we're keeping accuracy, portability and efficiency in mind.\par
|
|
\pard\plain \s255\sb240 \b\f21\ul Executor Subsystems\par
|
|
\pard\plain \s254\sb120 \b\f21 Synthetic CPU\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 Overview\par
|
|
\pard\plain \f16\fs20 Syn68k is the name of the synthetic CPU that Executor 2 uses. Syn68k is both highly portable and fast. The portable core of Syn68k, which works by dynamically compiling 680x0 code into an efficient interpreted form, was designed to run on all major CPUs. On supported architectures, Syn68k can also translate 680x0 code into native code that the host processor can run directly.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 Syngen\par
|
|
\pard\plain \f16\fs20 Syngen analyzes a lisp-like file describing the bit patterns and semantics of the 680x0 instruction set and produces lookup tables and C code for the runtime system to use. The code and tables generated by syngen depend somewhat on the characteristics of the host processor; for example, on a little endian machine it is advantageous to byte swap some extracted 680x0 operands at translation time instead of at runtime.\par
|
|
\par
|
|
The 680x0 description file can describe multiple ways to emulate any particular 680x0 opcode. The runtime system looks at what CC bits are live after the instruction and chooses the fastest variant it can legally use. In Figure 1, we have two CC variants of lsrw; one computes no CC bits, and the other computes all of them.\par
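The variant selection described above can be sketched in C. This is only an illustration under our own assumptions: the record layout, the cost field and the name choose_variant are hypothetical and are not taken from Syn68k's sources.\par

```c
#include <stddef.h>

/* One bit per 680x0 condition code, in the C N V X Z order used by Figure 1. */
enum { CC_C = 1, CC_N = 2, CC_V = 4, CC_X = 8, CC_Z = 16 };

/* Hypothetical variant record: which CC bits an implementation leaves valid. */
struct cc_variant {
    unsigned computes;   /* set of CC_* bits this variant computes */
    int cost;            /* made-up relative cost, lower is faster */
};

/* Return the index of the cheapest variant that computes every live CC bit,
   or -1 if no variant can legally be used. */
int choose_variant(const struct cc_variant *v, size_t n, unsigned live)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if ((live & ~v[i].computes) != 0)
            continue;                 /* would lose a bit that is read later */
        if (best < 0 || v[i].cost < v[best].cost)
            best = (int)i;
    }
    return best;
}

/* The two lsrw variants of Figure 1: "-----" and "CN0XZ". */
static const struct cc_variant lsrw_variants[2] = {
    { 0, 1 },
    { CC_C | CC_N | CC_V | CC_X | CC_Z, 3 },
};
```

With no live CC bits the cheap variant (index 0) is chosen; as soon as any CC bit is live, only the full variant (index 1) is legal.\par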
|
|
\par
|
|
The 680x0 description file can also specify which 680x0 operands should be "expanded" to become implicitly known by the corresponding synthetic opcode. For example, fully expanding out "addl dx,dy" would result in 64 synthetic opcodes, one for each combination of data register operands. This results in smaller and faster synthetic opcodes at the expense of increasing the total number of synthetic opcodes. To conserve space, we only "expand out" common 680x0 opcodes. On host architectures where we can compile to native code, we don't waste space by "expanding out" common synthetic opcodes.\par
|
|
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 (defopcode lsrw_ea\par
(list 68000 amode_alterable_memory () (list "1110001011mmmmmm"))\par
(list "-----" "-----" dont_expand\par
\tab (assign $1.muw (>> $1.muw 1)))\par
(list "CN0XZ" "-----" dont_expand\par
\tab (list\par
\tab (assign ccx (assign ccc (& $1.muw 1)))\par
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\f22 \tab (ASSIGN_NNZ_WORD (assign $1.muw (>> $1.muw 1))))))\par
\par
}{\plain \f16 Figure 1. Syn68k description of lsrw}{\f22 \par
}\pard \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \s253\li360\sb120 \b\f16 Interpreted Code\par
\pard\plain \f16\fs20 Our interpreted code consists of contiguous sequences of "synthetic opcodes" and their operands. Syngen can generate ANSI C, but when compiled with GCC it uses C language extensions that make synthetic opcodes be pointers to the C code responsible for interpreting that opcode. This "threaded interpreting" entirely eliminates switch dispatch and loop overhead.\par
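As a rough illustration of threaded interpreting, here is a toy stack machine written with GCC's "labels as values" extension. The three opcodes, the cell type and the function names are made up for this sketch; Syn68k's real synthetic opcodes are generated by syngen and look nothing like this. The point is only that each handler ends by jumping straight to the next handler.\par

```c
#include <assert.h>
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_HALT };      /* toy source opcodes */

/* A threaded-code cell is either a handler address or an inline operand. */
typedef union cell {
    void *handler;
    long operand;
} cell;

/* Translate src[] into threaded code, then run it (requires GCC or Clang). */
long run_threaded(const long *src, size_t n)
{
    static void *const handlers[] = { &&op_push, &&op_add, &&op_halt };
    cell code[32];
    long stack[16];
    int sp = 0;
    size_t i = 0;
    const cell *pc = code;

    assert(n <= 32);
    while (i < n) {                     /* "compile": opcode -> handler address */
        long op = src[i];
        code[i].handler = handlers[op];
        i++;
        if (op == OP_PUSH) {            /* operand follows its opcode */
            code[i].operand = src[i];
            i++;
        }
    }

    goto *(pc++)->handler;              /* dispatch the first opcode */

op_push:
    stack[sp++] = (pc++)->operand;
    goto *(pc++)->handler;              /* no switch, no central loop */
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto *(pc++)->handler;
op_halt:
    return stack[sp - 1];
}

/* Runs "push 2; push 40; add; halt". */
long threaded_demo(void)
{
    static const long prog[] = { OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_HALT };
    return run_threaded(prog, sizeof prog / sizeof prog[0]);
}
```

The translation loop plays the role of dynamic compilation into interpreted form; after it has run, executing an opcode costs one indirect jump.\par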
|
|
\pard\plain \s253\li360\sb120 \b\f16 Native Code\par
|
|
\pard\plain \f16\fs20 For the 80x86 architecture, Syn68k supports an optional architecture-specific native code extension that
|
|
tries to generate native code whenever possible. In those rare cases when it cannot, it reverts to our interpreted code. Since Syn68k
|
|
supports both native and synthetic code, the runtime system automatically inserts gateways between the two whenever there is a transition. \par
|
|
\par
|
|
Three major problems make translating 680x0 code to 80x86 code difficult:\par
|
|
\par
|
|
\pard \fi-180\li540 \bullet The 80x86 has only 8 registers, while the 680x0 has 16.\par
|
|
\par
|
|
\bullet The 80x86 is little endian, while the 680x0 is big endian.\par
|
|
\par
|
|
\bullet The 80x86 does not have general-purpose postincrement and predecrement operators, which are used frequently in 680x0 code.\par
|
|
\pard \par
|
|
On the other hand, several factors make the job easier than it would be for RISC machines:\par
|
|
\par
|
|
\pard \fi-180\li540 \bullet The 80x86 has all of the CISC addressing modes commonly used in 680x0 code.\par
|
|
\par
|
|
\bullet The 80x86 has CC bits that map directly to their 680x0 counterparts (except for the 680x0's X bit).\par
|
|
\par
|
|
\bullet The 80x86 supports 8-, 16- and 32-bit operations (although it can only support 8-bit operations on four of its registers).\par
|
|
\par
|
|
\bullet The 80x86 and 680x0 have analogous conditional branch instructions.\par
|
|
\par
|
|
\bullet The 80x86 allows unaligned memory accesses without substantial overhead.\par
|
|
\pard \par
|
|
|
|
The toughest problem is the lack of registers. On 32-register RISC architectures it's easy to allocate one RISC register for each 680x0 register, but on the 80x86 a different approach is needed. The obvious solution is to perform full-blown inter-block register allocation, but we fear that using traditional compiler techniques would be unacceptably slow.\par
|
|
\par
|
|
For now, we have adopted a simple constraint: between basic blocks, all registers and live CC bits must reside in their canonical home in memory. Within a block, anything goes. So what liberties does Syn68k take within a block?\par
|
|
\par
|
|
|
|
The 80x86 register set is treated as a cache for recently used 680x0 registers, and the 80x86 CC bits are used as a cache for the 680x0 CC bits. At any particular point within a block, each 680x0 register is either sitting in its memory home or is cached in an 80x86 register, and each live 680x0 CC bit is either cached in its 80x86 equivalent or stored in its memory home. Cached registers may be in canonical form, may be byte swapped, may have only their low two bytes swapped, or may be offset by a known constant from their actual value.\par
|
|
\par
|
|
Each 680x0 instruction can require that 680x0 registers be cached in particular ways. For example, {\f22 movel d0, mem} requires d0 to be cached in big endian byte order. The compilation engine generates the minimal code needed to satisfy those constraints and then calls a sequence of routines to generate the native code. As each 680x0 instruction is processed, each 680x0 register's cache status is updated. Dirty registers are canonicalized and spilled back to memory at the end of each block (or when we run out of 80x86 registers and we need to make room).\par
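The cache status of a register can be pictured as a small tagged record. The sketch below is our own simplification: the enum names, the struct layout and canonical_value are hypothetical, and it models only the four cached forms listed above, not the code that Syn68k emits to convert between them.\par

```c
#include <stdint.h>

/* Hypothetical names for the cached forms a 680x0 register may take. */
enum cache_form {
    FORM_NATIVE,   /* canonical value, host byte order */
    FORM_SWAP32,   /* all four bytes swapped */
    FORM_SWAP16,   /* only the low two bytes swapped */
    FORM_OFFSET    /* actual value = host_value + offset */
};

struct cached_reg {
    enum cache_form form;
    uint32_t host_value;   /* what the 80x86 register currently holds */
    int32_t offset;        /* meaningful only for FORM_OFFSET */
    int dirty;             /* must be spilled at the end of the block */
};

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint32_t swap16_low(uint32_t v)
{
    return (v & 0xffff0000u) | ((v & 0xffu) << 8) | ((v >> 8) & 0xffu);
}

/* The value the register would have after being canonicalized. */
uint32_t canonical_value(const struct cached_reg *r)
{
    switch (r->form) {
    case FORM_SWAP32: return swap32(r->host_value);
    case FORM_SWAP16: return swap16_low(r->host_value);
    case FORM_OFFSET: return r->host_value + (uint32_t)r->offset;
    default:          return r->host_value;
    }
}
```

An instruction that needs a register in a particular form compares the required form with the recorded one and emits conversion code only when they differ.\par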
|
|
\par
|
|
|
|
We allow 680x0 registers to be cached with varying byte orders and offsets so that we can perform the optimizations of lazy byte swapping and lazy constant offsetting. If the 680x0 program loads a register from memory and then ends up writing it out later, we avoid unnecessary byte swaps by not canonicalizing the value immediately. Lazy constant offsetting mitigates the overhead of postincrement and predecrement side effects. Figure 2 is an example of lazy constant offsetting.\par
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 \f16\fs20 This 680x0 code:\par
\par
{\f22 \tab pea\tab \tab 0x1\par
\tab pea\tab \tab 0x2\par
\tab pea\tab \tab 0x3\par
\tab pea\tab \tab 0x4\par
}\tab ...\par
\par
becomes this 80x86 code:\par
\par
{\f22 \tab movl\tab _a7,%edi\par
\tab movl\tab $0x01000000,-4(%edi)\tab ; "push" big-endian constant\par
\tab movl\tab $0x02000000,-8(%edi)\par
\tab movl\tab $0x03000000,-12(%edi)\par
\tab movl\tab $0x04000000,-16(%edi)\par
\tab ... <more uses of a7 may follow, and they'll use %edi>\par
\tab subl\tab $16,%edi\par
\tab movl\tab %edi,_a7\par
}\tab ...\par
\pard \qc\box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 {\plain \f16 Figure 2. Lazy Constant Offsetting\par
}\pard \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 
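The bookkeeping behind Figure 2 can be modelled in a few lines of C. This is a simulation written for this paper's example only: struct lazy_a7, lazy_pea and lazy_flush are hypothetical names, and the real engine emits 80x86 instructions rather than performing the stores itself.\par

```c
#include <stdint.h>

/* Models a7 cached in a host register plus a not-yet-applied constant. */
struct lazy_a7 {
    uint32_t base;     /* value loaded into the host register at block entry */
    int32_t pending;   /* known constant offset accumulated so far */
};

/* "pea imm": store a big-endian long below the stack pointer and only
   remember the predecrement instead of performing it. */
void lazy_pea(struct lazy_a7 *sp, uint8_t *mem, uint32_t imm)
{
    uint8_t *p;

    sp->pending -= 4;
    p = mem + sp->base + sp->pending;   /* like -4(%edi), -8(%edi), ... */
    p[0] = (uint8_t)(imm >> 24);
    p[1] = (uint8_t)(imm >> 16);
    p[2] = (uint8_t)(imm >> 8);
    p[3] = (uint8_t)imm;
}

/* End of block: apply the accumulated offset once, like "subl $16,%edi". */
uint32_t lazy_flush(struct lazy_a7 *sp)
{
    sp->base += (uint32_t)sp->pending;
    sp->pending = 0;
    return sp->base;
}
```

Four calls to lazy_pea followed by one lazy_flush correspond to the four stores and the single subl of Figure 2.\par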
|
|
\par
|
|
|
|
As mentioned above, we use the 80x86 condition code bits as a cache for the real 680x0 CC bits. Although live cached CC bits are occasionally spilled back to memory because some 80x86 instruction is about to clobber them, this trick almost always works.
|
|
Using 80x86 CC bits, we can frequently get away with extremely concise code sequences; for example, a 680x0 compare and conditional branch becomes an 80x86 compare and conditional branch.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 Self-modifying Code\par
|
|
\pard\plain \f16\fs20 Like most dynamically compiling emulators, Syn68k doesn't detect self-modifying code; the overhead is too high. Fortunately, self-modifying programs don't work on the real 68040 either. We rely on the program making explicit system calls to flush the caches whenever 680x0 code may have been modified or created. Some programs (like HyperCard) flush the caches very often, which can cause real performance headaches if code is continuously recompiled. We have solved this problem by checksumming 680x0 blocks as they are compiled and, when the caches are flushed, discarding the compiled form of only those blocks which fail their checksums. This optimization alone sped up some HyperCard stacks by a factor of three or so.\par
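A minimal sketch of the checksum test follows. The paper does not say which checksum Syn68k uses, so the rotate-and-add function here is an assumption, as are the struct block layout and the function names.\par

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kept for each compiled 680x0 block. */
struct block {
    const uint8_t *start;   /* first byte of the original 680x0 code */
    size_t len;             /* length of the 680x0 code in bytes */
    uint32_t checksum;      /* checksum taken when the block was compiled */
    int valid;              /* nonzero while the compiled form may be used */
};

/* Assumed checksum: rotate left by one, then add the next byte. */
uint32_t block_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) + p[i];
    return sum;
}

/* Called when a block has just been compiled. */
void block_record(struct block *b, const uint8_t *start, size_t len)
{
    b->start = start;
    b->len = len;
    b->checksum = block_checksum(start, len);
    b->valid = 1;
}

/* Called on an explicit cache flush: drop only blocks whose 680x0 bytes
   changed. Returns the number of blocks invalidated. */
size_t flush_cache(struct block *blocks, size_t n)
{
    size_t dropped = 0;
    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].valid)
            continue;
        if (block_checksum(blocks[i].start, blocks[i].len) != blocks[i].checksum) {
            blocks[i].valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

A program that flushes constantly but rarely changes its code then pays for a checksum pass instead of a full recompilation.\par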
|
|
\pard\plain \s253\li360\sb120 \b\f16 Examples\par
|
|
\pard\plain \f16\fs20 Figure 3 contains two sample 680x0 code sequences from real applications, and the 80x86 code that Syn68k generates for them. We chose these code sequences specifically to showcase several of the techniques we use, so
|
|
you shouldn't use them as a substitute for benchmarks. Not all 680x0 code translates as well as these examples do, but these examples are far from exotic.\par
|
|
\pard\plain \s254\sb120 \b\f21 \sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 \f16\fs20 Example 1 (Solarian):\par
|
|
\par
|
|
680x0 code:\par
|
|
\par
|
|
{\f22 \tab addqb\tab #1,a4@(1)\par
|
|
\tab movel\tab #0,d0\par
|
|
\tab moveb\tab a4@,d0\par
|
|
\tab swap\tab d0\par
|
|
\tab clrw\tab d0\par
|
|
\tab swap\tab d0\par
|
|
\tab asll\tab #2,d0\par
|
|
\tab lea\tab a5@(-13462),a0\par
|
|
\tab addal\tab d0,a0\par
|
|
\tab moveal\tab a0@,a0\par
|
|
\tab movel\tab #0,d0\par
|
|
\tab moveb\tab a4@(1),d0\par
|
|
\tab cmpw\tab a0@,d0\par
|
|
\tab bcs\tab 0x3fffee2}\par
|
|
\par
|
|
\par
|
|
80x86 code:\par
|
|
\par
|
|
{\f22 \tab movl\tab _a4,%edi\tab \tab ; addqb #1,a4@(1)\par
|
|
\tab addb\tab $0x1,0x1(%edi)\par
|
|
\tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d0\par
|
|
\tab movb\tab (%edi),%bl\tab \tab ; moveb a4@,d0\par
|
|
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d0\par
|
|
\tab xorw\tab %bx,%bx\tab \tab ; clrw d0\par
|
|
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d0\par
|
|
\tab shll\tab $0x2,%ebx\tab \tab ; asll #2,d0\par
|
|
\tab movl\tab _a5,%esi\tab \tab ; lea a5@(-13462),a0\par
|
|
\tab leal\tab 0xffffcb6a(%esi),%edx\par
|
|
\tab addl\tab %ebx,%edx\tab \tab ; addal d0,a0\par
|
|
\tab movl\tab (%edx),%edx\tab \tab ; moveal a0@,a0\par
|
|
\tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d0\par
|
|
\tab movb\tab 0x1(%edi),%bl\tab ; moveb a4@(1),d0\par
|
|
\tab bswap\tab %edx\tab \tab \tab ; cmpw a0@,d0\par
|
|
\tab movw\tab (%edx),%cx\par
|
|
\tab rorw\tab $0x8,%cx\par
|
|
\tab cmpw\tab %cx,%bx\par
|
|
\tab movl\tab %edx,_a0\tab \tab ; <spill dirty 68k\par
|
|
\tab movl\tab %ebx,_d0\tab \tab ; registers back to memory>\par
|
|
\tab jb\tab 0x6fae0c\tab \tab ; bcs 0x3fffee2\par
|
|
\tab jmp\tab 0x6faf0c\tab \tab ; <go to "fall through" code>}\par
|
|
\pard \box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 \page Example 2 (PageMaker):\par
|
|
\par
|
|
680x0 code:\par
|
|
\par
|
|
{\f22 \tab movel\tab #0,d2\par
|
|
\tab moveb\tab d0,d2\par
|
|
\tab lslw\tab #8,d0\par
|
|
\tab orw\tab d0,d2\par
|
|
\tab movel\tab d2,d0\par
|
|
\tab swap\tab d2\par
|
|
\tab orl\tab d2,d0\par
|
|
\tab movel\tab a0,d2\par
|
|
\tab lsrb\tab #1,d2\par
|
|
\tab bcc\tab 0x3fffed4}\par
|
|
\par
|
|
80x86 code:\par
|
|
\par
|
|
{\f22 \tab xorl\tab %ebx,%ebx\tab \tab ; movel #0,d2\par
|
|
\tab movl\tab _d0,%edx\tab \tab ; moveb d0,d2\par
|
|
\tab movb\tab %dl,%bl\par
|
|
\tab shlw\tab $0x8,%dx\tab \tab ; lslw #8,d0\par
|
|
\tab orw\tab %dx,%bx\tab \tab ; orw d0,d2\par
|
|
\tab movl\tab %ebx,%edx\tab \tab ; movel d2,d0\par
|
|
\tab rorl\tab $0x10,%ebx\tab \tab ; swap d2\par
|
|
\tab orl\tab %ebx,%edx\tab \tab ; orl d2,d0\par
|
|
\tab movl\tab _a0,%ecx\tab \tab ; movel a0,d2\par
|
|
\tab movl\tab %ecx,%ebx\par
|
|
\tab shrb\tab %bl\tab \tab \tab ; lsrb #1,d2\par
|
|
\tab movl\tab %ebx,_d2\tab \tab ; <spill dirty 68k\par
|
|
\tab movl\tab %edx,_d0\tab \tab ; registers back to memory>\par
|
|
\tab jae\tab 0x3b734c\tab \tab ; bcc 0x3fffed4\par
|
|
\tab jmp\tab 0x43d48c\tab \tab ; <go to "fall through" 68k code>\par
|
|
\par
|
|
}\pard \qc\box\brdrs \phmrg\posxc\posyt\dxfrtext180\shading1000 {\plain \f16 Figure 3. 680x0 -> 80x86 examples\par
|
|
}\pard\plain \s254\sb120 \b\f21 \sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \s254\sb120 \b\f21 Graphics\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 SVGA Graphics\par
|
|
\pard\plain \f16\fs20 The DOS world is one of standards. {\i Many}
|
|
standards. Standards made by engineers who were even more short-sighted than the folks who brought you ROM85, only to be replaced by SysEnvirons which was then replaced by Gestalt. The first color graphics adapter for the PC (CGA) was replaced with EGA,
|
|
which was then replaced by VGA, which eventually gave way to several different Super Video Graphics Array (SVGA) cards.\par
|
|
\par
|
|
|
|
SVGA cards have a couple of properties that make them less than perfect targets for the output of Macintosh emulators. First, the default is for SVGA's video memory to only be mapped into the PC address space through a 64k window (or bank). If you want to display 640x480x8 bits you need to write 64k of information to the 64k screen address range, then tell the video card that you want that same address to represent a different 64k chunk of the screen, then you write to that address range again, then you switch banks again, and so forth.\par
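The bank-switching loop can be sketched as follows. The struct svga model is a stand-in invented for this example: a real card is programmed through its own registers or a BIOS call, and the window sits at a fixed address rather than being a pointer into an ordinary array.\par

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BANK_SIZE 0x10000u   /* the 64k window */

/* Hypothetical model of a banked card. */
struct svga {
    uint8_t *vram;       /* all of video memory (not directly visible on a PC) */
    unsigned bank;       /* currently selected bank */
    unsigned switches;   /* number of bank switches performed so far */
};

static void set_bank(struct svga *c, unsigned bank)
{
    if (c->bank != bank) {
        c->bank = bank;
        c->switches++;
    }
}

/* Copy len bytes to frame buffer offset off, one bank-sized piece at a time. */
void banked_write(struct svga *c, size_t off, const uint8_t *src, size_t len)
{
    while (len > 0) {
        size_t in_bank = off % BANK_SIZE;
        size_t chunk = BANK_SIZE - in_bank;
        uint8_t *window;

        if (chunk > len)
            chunk = len;
        set_bank(c, (unsigned)(off / BANK_SIZE));
        window = c->vram + (size_t)c->bank * BANK_SIZE;   /* what the window maps */
        memcpy(window + in_bank, src, chunk);
        off += chunk;
        src += chunk;
        len -= chunk;
    }
}
```

A full 640x480 screen at 8 bits per pixel is 307200 bytes, so it spans five banks and a complete redraw starting in bank 0 needs four bank switches.\par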
|
|
\par
|
|
The second major complication is that under DPMI, the address space that contains the SVGA video memory is not in the same address space
|
|
that a 32-bit application uses. For those of you used to programming in a flat address space, it might be hard to believe that you need special machine language address space overriding prefixes to access screen memory, but
|
|
under DPMI 0.9 (which is the version of DPMI that Microsoft supports; we wouldn't have to do this under 1.0) "selector" overrides really are necessary.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 \page Blitter Overview\par
|
|
\pard\plain \f16\fs20
|
|
A Region is a data structure that describes a set of pixels. Regions can be created by the application by calling various MacOS toolbox routines. In addition the toolbox routines themselves sometimes create Regions for their own purposes. \par
|
|
\par
|
|
A blitter is a set of software or hardware which takes sets of bits, representing pixels,
|
|
and combines them with other sets of bits in a variety of different ways. A Region blitter is a blitter that processes pixels by Regions (rather than by rectangles or rectangle lists).\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 A Simple Blitter \par
|
|
\pard\plain \f16\fs20 One way to write a simple Region blitter is to start with a subroutine that parses the start/stop pairs of a Region scanline and draws the corresponding pixels. This subroutine is then called once for each row of pixels to be displayed.\par
|
|
\par
|
|
|
|
Unfortunately, this approach is slow since each scanline gets re-parsed every time it is drawn. The Region for a 300 pixel tall rectangle consists of a single scanline with a repeat count of "300"; this "simple Region blitter" will parse that scanline 300 times! That's a lot of redundant work.\par
|
|
\par
|
|
There are many possible ways to get away with parsing each scanline only once. One approach is to convert the start/stop pairs into a bit mask where the bits in the mask correspond to the bits in the target bitmap that are to be changed. The inner blitting loop then becomes an exercise in bitwise arithmetic. In C, such a loop might look something like this:{\f22 \par
|
|
\par
|
|
for (x = left; x < right; x++)\par
|
|
dst[x] = (dst[x] & ~mask[x]) \par
|
|
\tab | (pattern_value & mask[x]);\par
|
|
\par
|
|
}That's not bad, but we can do better.\par
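For completeness, here is one way the mask array used by that loop could be built from a scanline's start/stop pairs. It is a deliberately naive sketch under stated assumptions: one bit per pixel, the leftmost pixel in the most significant bit of each 32-bit word, half-open [start, stop) pairs with non-negative columns, and a function name of our own choosing.\par

```c
#include <stddef.h>
#include <stdint.h>

/* pairs holds npairs (start, stop) pixel columns, half-open and sorted.
   mask receives nwords 32-bit words, one bit per pixel, leftmost pixel
   in the most significant bit. */
void scanline_to_mask(const int *pairs, size_t npairs,
                      uint32_t *mask, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++)
        mask[w] = 0;

    for (size_t i = 0; i < npairs; i++) {
        for (int x = pairs[2 * i]; x < pairs[2 * i + 1]; x++) {
            size_t word = (size_t)x / 32;
            if (word < nwords)
                mask[word] |= 0x80000000u >> (x % 32);
        }
    }
}
```

The mask is computed once per distinct scanline and then reused for every row that shares it.\par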
|
|
\pard\plain \s253\li360\sb120 \b\f16 A Dynamically Recompiling Blitter \par
|
|
\pard\plain \f16\fs20 Using an explicit bit mask array is unnecessarily slow in the common case of filling a rectangle. For a rectangular Region, mask[x] is usually all one bits, making the bit munging a waste of time. And even when the masks are never solid (e.g. when drawing a thin vertical line), this technique is still unnecessarily slow. As it turns out, even the cycles the CPU spends loading mask bits from memory are unnecessary. Furthermore, even if we were satisfied with the level of performance that C code like the above provides, we couldn't use it on a stock SVGA system because it wouldn't know how to access the SVGA portion of memory.\par
|
|
\par
|
|
|
|
Executor's blitter uses the techniques of partial evaluation and dynamic code generation to eliminate redundant work and also give us access to SVGA memory. On the 80x86 each scanline is quickly translated into executable code, and that code gets executed
|
|
once each time the scanline needs to be drawn. On non-80x86 platforms, each scanline is compiled into threaded code which is executed by a machine-generated interpreter to draw the scanlines.\par
|
|
\par
|
|
|
|
Before describing how the dynamic compilation process works, let's take a look at an example. Consider the case where a 401x300 rectangle is to be filled with white pixels (pixel value zero on the Macintosh). This might happen, for example, when erasing a window. Furthermore, let's assume that the target bitmap has four bits per pixel, since that's somewhat trickier to handle than 8 bits per pixel. Figure 4 shows the subroutine that Executor dynamically generates to draw this rectangle on a Pentium.
|
|
\par
|
|
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 loop:\tab andl\tab $0xff,0x50(%edi)\tab \tab ; clear leftmost 6 boundary pixels\par
|
|
\tab addl\tab $0x54,%edi\tab \tab \tab ; set up pointer for loop\par
|
|
\tab movl\tab $0x31,%ecx\tab \tab \tab ; set up loop counter\par
|
|
\tab rep\par
|
|
\tab stosl\tab \tab \tab \tab \tab ; slam out 49 aligned longs\par
|
|
\tab andl\tab $0xffff0f00,0x0(%edi)\tab ; clear 3 right boundary pixels\par
|
|
\tab addl\tab $0x28,%edi\tab \tab \tab ; move to next row\par
|
|
\tab decl\tab %edx\tab \tab \tab \tab ; decrement # of rows left\par
|
|
\tab jne\tab loop\tab \tab \tab \tab ; continue looping if appropriate\par
|
|
\tab ret\tab \tab \tab \tab \tab ; we're done!\par
|
|
\par
|
|
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\plain \f16 Figure 4. Dynamically generated blitting code}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 \par
|
|
\page \page This code, when called with the proper values in its input registers, will draw the entire rectangle. Note how the inner loop is merely a "{\f22 rep ; stosl}"...it doesn't get much more concise than that! The astute reader will know that on certain 80x86 processors "rep ; stosl" is not the fastest possible way to set a range of memory. This is true, but because our code generation is dynamic, in the future we can tailor the specific code sequence generated to the processor on which Executor is currently running. The blitter already does this when it needs to emit a byte swap; on the 80486 and up we use the "bswap" instruction, and on the 80386 (which doesn't support "bswap") we use a sequence of rotates.\par
|
|
\par
|
|
One thing you may notice in this example is that the bit masks used to clear the boundary pixels look strange. They are actually correct, since 80x86 processors are little endian.\par
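To see where the two masks in Figure 4 come from, the sketch below computes the AND mask for clearing a run of pixels inside one long. It assumes Macintosh pixel order (pixel 0 in the most significant bits of the first byte) and a little endian host; the function name is ours.\par

```c
#include <assert.h>
#include <stdint.h>

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* AND mask, as a little endian host sees it, that clears count pixels
   starting at pixel first within one 32-bit long. */
uint32_t boundary_and_mask(unsigned first, unsigned count, unsigned bpp)
{
    unsigned start_bit = first * bpp;
    unsigned nbits = count * bpp;
    uint32_t ones, clear_be;

    assert(nbits > 0 && start_bit + nbits <= 32);
    ones = (nbits == 32) ? 0xffffffffu : ((1u << nbits) - 1u);
    clear_be = ones << (32 - start_bit - nbits);   /* big endian view */
    return swap32(~clear_be);                      /* what the 80x86 must use */
}
```

At four bits per pixel, clearing the last six pixels of a long gives 0x000000ff and clearing the first three gives 0xffff0f00, which are the two immediates in Figure 4.\par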
|
|
\par
|
|
Unlike some processors, such as the 68040, the 80x86 instruction and data caches are always coherent. Consequently, no cache flushes need to be performed before the dynamically created code can be executed.\par
|
|
\par
|
|
Figure 5 contains another example, this time drawn from a real application. The program "Globe", by Paul Mercer, draws a spinning globe on the screen as fast as it can. Each "globe frame" is a 128x128 Pixmap. Here is the code that Executor generates and runs when Globe uses CopyBits to transfer one frame to the screen at 8 bits per pixel.\par
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 \f16\fs20 {\f22 loop:\tab movl $0x20,%ecx\tab \tab ; set up loop counter for 32 longs\par
\tab rep\par
\tab movsl\tab \tab \tab \tab ; copy one row (128 bytes)\par
\tab addl $0xffffff00,%esi\tab ; advance to previous src row\par
\tab addl $0xfffffd00,%edi\tab ; advance to previous dst row\par
\tab decl %edx\tab \tab \tab ; decrement # of rows remaining\par
\tab jne loop\par
\tab ret\par
}\pard \qc\box\brdrs \phmrg\posxc\posyb\dxfrtext180\shading1000 {\plain \f16 Figure 5. Blitting code from Globe}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 \par
Again the inner loop is very tight, just a "rep ; movsl" this time.\par
\pard\plain \s253\li360\sb120 \b\f16 \page Meta-Assembler \par
\pard\plain \f16\fs20 
No matter how fast the generated code, if Executor spends too much time generating that code then any speedup will be negated by the increased time required for dynamic compilation. Consequently, the dynamic compilation from Region to 80x86 code needs to be fast. We solved this problem with a "meta-assembler" written in Perl.\par
\par
Whereas an assembler tells a computer how to translate assembly instructions into machine code, our meta-assembler tells the computer how to generate tiny translators. These translators will then be used to translate pixel manipulation requests into machine code. Another way of looking at it is that the meta-assembler generates code that generates code. This meta-assembly process is done only once: when Executor is compiled.\par
\par
The blitter operates on aligned longs in the destination bitmap. As the compilation engine strides through the scanline's start/stop pairs from left to right, it identifies which bits in each long are part of the Region and determines which of several pixel manipulation requests to issue to the tiny translators that were created by the meta-assembler.\par
\par
\pard \fi-180\li540 \bullet Some but not all bits in the current long are in the Region.\par
\par
\bullet All bits in the current long are in the Region.\par
\par
\bullet All bits in this long and the next long are in the Region.\par
\par
\bullet All bits in this long and the next two longs are in the Region.\par
\par
\bullet All bits in this long and the next three longs are in the Region.\par
\par
\bullet More than four contiguous longs are completely in the Region, and the number of longs equals 0 mod 4.\par
\par
\bullet More than four contiguous longs are completely in the Region, and the number of longs equals 1 mod 4.\par
\par
\bullet More than four contiguous longs are completely in the Region, and the number of longs equals 2 mod 4.\par
\par
\bullet More than four contiguous longs are completely in the Region, and the number of longs equals 3 mod 4.\par
\pard \par
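The nine cases in the list above amount to a small classification function. The enum and function names below are hypothetical; the sketch only shows how a count of fully covered longs maps onto a case.\par

```c
/* Hypothetical names for the nine pixel manipulation requests. */
enum blit_case {
    CASE_PARTIAL,       /* some but not all bits of the current long */
    CASE_1_LONG,        /* exactly one full long */
    CASE_2_LONGS,
    CASE_3_LONGS,
    CASE_4_LONGS,
    CASE_MANY_MOD_0,    /* more than four full longs, count mod 4 == 0 */
    CASE_MANY_MOD_1,
    CASE_MANY_MOD_2,
    CASE_MANY_MOD_3
};

/* full_longs is the number of contiguous longs, starting at the current
   one, that lie entirely inside the Region; 0 means the current long is
   only partly covered. */
enum blit_case classify_run(unsigned full_longs)
{
    if (full_longs == 0)
        return CASE_PARTIAL;
    if (full_longs <= 4)
        return (enum blit_case)(CASE_1_LONG + (full_longs - 1));
    return (enum blit_case)(CASE_MANY_MOD_0 + full_longs % 4);
}
```

For the 49 aligned longs of Figure 4 this yields the "more than four, 1 mod 4" case.\par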
|
|
The particular case encountered determines which function pointer to load from a lookup table corresponding to the current drawing mode. For example, the "patCopy" drawing mode has one table of function pointers, "patXor" another. There are also some special case tables for drawing patterns that are either all zero bits or all one bits.\par
|
|
\par
|
|
The main blitter doesn't care what drawing mode is being used, since it does all mode-specific work through the supplied function pointer table.\par
|
|
\par
|
|
Each function pointer points to a function that generates 80x86 code for the appropriate case. For example, one function generates code for a "patCopy" to three contiguous longs, one generates code for "patXor" only to certain specified bits within one long, etc.\par
|
|
\par
|
|
|
|
The blitter compilation engine marches through the Region scanline from left to right, calling code generation functions as it goes. The generated code accumulates in a 32-byte-aligned buffer on the stack. In this way, the blitter constructs a subroutine to draw the Region.\par
|
|
\par
|
|
The compilation engine isn't very complicated. The tricky part is the numerous code generation subroutines, which need to be fast, since they are called so often, and easy to write, since there are so many of them. For each drawing mode there's one for each case the compilation engine cares about. For pattern drawing modes, there are separate specialized subroutines for patterns that can be entirely expressed in one 32-bit value ("short/narrow" patterns), for patterns that can be expressed as one 32-bit value per row but vary from row to row ("tall/narrow" patterns), and for "wide" variants of both. Beyond that, there are some versions specialized for 80486 and higher processors (which have the "bswap" instruction).\par
|
|
\par
|
|
Generating fast and robust code generators is where the Perl meta-assembler comes into play.\par
|
|
\par
|
|
The meta-assembler takes as input an assembly language template, and generates as output Pentium-scheduled assembly code that outputs an 80x86 binary for the input template. This process only takes place when Executor is compiled.
|
|
Got it? This can be a little confusing, so a few examples are in order.\par
|
|
\par
|
|
Here is perhaps the simplest template:\par
|
|
\par
|
|
{\f22 @meta copy_short_narrow_1\par
|
|
\tab movl\tab %eax,@param_offset@(%edi)\par
|
|
@endmeta\par
|
|
}\par
|
|
This template describes what should be done when the blitter wants to write one long to memory. The meta-assembler processes that into this 80x86 assembly code which is to be called by the blitter compilation engine:\par
|
|
\par
|
|
{\f22 \tab .align\tab 4,0x90\par
|
|
_xdblt_copy_short_narrow_1:\par
|
|
\tab movw\tab $0x8789,(%edi)\par
|
|
\tab movl\tab %eax,2(%edi)\par
|
|
\tab addl\tab $6,%edi\par
|
|
\tab ret\par
|
|
}\par
|
|
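The subroutine that the meta-assembler has produced above, when executed, will emit the machine code for the movl instruction in the template: its fixed opcode bytes followed by its 32-bit offset operand. The meta-assembler has deduced that the "movl" in the example template assembles to the opcode byte 0x89 followed by the ModRM byte 0x87, which the subroutine stores as the single little-endian 16-bit constant 0x8789.\par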
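\par
To make the byte layout concrete, here is a minimal C sketch of what the generated subroutine does. It is an illustration only: the function name and the use of C are assumptions, not Executor's actual code. The bytes come from the listing above: 0x89 and 0x87 are the two constant bytes behind the 16-bit value 0x8789, and the next four bytes hold the offset operand in little-endian order.\par
\par
```c
#include <stdint.h>

/* Hypothetical C counterpart of _xdblt_copy_short_narrow_1.
   Writes the 6-byte encoding of "movl %eax,param_offset(%edi)" at out
   and returns the address just past it. */
uint8_t *emit_copy_short_narrow_1(uint8_t *out, uint32_t param_offset)
{
    out[0] = 0x89;                            /* opcode: movl r32 to r/m32 */
    out[1] = 0x87;                            /* ModRM: %eax, disp32(%edi) */
    out[2] = (uint8_t)(param_offset);         /* disp32, low byte first */
    out[3] = (uint8_t)(param_offset >> 8);
    out[4] = (uint8_t)(param_offset >> 16);
    out[5] = (uint8_t)(param_offset >> 24);
    return out + 6;                           /* like "addl $6,%edi" */
}
```
\par
Calling this with an offset of 64 produces the bytes 89 87 40 00 00 00 and advances the output pointer by six, just as the assembly version advances %edi.\par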
|
|
\par
|
|
\page Let's take a look at a more complicated template. This template handles the case where we want to bitwise OR a pattern to t
|
|
he destination bitmap, and the number of longs to transfer equals zero mod 4 (e.g. if the blitter wants to OR 36 longs to memory):\par
|
|
\par
|
|
{\f22 @meta or_short_narrow_many_mod_0\par
|
|
\tab addl\tab $@param_offset@,%edi\par
|
|
\tab movl\tab $@param_l_cnt_div_4@,%ecx\par
|
|
1:\tab orl\tab %eax,(%edi)\par
|
|
\tab orl\tab %eax,4(%edi)\par
|
|
\tab orl\tab %eax,8(%edi)\par
|
|
\tab orl\tab %eax,12(%edi)\par
|
|
\tab addl\tab $16,%edi\par
|
|
\tab decl\tab %ecx\par
|
|
\tab jnz\tab 1b\par
|
|
@lit\tab leal\tab (%eax,%edx,4),%ecx\par
|
|
@lit\tab addl\tab %ecx,edi_offset\par
|
|
@endmeta\par
|
|
}\par
|
|
The meta-assembler compiles that to this:\par
|
|
\par
|
|
{\f22 \tab .align\tab 4,0x90\par
|
|
_xdblt_or_short_narrow_many_mod_0:\par
|
|
\tab movw\tab $0xC781,(%edi)\par
|
|
\tab movl\tab %eax,2(%edi)\par
|
|
\tab movl\tab $0x47090709,11(%edi)\par
|
|
\tab movb\tab $0xB9,6(%edi)\par
|
|
\tab movl\tab $0x8470904,15(%edi)\par
|
|
\tab movl\tab $0x754910C7,23(%edi)\par
|
|
\tab movl\tab $0x830C4709,19(%edi)\par
|
|
\tab movb\tab $0xEF,27(%edi)\par
|
|
\tab movl\tab %edx,%ecx\par
|
|
\tab shrl\tab $2,%ecx\par
|
|
\tab movl\tab %ecx,7(%edi)\par
|
|
\tab addl\tab $28,%edi\par
|
|
\tab leal\tab (%eax,%edx,4),%ecx\par
|
|
\tab addl\tab %ecx,edi_offset\par
|
|
\tab ret\par
|
|
}\par
|
|
This mechanically generated subroutine generates the executable 80x86 binary for the "or_short_narrow_many_mod_0" template. It gets called by the blitter compilation engine when it needs code to OR a bunch of longs to memory. \par
|
|
\par
|
|
The output of the meta-assembler isn't meant for human consumption. As such, the output contains a hodge-podge of magic numbers ({\f22 0x47090709, 0xB9, 0x8470904,} etc.). These numbers are fixed machine code values corresponding to opcodes,
|
|
constant operands, and other values.\par
|
|
\par
|
|
|
|
Even though this subroutine is longer than the previous example, it still doesn't take very long to execute. Furthermore, it only gets called when the blitter has determined that many longs are to be ORed to memory, so the time taken actually blitting to
|
|
memory will typically dwarf the time taken to execute these 15 code generation instructions.\par
|
|
\par
|
|
The meta-assembler is a Perl script that works by running numerous syntactically modified versions of the assembly template t
|
|
hrough "gas", the GNU assembler, and examining the output bytes to discover which bits are fixed opcode bits and which bits correspond to operands. Once it has figured out what goes where, it generates 80x86 assembly code which writes out the constant byt
|
|
es and computes and writes out the operand bytes. That code is run through a simple Pentium instruction scheduler and the meta-assembler is done. This entire process is, of course, done only once, when Executor is compiled.\par
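\par
The discovery step can be pictured with a much simplified C sketch. It assumes the template has already been assembled twice with two different operand values and compares the results byte by byte; the real meta-assembler is a Perl script that drives gas and works out individual bits, so the function name and the byte-level granularity here are assumptions made for illustration.\par
\par
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given two encodings of the same template that
   differ only in an operand value, flag each byte that changed.
   Returns the number of bytes flagged as operand bytes. */
size_t find_operand_bytes(const uint8_t *enc_a, const uint8_t *enc_b,
                          size_t len, uint8_t *is_operand)
{
    size_t i, n_operand = 0;

    for (i = 0; i < len; i++) {
        /* Bytes that stay the same are fixed opcode bytes. */
        is_operand[i] = (uint8_t)(enc_a[i] != enc_b[i]);
        n_operand += is_operand[i];
    }
    return n_operand;
}
```
\par
For instance, "movl %eax,0x11111111(%edi)" assembles to 89 87 11 11 11 11 and "movl %eax,0x22222222(%edi)" to 89 87 22 22 22 22, so the first two bytes are reported as fixed and the last four as the operand.\par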
|
|
\pard\plain \s253\li360\sb120 \b\f16 A Portable Dynamically Recompiling Blitter\par
|
|
\pard\plain \f16\fs20 Although the meta-assembler-based blitter works only on 80x86 processors, Executor itself can run on non-Intel processors. On other CPUs (such as the 68040 used in the NeXTstation) Executor's blitter works somewhat differently.
|
|
\par
|
|
\par
|
|
|
|
The basic idea is still the same: translate Region scanlines into an efficient form once and then use that efficient form each time the scanline gets drawn. This time, however, the "efficient form" is processor independent, and the blitter is written enti
|
|
rely in C.\par
|
|
\par
|
|
As is the case with the 80x86-specific blitter, the portable blitter compilation
|
|
engine examines scanline start/stop pairs and identifies which of several cases is appropriate. One case is "output three longs", another is "output only certain pixels within the current long", and so on.\par
|
|
\par
|
|
Like the 80x86-specific blitter, the particular case encountered determines which entry in a lookup table will be used. But there the similarity ends. The lookup tables contain pointers to C code labels{\fs18\up6 \chftn {\footnote \pard\plain \s246
|
|
\f16\fs20 {\fs18\up6 \chftn }"What the heck is a pointer to a C code label?", you ask? gcc (the GNU C compiler) has a "pointer to label" extension to the C language which makes the statement "{\f22 &&my_label"} evaluate to a\par
|
|
"{\f22 void *}" that points to the compiled code for "{\f22 my_label:}" within a C function. This, combined with gcc's "{\f22 goto void *}" extension, allows C programs to execute goto statements whose destinations are not known at compile time.}}
|
|
 rather than to routines that generate 80x86 code on the fly.\par
|
|
\par
|
|
Each scanline gets translated into an array of opcodes for the "blitter opcode interpreter" (which will be described below). Each opcode is stored in one of these C structs:\par
|
|
\par
|
|
{\f22 struct\par
|
|
\{\par
|
|
/* Pointer to C code to handle\par
|
|
this opcode. */\par
|
|
const void *label;\par
|
|
\par
|
|
/* Offset into scanline */\par
|
|
int32 offset;\tab \tab \par
|
|
\par
|
|
/* Extra operand with\par
|
|
different uses. */\par
|
|
int32 arg;\tab \tab \par
|
|
\};\par
|
|
}\par
|
|
For example, consider the case where the blitter wants to write out five contiguous longs from a "simple" pattern starting 64 bytes into the current row.
|
|
 In this case, "label" would equal "&&copy_short_narrow_many_5", "offset" would equal 64, and "arg" would equal 5.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 The Blitter Opcode Interpreter\par
|
|
\pard\plain \f16\fs20
|
|
The blitter opcode interpreter is machine generated C code created by a Perl script when Executor is compiled. That Perl script takes as input C code snippets that tell it how to handle particular drawing modes, and produces as output C code for an interp
|
|
reter.\par
|
|
\par
|
|
Here is the template taken as input by the Perl script for the "copy_short_narrow" case. This is the simple case where the pixels for the pattern being displayed can be stored entirely within one 32-bit long (for example, solid white or solid black).
|
|
\par
|
|
\par
|
|
{\f22 begin_mode cpy_shrt_narrow max_unwrap\par
|
|
repeat\tab @dst@ = v;\par
|
|
mask\tab @dst@ = (@dst@ & ~arg)\par
|
|
\tab \tab \tab | (v & arg);\par
|
|
end_mode\par
|
|
}\par
|
|
The "{\f22 repeat}" field tells the Perl script what C code to generate for the simple case where all pixels in the destination long are to be affected. The "{\f22 mask}" field tells it what to do when it must modify only certain bits in the target long and leave the others alone. The "{\f22 max_unwrap}" keyword tells the Perl script to unroll the generated blitting loop.\par
|
|
\par
|
|
The generated interpreter takes as input an array of blitter opcode structs, which it then proceeds to interpret once for each row to be drawn.\par
|
|
\par
|
|
|
|
Here is the section of the (machine-generated) interpreter that handles the copy_short_narrow cases. Remember that each "blitter opcode" is really just a pointer to one of these C labels. This code would get used when filling a rectangle with a solid col
|
|
or.\par
|
|
\par
|
|
{\f22 copy_short_narrow_mask:\par
|
|
*dst = (*dst & ~arg) | (v & arg);\par
|
|
JUMP_TO_NEXT;\par
|
|
copy_short_narrow_many_loop:\par
|
|
dst += 8;\par
|
|
copy_short_narrow_many_8:\par
|
|
dst[0] = v;\par
|
|
copy_short_narrow_many_7:\par
|
|
dst[1] = v;\par
|
|
copy_short_narrow_many_6:\par
|
|
dst[2] = v;\par
|
|
copy_short_narrow_many_5:\par
|
|
dst[3] = v;\par
|
|
copy_short_narrow_many_4:\par
|
|
dst[4] = v;\par
|
|
copy_short_narrow_many_3:\par
|
|
dst[5] = v;\par
|
|
copy_short_narrow_many_2:\par
|
|
dst[6] = v;\par
|
|
copy_short_narrow_many_1:\par
|
|
dst[7] = v;\par
|
|
if ((arg -= 8) > 0)\par
|
|
goto copy_short_narrow_many_loop;\par
|
|
JUMP_TO_NEXT;\par
|
|
}\par
|
|
Note how the inner blitting loop is "unwrapped" for speed. A blitter opcode would specify that 39 longs are to be output by making its "arg" field be 39 and the "label" field point to "copy_short_narrow_many_7", in the middle of the unwrapped loop
|
|
(39 mod 8 equals 7). The interpreter would jump there and loop until all of the pixels had been written out, at 32 bytes per loop iteration. This is very fast, especially for portable code.\par
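\par
The trick of entering an unrolled loop in the middle can also be written in standard C with a switch statement (Duff's device). The sketch below uses that swapped-in technique rather than label pointers, so it is a relative of the generated interpreter code and not a copy of it; the function name fill_longs is hypothetical.\par
\par
```c
#include <stdint.h>

/* Store v into count consecutive longs, eight stores per loop pass.
   The switch jumps into the middle of the unrolled loop so that the
   first, partial pass handles count mod 8 longs. */
void fill_longs(uint32_t *dst, int32_t count, uint32_t v)
{
    int32_t passes;

    if (count <= 0)
        return;
    passes = (count + 7) / 8;        /* 39 longs need 5 passes */
    switch (count % 8) {             /* 39 mod 8 is 7 */
    case 0: do { *dst++ = v;
    case 7:      *dst++ = v;
    case 6:      *dst++ = v;
    case 5:      *dst++ = v;
    case 4:      *dst++ = v;
    case 3:      *dst++ = v;
    case 2:      *dst++ = v;
    case 1:      *dst++ = v;
            } while (--passes > 0);
    }
}
```
\par
With a count of 39 the first pass performs seven stores and the remaining four passes perform eight each, which matches the arithmetic described above.\par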
|
|
\par
|
|
Of course, if any other pixels needed to be drawn, there would be additional blitter opcode structs telling the interpreter what to do. The interpreter dispatches to the next opcode by executing the "JUMP_TO_NEXT" macro, which automatically uses GCC
|
|
's "goto void *" extension to "goto" the C label that handles the next opcode.\par
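\par
A minimal, self-contained sketch of this style of dispatch follows. It assumes GCC (or a compatible compiler) for the label-address and computed goto extensions. The names blt_rows, blt_req and MAX_OPS and the two toy opcodes are invented for illustration, offsets are taken to be in bytes, and the real machine-generated interpreter has many more cases.\par
\par
```c
#include <stddef.h>
#include <stdint.h>

struct blt_op {
    const void *label;  /* C label that handles this opcode */
    int32_t offset;     /* byte offset into the scanline */
    int32_t arg;        /* bit mask, or count of longs */
};

/* Hypothetical scanline description handed to the translator. */
enum blt_kind { BLT_MASK, BLT_MANY, BLT_END };
struct blt_req {
    enum blt_kind kind;
    int32_t offset;
    int32_t arg;
};

#define MAX_OPS 16

/* Fetch the operands of the current opcode, then jump to its handler. */
#define JUMP_TO_NEXT do { dst = row + op->offset / 4; arg = op->arg; goto *(op++)->label; } while (0)

void blt_rows(uint32_t *base, size_t row_longs, int n_rows,
              const struct blt_req *req, uint32_t v)
{
    const void *const handlers[] = { &&op_mask, &&op_many };
    struct blt_op prog[MAX_OPS];
    const struct blt_op *op;
    uint32_t *row, *dst;
    int32_t arg;
    int n, r;

    /* Translate the scanline description into opcodes, once. */
    for (n = 0; n < MAX_OPS - 1 && req[n].kind != BLT_END; n++) {
        prog[n].label = handlers[req[n].kind];
        prog[n].offset = req[n].offset;
        prog[n].arg = req[n].arg;
    }
    prog[n].label = &&op_end;
    prog[n].offset = 0;
    prog[n].arg = 0;

    /* Interpret the same opcode array for every row. */
    for (r = 0; r < n_rows; r++) {
        row = base + (size_t)r * row_longs;
        op = prog;
        JUMP_TO_NEXT;
    op_mask:    /* change only the bits selected by arg */
        *dst = (*dst & ~(uint32_t)arg) | (v & (uint32_t)arg);
        JUMP_TO_NEXT;
    op_many:    /* write arg whole longs */
        while (arg-- > 0)
            *dst++ = v;
        JUMP_TO_NEXT;
    op_end:
        ;
    }
}
```
\par
The translation and the interpretation are kept inside one function here so that every label address is taken and used by the same piece of compiled code.\par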
|
|
\pard\plain \s255\sb240 \b\f21\ul Development Tools\par
|
|
\pard\plain \s254\sb120 \b\f21 Free Software\par
|
|
\pard\plain \f16\fs20 It is true that ARDI has a very tight R&D budget, but we really don't skimp on the tools that we use to build Executor. We use free software to develop Executor because we like to push our tools very hard, and the only way we can do that and still sleep at night is to know that any bugs we find in those tools can be fixed quickly. With free software the worst case is that we fix the bugs ourselves, and that worst case is actually much better than the average case with non-free software, where you report a bug and pray for a patch. In reality we rarely have to resort to the worst case, since the bugs we report are often fixed in less than a day.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 GCC\par
|
|
\pard\plain \f16\fs20 GCC is the Free Software Foundation's C compiler. It produces good code and has a powerful inline assembly syntax that allows optimization to be done on the expressions in
|
|
the inline assembly without the optimization ruining the assembly you've written.\par
|
|
\par
|
|
Another handy GCC extension is "{\f22 typeof}", which can be used in macros to cast a value to the type of a different value. The combination of powerful inline assembly and typeof gives us efficient macros that swap the bytes in a 16-bit or 32-bit quantity. Since the Mac and the PC differ in endianness, quick byte swapping routines are very important.\par
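\par
As a hedged illustration, a portable version of such a macro might look like the following. This sketch uses shifts rather than inline assembly, spells the extension __typeof__, and the names swap16, swap32 and SWAP are assumptions rather than Executor's own.\par
\par
```c
#include <stdint.h>

/* Exchange the two bytes of a 16-bit value. */
static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Reverse the four bytes of a 32-bit value. */
static inline uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8)
         | ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

/* Swap a 16-bit or 32-bit quantity and cast the result back to the
   type of the argument, so callers need no casts of their own. */
#define SWAP(x) ((__typeof__(x))(sizeof(x) == 2 ? swap16((uint16_t)(x)) : swap32((uint32_t)(x))))
```
\par
For example, SWAP applied to the 16-bit value 768 (0x0300) yields 3, and the result has the same 16-bit type as the argument.\par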
|
|
\par
|
|
As mentioned above in our synthetic CPU and portable blitter descriptions, we also use GCC's ability to take the address of a label and store it in a variable so that we can produce our own threaded code on the fly.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 Hacked GCC\par
|
|
\pard\plain \f16\fs20 Because the source to GCC is available, it is possible, although not necessarily advisable, to hack in custom extensions. At ARDI we've done this twice in the past. At one time we used a set of locally written modificat
|
|
ions to support the pascal keyword so that we could automatically call functions using Pascal calling conventions. At the same time we also supported '{\f22 1234'} (i.e.
|
|
the ability to construct a 32-bit quantity out of four character constants inside apostrophes). Eventually we decided that we didn't get enough benefit from these extensions to make it worth patching new versions of GCC as they came out.\par
|
|
\par
|
|
The other time we modified GCC was when we were porting Executor to DEC's Alpha processor. We were doing this under OSF/1, which uses 64-bit pointers. Since Executor needs to use the same internal representation that Macs use, we wanted an easy way to write 32-bit pointers to memory such that they would be extended to 64 bits when they were read into a register for use. To do this we made GCC support "pointer bit-fields", a logical extension that allows bit-field notation to be used when declaring pointers. At that time we didn't have a resident GCC expert, so we were lucky that the modifications basically consisted of taking out a few checks that disallowed such constructs. Once those checks were removed, pointer bit-fields "just worked".\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 DJGPP\par
|
|
\pard\plain \f16\fs20 DJGPP is DJ Delorie's (see {\f22 http://www.delorie.com}) port of GCC to MS-DOS. It allows DOS users to compile UNIX programs under DOS and to run them with little or no modification. DJGPP consists of GCC and associated development tools, a special UNIX-like C library, and a "DOS extender". DOS extenders are used to combat OS inferiority: DOS is a 16-bit OS, whereas most relatively modern OSes are 32-bit, and a DOS extender allows 32-bit programs to run under DOS. Executor is one such program. In fact, we use the DJGPP libraries and DOS extender, but we don't actually use the DOS port of GCC, because we don't like DOS. We like Linux, and GCC is well structured, so we can cross-compile and cross-link with the DJGPP libraries and build our DOS product entirely under Linux. We then copy the new Executor binary to a DOS partition, reboot to DOS, test Executor, and then get the heck out of DOS. Time spent using Executor feels more like using a Mac than like using DOS.\par
|
|
\pard\plain \s255\sb240 \b\f21\ul \page Debugging Tools\par
|
|
\pard\plain \f16\fs20 Internally we have many debugging tools to help us figure out why an application may die or misbehave under Executor.\par
|
|
\pard\plain \s254\sb120 \b\f21 More Free Software\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 GDB in General\par
|
|
\pard\plain \f16\fs20 Almost all of our debugging is done under the GDB debugger. As with GCC, we're not using GDB because it's the free debugger; we're using the free debugger because it's GDB. GDB is quite powerful.\par
|
|
\par
|
|
Whenever we find that a given application fails under Executor, we try to reproduce the failure under Linux. Debugging on a system that has complete memory protection and pre-emptive multi-tasking means that your system stays up even when your application crashes. There's also no need to worry that a misbehaving program is subtly corrupting other programs on the system.\par
|
|
\pard\plain \s252\li360 \f16\ul Hardware Watchpoints\par
|
|
\pard\plain \f16\fs20 Beyond the features that are handed to us due to the underlying robustness of the OS, GDB also supports hardware watchpoints, at least on 80x86-based PCs. 80x86 processors can use hardware to watch a small set of memory locations to see when they change. Since the checking is done by hardware, the program runs at full speed until a watched memory location is modified, at which point the debugger stops and tells us which instruction modified which memory address and what the old and new values are for that address.\par
|
|
\par
|
|
As an example, assume we want to know when the low-memory global {\f22 TheMenu} changes. Here is how it might look under GDB:\par
|
|
\par
|
|
{\f22 (gdb) watch TheMenu\par
|
|
Hardware watchpoint 1: TheMenu\par
|
|
(gdb) c\par
|
|
Continuing.\par
|
|
Hardware watchpoint 1: TheMenu\par
|
|
\par
|
|
Old value = 0\par
|
|
New value = 768\par
|
|
C_HiliteMenu (mid=3) at menu.c:877\par
|
|
(gdb) swap16 768\par
|
|
$2 = 0x3\par
|
|
(gdb) c\par
|
|
Continuing.\par
|
|
Hardware watchpoint 1: TheMenu\par
|
|
\par
|
|
Old value = 768\par
|
|
New value = 0\par
|
|
C_HiliteMenu (mid=0) at menu.c:877\par
|
|
(gdb) delete 3\par
|
|
(gdb) c\par
|
|
Continuing.\par
|
|
}\par
|
|
At the first {\f22 (gdb) }prompt above, we tell GDB that we want to be alerted whenever the expression "TheMenu" changes. GDB is clever enough to realize that it can watch that expression
|
|
with a hardware watchpoint, so it assigns watchpoint 1 to the task. We then continue, which allows Executor to continue running whatever program it was already running.{\fs18\up6 \chftn {\footnote \pard\plain \s246 \f16\fs20 {\fs18\up6 \chftn }
|
|
I actually set this watchpoint in the session of Executor that I am using to run Word 5.1 for the Macintosh to compose this document (Executor/Linux on a 90 MHz Pentium).}}\par
|
|
\par
|
|
Eventually, when the menu bar was accessed, GDB told us that TheMenu had changed from 0 to 768. 768 may sound like a weird value for TheMenu to take, but this is on a byte-swapped machine, so we need to swap that 16-bit value to see what TheMenu would look like to a Mac program, and we find that it's 3, a sane value for TheMenu after all. We let the program continue and later TheMenu is changed back to zero.\par
|
|
\par
|
|
You can't see it, but in another window the source to Executor is displayed, so that we are automatically shown the 877th line of menu.c when GDB's watchpoint triggers there.\par
|
|
\par
|
|
The argument to the watch command is an arbitrary expression, so it is possible to watch for much more complex changes than our example demonstrates. Only relatively simple expressions can be watched in hardware, however; the rest are handled by software watchpoints, which are much slower.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 \page Hacked GDB\par
|
|
\pard\plain \f16\fs20 Unlike GCC, where we made local modifications and then, upon reflection, threw them out, we have made a slight change to GDB that is a big win for debugging Executor (and Mac programs running under Executor) on PCs. GDB
|
|
always knows how to disassemble the object code that it's running, and GDB is available for many architectures, so we modified GDB so that on the 80x86 we can do both 80x86 disassembly and 680x0
|
|
disassembly. That allows us to look at sections of memory within our emulator and see what 680x0 code is there.\par
|
|
\par
|
|
In the example below, Executor is running the game Risk, when we interrupt Executor and then tell GDB to break in the routine alinehandler. We then continue until alinehandler is hit. We then disassemble, in 680x0 format,
|
|
the first nine instructions at the location from which alinehandler was dispatched. After doing that we disassemble in 80x86 format the first nine instructions of alinehandler itself.\par
|
|
\par
|
|
{\f22 (gdb) b alinehandler\par
|
|
Breakpoint 6 at 0x17ce2d: file executor.c, line 369.\par
|
|
(gdb) c\par
|
|
Continuing.\par
|
|
\par
|
|
Breakpoint 6, alinehandler (pc=3652006, ignored=0x0) at executor.c:369\par
|
|
(gdb) set m68k\par
|
|
(gdb) x/9i pc\par
|
|
0x37b9a6 :\tab _SystemTask\par
|
|
0x37b9a8 :\tab clrw sp@-\par
|
|
0x37b9aa :\tab movew #-1,sp@-\par
|
|
0x37b9ae :\tab pea a5@(-27598)\par
|
|
0x37b9b2 :\tab _GetNextEvent\par
|
|
0x37b9b4 :\tab moveb sp@+,d0\par
|
|
0x37b9b6 :\tab tstb d0\par
|
|
0x37b9b8 :\tab beqw 0x37ba0e <end+667542>\par
|
|
0x37b9bc :\tab movew a5@(-27598),d0\par
|
|
(gdb) set m68k off\par
|
|
(gdb) x/9i alinehandler\par
|
|
<alinehandler>: pushl %ebp\par
|
|
<alinehandler+1>:\tab movl %esp,%ebp\par
|
|
<alinehandler+3>:\tab subl $0x28,%esp\par
|
|
<alinehandler+6>:\tab pushl %esi\par
|
|
<alinehandler+7>:\tab pushl %ebx\par
|
|
<alinehandler+8>:\tab jmp 0x17ce10 <alinehandler+48>\par
|
|
<alinehandler+10>:\tab nop \par
|
|
<alinehandler+11>:\tab nop \par
|
|
<alinehandler+12>:\tab nop} \par
|
|
\par
|
|
Being able to disassemble 680x0 code on the 80x86 required us to change approximately 50 source lines of GDB (remember, the 680x0 disassembly code was already present for use in GDB on 680x0 machines). We also added a set of tables so that a-line traps
|
|
and low-memory globals are displayed by name, rather than by number.\par
|
|
\par
|
|
Although our special circumstances led us to modify the GDB source code, GDB is customizable out of the box. We've defined a handful of macros that automate debugging tasks. Figure 6 is a macro that crawls through the stack in Mac space.\par
\par
For comparison, Figure 7 is what GDB produces when backtracing through code that was compiled with debugging symbols.\par
|
|
\sect \sectd \sbknone\linemod0\linex0\cols1\endnhere \pard\plain \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 \f16\fs20 {\f22 define macktrace\par
|
|
set $_fp = cpu_state.regs[14].ul.n + 0\par
|
|
silentswap32 (((uint32*)$_fp)[1]+0)\par
|
|
set $_pc = $_val + 0\par
|
|
silentswap32 (((uint32*)$_fp)[0]+0)\par
|
|
set $_fp = $_val + 0\par
|
|
while $_fp > 100 && $_fp < 30000000\par
|
|
set $_start = (long) $_pc + 0\par
|
|
while $_start > (long)&end && *(uint16 *)$_start != 0x564E\par
|
|
\tab set $_start = $_start - 2\par
|
|
end\par
|
|
printf "func=0x%lX, ret=0x%lX, fp=0x%lX, args=0x%02X%02X%02X%02X 0x%02X%02X%02X%02X 0x%02X%02X%02X%02X\\n",\\\par
|
|
\tab $_start, $_pc, $_fp,\\\par
|
|
\tab ((uint8 *)$_fp)[8], ((uint8 *)$_fp)[9], ((uint8 *)$_fp)[10],\\\par
|
|
\tab ((uint8 *)$_fp)[11], ((uint8 *)$_fp)[12], ((uint8 *)$_fp)[13],\\\par
|
|
\tab ((uint8 *)$_fp)[14], ((uint8 *)$_fp)[15], ((uint8 *)$_fp)[16],\\\par
|
|
\tab ((uint8 *)$_fp)[17], ((uint8 *)$_fp)[18], ((uint8 *)$_fp)[19]\par
|
|
silentswap32 ((uint32*)$_fp)[1]+0\par
|
|
set $_pc = $_val + 0\par
|
|
silentswap32 ((uint32*)$_fp)[0]+0\par
|
|
set $_fp = $_val + 0\par
|
|
end\par
|
|
end\par
|
|
(gdb) macktrace\par
|
|
func=0x3824F8, ret=0x38250E, fp=0xB28E3C, args=0x00B2E852 0x000300B2 0x8E580037\par
|
|
func=0x37A2BE, ret=0x37A3A6, fp=0xB28E4A, args=0x0037BA12 0x000000B2 0x8F840037\par
|
|
func=0x37AECE, ret=0x37AFF0, fp=0xB28E58, args=0x0001002E 0xE0BC0000 0x00010035\par
|
|
func=0x379D58, ret=0x379E0C, fp=0xB28F84, args=0x000100B2 0x8F9200B2 0x8F9A0000\par
|
|
}\pard \qc\box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\plain \f16 \par
|
|
Figure 6. Macktrace Definition and Example\par
|
|
}{\f22 \par
|
|
}\pard \box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\f22 (gdb) backtrace\par
|
|
#0 C_SysBeep (i=10) at osutil.c:837\par
|
|
#1 0x18934d in PascalToCCall (ignoreme=2271560241, infop=0x29faa4)\par
|
|
at emutrap.c:94\par
|
|
#2 0x17d0c9 in alinehandler (pc=3661160, ignored=0x0)\par
|
|
at executor.c:399\par
|
|
#3 0x1c1b85 in trap_direct (trap_number=10, exception_pc=3661160, \par
|
|
exception_address=0) at trap.c:201\par
|
|
#4 0x197cfc in S68K_HANDLE_0x00B5 () at syn68k.c:1038\par
|
|
#5 0x196067 in interpret_code (start_code=0x2df6c4) at syn68k.c:587\par
|
|
#6 0x12d476 in beginexecutingat (startpc=11730018)\par
|
|
at launch.c:328\par
|
|
#7 0x12e1ce in launchchain (fName=0x2b53f8 "\\004Risk", vRefNum=-32717, \par
|
|
resetmemory=1 '\\001') at launch.c:575\par
|
|
#8 0x12f6e0 in Launch (\par
|
|
fName_arg=0x910 "\\004Riskutor", '\'ff' <repeats 27 times>, vRefNum_arg=-32717)\par
|
|
at launch.c:1142\par
|
|
#9 0x17e1f7 in executor_main ()\par
|
|
at executor.c:589\par
|
|
#10 0x13371a in main (argc=2, argv=0xbffffa04)\par
|
|
at main.c:2112}\par
|
|
\pard \qc\box\brdrs \phmrg\posxc\dxfrtext180\shading1000 {\plain \f16 Figure 7. GDB backtrace}\sect \sectd \sbknone\linemod0\linex0\cols2\endnhere \pard\plain \f16\fs20 \page \page
|
|
As you might guess, this disparity of information makes it much easier for us to track down bugs in our own code than to find bizarre incompatibilities in the code that is being run under the emulator.\par
|
|
\pard\plain \s254\sb120 \b\f21 Disassembler\par
|
|
\pard\plain \f16\fs20 Since GDB already knows how to disassemble 680x0 code it was possible to write a driver for GDB
|
|
so that it can disassemble Mac programs. The driver is about 1,000 lines of C code, with another 500 lines describing the low-memory globals. Basically the driver knows about CODE resources and how intersegment jumps work. GDB
|
|
 normally doesn't produce labels for jump targets or the beginning of subroutines, so the driver adds those too, to make the output that much easier to read.\par
|
|
\pard\plain \s254\sb120 \b\f21 Run-time Aids\par
|
|
\pard\plain \f16\fs20 Because we're using our own set of OS and Toolbox routines, we can add code that is conditionally compiled into debug versions of Executor that can provide still more information than GDB or GDB macros can.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 Debugtable, Debugnumber\par
|
|
\pard\plain \f16\fs20 Our A-line trap handler has a table, known as debugtable, of 4096 32-bit ints that it updates each time a trap is taken. Each time alinehandler is called, a variable known as "debugnumber" is incremented and then the value of debugnumber is stored in the slot in debugtable corresponding to the A-line trap that was called. This allows us to see both which traps were executed most recently and the complete set of traps that an application has made, no matter how long the application has run.\par
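\par
In C the bookkeeping amounts to very little. The sketch below assumes that the slot is chosen by the low 12 bits of the trap word, which is a guess at a natural mapping onto 4096 entries; note_trap and traps_since are hypothetical names, not Executor's.\par
\par
```c
#include <stdint.h>

#define N_ALINE_SLOTS 4096

static uint32_t debugtable[N_ALINE_SLOTS];  /* last "time" each trap ran */
static uint32_t debugnumber;                /* count of traps taken so far */

/* Record that the given A-line trap word was just dispatched. */
void note_trap(uint16_t trap_word)
{
    debugtable[trap_word & 0x0FFF] = ++debugnumber;
}

/* How many traps have been taken since this one last ran,
   or -1 if it has never been called. */
long traps_since(uint16_t trap_word)
{
    uint32_t stamp = debugtable[trap_word & 0x0FFF];

    return stamp ? (long)(debugnumber - stamp) : -1;
}
```
\par
Sorting the nonzero slots by their stored value recovers the order in which the most recent traps were taken, and any slot that is still zero belongs to a trap the application never called.\par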
|
|
\par
|
|
\pard This scheme has its drawbacks. Traps that are dispatched via selectors are all lumped together. Traps whose addresses are taken and then are called by jumps through the address don't show up in debugtable.
|
|
Although debugtable and debugnumber are perhaps the least
|
|
sophisticated portion of Executor, they're still quite handy, since a visual inspection of the last 100 traps made before an application died often gives a good idea of where to start looking for the source of the incompatibility.\par
|
|
\pard\plain \s253\li360\sb120 \b\f16 XX_slam\par
|
|
\pard\plain \f16\fs20 In the course of developing Executor, we did major rewrites of our memory manager and our TextEdit replacement. In both cases it's not enough to just implement the APIs that are defined in Inside Macintosh; we also have to duplicate the in-memory data structures so that programs which count on them will run properly. To help us verify that we weren't adding new bugs when we rewrote those subsystems, we added routines that consistency check the data structures that each of those subsystems supports.\par
|
|
\par
|
|
\pard Because these consistency checks are thorough but time consuming, we call them "slams", and by default they are not enabled, even in debugging versions of Executor. When they are enabled, the data structures for each subsystem are slammed at the entry to any call that might modify one of them, and slammed once again on exit from the routine. We can turn slamming on at run-time, either with a command line option when Executor is started or from within GDB. This is something we should have done for all of Executor's subsystems from day one, since it's ever so helpful to be told that going into routine XXX the heap was fine, but coming out the heap was corrupted.\par
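\par
The shape of a slam can be sketched on a toy data structure. Everything below is hypothetical: the toy_heap layout, the invariant it checks and all of the names are stand-ins chosen for brevity, and the real slams check the actual Memory Manager and TextEdit structures.\par
\par
```c
#include <stddef.h>
#include <stdint.h>

struct toy_heap {
    uint32_t total;           /* bytes managed by the heap */
    size_t n_blocks;          /* blocks currently in use in block_size */
    uint32_t block_size[8];   /* sizes must be nonzero and sum to total */
};

int slam_enabled = 0;         /* off by default, as described above */
int slam_failures = 0;

static int heap_is_consistent(const struct toy_heap *h)
{
    uint32_t sum = 0;
    size_t i;

    if (h->n_blocks > 8)
        return 0;
    for (i = 0; i < h->n_blocks; i++) {
        if (h->block_size[i] == 0)
            return 0;
        sum += h->block_size[i];
    }
    return sum == h->total;
}

static void slam(const struct toy_heap *h)
{
    if (slam_enabled && !heap_is_consistent(h))
        slam_failures++;
}

/* Split block i so that its first part has the given size. */
int split_block(struct toy_heap *h, size_t i, uint32_t first_part)
{
    size_t j;
    int ok = 0;

    slam(h);                                  /* check on entry */
    if (i < h->n_blocks && h->n_blocks < 8
        && first_part > 0 && first_part < h->block_size[i]) {
        for (j = h->n_blocks; j > i + 1; j--)
            h->block_size[j] = h->block_size[j - 1];
        h->block_size[i + 1] = h->block_size[i] - first_part;
        h->block_size[i] = first_part;
        h->n_blocks++;
        ok = 1;
    }
    slam(h);                                  /* check again on exit */
    return ok;
}
```
\par
Because the check runs on both sides of the call, a failure that appears only in the exit check points at the routine itself, while a failure in the entry check points at whatever ran before it.\par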
|
|
\pard\plain \s253\li360\sb120 \b\f16 Image Viewer\par
|
|
\pard\plain \f16\fs20
|
|
Reading disassembled code is much easier than staring at hex numbers. Similarly, being able to view a portion of memory as some sort of PixMap (assuming that the memory really is a bit image) is also better than staring at a bunch of hex numbers. When we
|
|
build Executor for X-Windows, we also build an image server that uses UNIX interprocess communication to communicate with the process being debugged under GDB. This allows us to monitor offscreen graphics, which can be very important when an application
|
|
makes many graphics calls and eventually an abomination is drawn on the screen instead of what should have been drawn.\par
|
|
\par
|
|
\pard Our debugging arsenal includes other, more prosaic, tools. In fact, our debugging environment encourages the development of new tools, because it's so easy to leverage existing tools into new tools and even write new tools from scratch.\par
|
|
\pard\plain \s255\sb240 \b\f21\ul \page Future Plans\par
|
|
\pard\plain \f16\fs20 Much of VCPU, a successor to Syn68k, has already been written. VCPU performs many optimizations that
|
|
Syn68k does not, including improved register allocation, dead subregister elimination, opcode "widening", and moving work outside of loops. VCPU has a clean high-level syntax for specifying both front ends and back ends, allowing it to dynamically compil
|
|
e both PowerPC and m68k binaries on any architecture we decide to support.\par
|
|
\par
|
|
Although we haven't described it above, the graphics subsystem one layer above the blitter already has hooks in it to allow the use of graphics accelerators, where present. We plan native ports to Win32 and OS/2, and those ports should be able to use fancier graphics subsystems and also make use of the underlying network APIs.\par
|
|
\par
|
|
Currently INITs and CDEVs do not run under Executor, but the same mechanisms that allow applications to run can also allow INITs and CDEVs to run. QuickTime and ATM will both be high priorities after Executor 2 ships.\par
|
|
\par
|
|
We will also be developing compiler tools that will allow ISVs to natively compile CPU specific routines to be used when their applications are run under Executor. Executor already uses such gateways internally.\par
|
|
\par
|
|
Already, multiple simultaneous instances of Executor can be run under NEXTSTEP and Linux (and to a lesser extent under Windows 95). Currently only Executor/NEXTSTEP handles PICT pasteboard cutting and pasting from one instance of Executor to another, and no version of Executor does enough file locking to allow concurrent access to the same HFS volume. This needs to be fixed, since running multiple instances of Executor simultaneously can be made fairly efficient, either through shared text segments under UNIX and UNIX-like operating systems or through DLLs under Microsoft operating systems. When that is done, each instance of Executor has its own address space and is automatically scheduled by the underlying operating system scheduler. That means that Executor "inherits" memory protection and pre-emptive multi-tasking from the underlying core operating system.\par
|
|
\par
|
|
By properly exploiting this inheritance it should be possible to provide an environment that allows well-behaved Mac applications to run efficiently under a variety of PC operating systems with automatic protection from non-well-behaved applications.
|
|
\par
|
|
\par
|
|
One interesting variant on this theme would be to use Linux as the core OS, but to hide it from the end-user, for a net result of an 80x86 box that boots an efficient, robust MacOS-like environment. \sect \sectd \sbknone\linemod0\linex0\cols1\endnhere
|
|
\pard\plain \f16\fs20 {\plain \f16 \par
|
|
}} |