mirror of
https://github.com/ctm/executor.git
synced 2024-11-16 05:05:22 +00:00
379 lines
16 KiB
Plaintext
379 lines
16 KiB
Plaintext
[hammer home the distinction between compile-time and run-time]
|
|
|
|
A Region is a data structure that defines a set of pixels. ???
|
|
[briefly describe Apple's Region and ARDI's internal Region. ARDI's
|
|
region doesn't XOR adjacent scanlines, and stores X values in "native
|
|
endian" byte order for speed (y values are still big endian, which can
|
|
be irritating). Mac programs can never see a special region].
|
|
|
|
...We wanted to write a blitter that had good performance in the
|
|
common case, but did not want to spend a great deal of time writing
|
|
code to handle special cases...
|
|
|
|
|
|
|
|
One way to write a simple Region blitter is to start with a subroutine
|
|
that parses the start/stop pairs of a Region scanline and draws the
|
|
corresponding pixels. This subroutine is then called once for each
|
|
row of pixels to be displayed.
|
|
|
|
Unfortunately, this approach is slow since each scanline gets
|
|
re-parsed every time it is drawn. The Region for a 300 pixel tall
|
|
rectangle consists of a single scanline with a repeat count of "300";
|
|
this "simple Region blitter" will parse that scanline 300 times!
|
|
That's a lot of redundant work.
|
|
|
|
There are many possible ways to get away with parsing each scanline
|
|
only once. One approach is to convert the start/stop pairs into a bit
|
|
mask where the bits in the mask correspond to the bits in the target
|
|
bitmap that are to be changed. The inner blitting loop then becomes
|
|
an exercise in bitwise arithmetic. In C, such a loop might look
|
|
something like this:
|
|
|
|
for (x = left; x < right; x++)
|
|
dst[x] = (dst[x] & ~mask[x]) | (pattern_value & mask[x]);
|
|
|
|
That's not bad, but it's unnecessarily slow in the common case of
|
|
filling a rectangle. For a rectangular Region, mask[x] is usually all
|
|
one bits, making the bit munging a waste of time. And even when the
|
|
masks are never solid (e.g. when drawing a thin vertical line), this
|
|
technique is still unnecessarily slow. As it turns out, even the
|
|
cycles the CPU spends loading mask bits from memory are unnecessary.
|
|
|
|
Executor's blitter uses the techniques of partial evaluation and
|
|
dynamic code generation to eliminate redundant work. On the 80x86
|
|
each scanline is quickly translated into executable code, and that
|
|
code gets executed once each time the scanline needs to be drawn. On
|
|
non-80x86 platforms, each scanline is compiled into threaded code
|
|
which is executed by a machine-generated interpreter to draw the
|
|
scanlines.
|
|
|
|
|
|
80x86:
|
|
|
|
|
|
Before describing how the dynamic compilation process works, let's
|
|
take a look at an example. Consider the case where a 401x300
|
|
rectangle is to be filled with white pixels (pixel value zero on the
|
|
Macintosh). This might happen, for example, when erasing a window.
|
|
Furthermore, let's assume that the target bitmap has four bits per
|
|
pixel, since that's somewhat tricker to handle than 8 bits per pixel.
|
|
Here is the subroutine that Executor dynamically generates to draw
|
|
this rectangle on a Pentium:
|
|
|
|
loop: andl $0xff,0x50(%edi) # clear leftmost 6 boundary pixels
|
|
addl $0x54,%edi # set up pointer for loop
|
|
movl $0x31,%ecx # set up loop counter
|
|
rep ; stosl # slam out 49 aligned longs
|
|
andl $0xffff0f00,0x0(%edi) # clear 3 right boundary pixels
|
|
addl $0x28,%edi # move to next row
|
|
decl %edx # decrement # of rows left
|
|
jne loop # continue looping if appropriate
|
|
ret # we're done!
|
|
|
|
This code, when called with the proper values in its input registers,
|
|
will draw the entire rectangle. Note how the inner loop is merely a
|
|
"rep ; stosl"...it doesn't get much more concise than that! The
|
|
astute reader will know that on certain 80x86 processors "rep ; stosl"
|
|
is not the fastest possible way to set a range of memory. This is
|
|
true, but because our code generation is dynamic, in the future we can
|
|
tailor the specific code sequence generated to the processor on which
|
|
Executor is currently running. The blitter already does this when it
|
|
needs to emit a byte swap; on the 80486 and up we use the `bswap'
|
|
instruction, and on the 80386 (which doesn't support `bswap') we use a
|
|
sequence of rotates.
|
|
|
|
One thing you may notice is that the bit masks used to clear the
|
|
boundary pixels look strange. They are actually correct, since 80x86
|
|
processors are little endian.
|
|
|
|
Unlike some processors, such as the 68040, the 80x86 instruction and
|
|
data caches are always coherent. Consequently, no cache flushes need
|
|
to be performed before the dynamically created code can be executed.
|
|
|
|
Here's another example, this time drawn from a real application. The
|
|
program "Globe", by Paul Mercer, draws a spinning globe on the screen
|
|
as fast as it can. Each "globe frame" is a 128x128 Pixmap. Here is
|
|
the code that Executor generates and runs when Globe uses CopyBits to
|
|
transfer one frame to the screen at 8 bits per pixel:
|
|
|
|
loop: movl $0x20,%ecx ; set up loop counter for 32 longs
|
|
rep ; movsl ; copy one row (128 bytes)
|
|
addl $0xffffff00,%esi ; advance to previous src row
|
|
addl $0xfffffd00,%edi ; advance to previous dst row
|
|
decl %edx ; decrement # of rows remaining
|
|
jne loop
|
|
ret
|
|
|
|
Again the inner loop is very tight, just a "rep ; movsl" this time.
|
|
|
|
No matter how fast the generated code, if Executor spends too much
|
|
time generating that code then any speedup will be negated by the
|
|
increased time required for dynamic compilation. Consequently, the
|
|
dynamic compilation from Region to 80x86 code needs to be fast.
|
|
We solved this problem with a "meta-assembler" written in Perl.
|
|
|
|
The blitter operates on aligned longs in the destination bitmap. As
|
|
the compilation engine strides through the start/stop pairs from left
|
|
to right, it identifies which bits in each long are part of the Region
|
|
and determines which of several cases is appropriate:
|
|
|
|
- Some but not all bits in the current long are in the Region.
|
|
- All bits in the current long are in the Region.
|
|
- All bits in this long and the next long are in the Region.
|
|
- All bits in this long and the next two longs are in the Region.
|
|
- All bits in this long and the next three longs are in the Region.
|
|
- More than four contiguous longs are completely in the Region,
|
|
and the number of longs equals 0 mod 4.
|
|
- More than four contiguous longs are completely in the Region,
|
|
and the number of longs equals 1 mod 4.
|
|
- More than four contiguous longs are completely in the Region,
|
|
and the number of longs equals 2 mod 4.
|
|
- More than four contiguous longs are completely in the Region,
|
|
and the number of longs equals 3 mod 4.
|
|
|
|
|
|
The particular case encountered determines which function pointer to
|
|
load from a lookup table corresponding to the current drawing mode.
|
|
For example, the "patCopy" drawing mode has one table of function
|
|
pointers, "patXor" another. There are also some special case tables
|
|
for drawing patterns that are either all zero bits or all one bits.
|
|
|
|
The main blitter doesn't care what drawing mode is being used, since
|
|
it does all mode-specific work through the supplied function pointer
|
|
table.
|
|
|
|
Each function pointer points to a function that generates 80x86 code
|
|
for the appropriate case. For example, one function generates code
|
|
for a "patCopy" to three contiguous longs, one generates code for
|
|
"patXor" only to certain specified bits within one long, etc.
|
|
|
|
The blitter compilation engine marches through the Region scanline
|
|
from left to right, calling code generation functions as it goes. The
|
|
generated code is accrued into a 32-byte aligned buffer on the stack.
|
|
In this way, the blitter constructs a subroutine to draw the Region.
|
|
|
|
The compilation engine isn't very complicated. The tricky part is the
|
|
all of code generation subroutines, which need to be fast since they
|
|
are called so often and easy to write since there are so many of them.
|
|
For each drawing mode there's one for each case the compilation engine
|
|
cares about. For pattern drawing modes, there are separate
|
|
specialized routines for cases like patterns that can be entirely
|
|
expressed in one 32-bit value ("short/narrow") patterns, patterns
|
|
which can be expressed as one 32-bit value for each row, but which
|
|
vary per row ("tall/narrow"), as well as "wide" variants of both.
|
|
Beyond that, there are some versions specialized for 80486 and higher
|
|
processors (which have the "bswap" instruction).
|
|
|
|
This is where the Perl meta-assembler comes into play.
|
|
|
|
The meta-assembler takes as input an assembly language template, and
|
|
generates as output Pentium-scheduled assembly code that outputs an
|
|
80x86 binary for the input template. Got it? This can be a little
|
|
confusing, so a few examples are in order.
|
|
|
|
Here is perhaps the simplest template:
|
|
|
|
@meta copy_short_narrow_1
|
|
movl %eax,@param_offset@(%edi)
|
|
@endmeta
|
|
|
|
The meta-assembler processes that into this 80x86 assembly code:
|
|
|
|
.align 4,0x90
|
|
.globl _xdblt_copy_short_narrow_1
|
|
_xdblt_copy_short_narrow_1:
|
|
movw $0x8789,(%edi)
|
|
movl %eax,2(%edi)
|
|
addl $6,%edi
|
|
ret
|
|
|
|
This subroutine, which gets called by the blitter compilation engine,
|
|
generates the binary for the input assembly template. It writes the
|
|
raw binary for the movl instruction specified in the template to the
|
|
address specified by %edi.
|
|
|
|
Let's take a look at a far more complicated template. This template
|
|
handles the case where we want to bitwise OR a pattern to the
|
|
destination bitmap, and the number of longs to transfer equals zero
|
|
mod 4 (e.g. if the blitter wants to OR 36 longs to memory):
|
|
|
|
@meta or_short_narrow_many_mod_0
|
|
addl $@param_offset@,%edi
|
|
movl $@param_long_count_div_4@,%ecx
|
|
1: orl %eax,(%edi)
|
|
orl %eax,4(%edi)
|
|
orl %eax,8(%edi)
|
|
orl %eax,12(%edi)
|
|
addl $16,%edi
|
|
decl %ecx
|
|
jnz 1b
|
|
@lit leal (%eax,%edx,4),%ecx
|
|
@lit addl %ecx,edi_offset
|
|
@endmeta
|
|
|
|
The meta-assembler compiles that to this:
|
|
|
|
.align 4,0x90
|
|
.globl _xdblt_or_short_narrow_many_mod_0
|
|
_xdblt_or_short_narrow_many_mod_0:
|
|
movw $0xC781,(%edi)
|
|
movl %eax,2(%edi)
|
|
movl $0x47090709,11(%edi)
|
|
movb $0xB9,6(%edi)
|
|
movl $0x8470904,15(%edi)
|
|
movl $0x754910C7,23(%edi)
|
|
movl $0x830C4709,19(%edi)
|
|
movb $0xEF,27(%edi)
|
|
movl %edx,%ecx
|
|
shrl $2,%ecx
|
|
movl %ecx,7(%edi)
|
|
addl $28,%edi
|
|
leal (%eax,%edx,4),%ecx
|
|
addl %ecx,edi_offset
|
|
ret
|
|
|
|
Again, this mechanically generated subroutine generates the executable
|
|
80x86 binary for the "or_short_narrow_many_mod_0" template. It gets
|
|
called by the blitter compilation engine when it needs code to OR a
|
|
bunch of longs to memory.
|
|
|
|
Even though this subroutine is longer than the previous example, it
|
|
still doesn't take very long to execute. Furthermore, it only gets
|
|
called when the blitter has determined that many longs are to be ORed
|
|
to memory, so the time taken actually blitting to memory will
|
|
typically dwarf the time taken to execute these 15 instructions.
|
|
|
|
The meta-assembler is a Perl script that works by running numerous
|
|
syntactically modified versions of the assembly template through
|
|
"gas", the GNU assembler, and examining the output bytes to discover
|
|
which bits are fixed opcode bits and which bits correspond to
|
|
operands. Once it has figured out what goes where, it generates 80x86
|
|
assembly code which writes out the constant bytes and computes and
|
|
writes out the operand bytes. That code is run through a simple
|
|
Pentium instruction scheduler and the meta-assembler is done.
|
|
|
|
|
|
Portable:
|
|
|
|
|
|
Although the meta-assembler-based blitter works only on 80x86
|
|
processors, Executor itself can run on non-Intel processors. On other
|
|
CPUs (such as the 68040 used in the NeXTstation) Executor's blitter
|
|
works somewhat differently.
|
|
|
|
The basic idea is still the same: translate Region scanlines into an
|
|
efficient form once and then use that efficient form each time the
|
|
scanline gets drawn. This time, however, the "efficient form" is
|
|
processor independent, and the blitter is written entirely in C.
|
|
|
|
As is the case with the 80x86-specific blitter, the portable blitter
|
|
compilation engine examines scanline start/stop pairs and identifies
|
|
which of several cases is appropriate. One case is "output three
|
|
longs", another is "output only certain pixels within the current
|
|
long", and so on.
|
|
|
|
Like the 80x86-specific blitter, the particular case encountered
|
|
determines which entry in a lookup table will be used. But there the
|
|
similarity ends. The lookup tables contain pointers to C code labels
|
|
rather than to routines that generates 80x86 code on the fly.
|
|
|
|
[FIXME: the following would be best as a footnote]
|
|
|
|
"What the heck is a pointer to a C code label?", you ask? gcc (the
|
|
GNU C compiler) has a "pointer to label" extension to the C language
|
|
which makes the statement "&&my_label" evaluate to a "void *" that
|
|
points to the compiled code for "my_label:" within a C function.
|
|
This, combined with gcc's "goto void *" extension, allows C programs
|
|
to execute goto statements whose destinations are not known at compile
|
|
time.
|
|
|
|
Each scanline gets translated into an array of opcodes for the
|
|
"blitter opcode interpreter" (which will be described below). Each
|
|
opcode is stored in one of these C structs:
|
|
|
|
struct
|
|
{
|
|
const void *label; /* Pointer to C code to handle this opcode. */
|
|
int32 offset; /* Offset into scanline to start. */
|
|
int32 arg; /* Extra operand with different uses. */
|
|
};
|
|
|
|
For example, consider the case where the blitter wants to write out
|
|
five contiguous longs from a "simple" pattern starting 64 bytes into
|
|
the current row. In this case, "label" would equal
|
|
"&©_short_narrow_many_5", "offset" would equal 64, and "arg" would
|
|
equal 5.
|
|
|
|
|
|
The blitter opcode interpreter
|
|
|
|
|
|
The blitter opcode interpreter is machine generated C code created by
|
|
a Perl script when Executor is compiled. That Perl script takes as
|
|
input C code snippets that tell it how to handle particular drawing
|
|
modes, and produces as output C code for an interpreter.
|
|
|
|
Here is the template taken as input by the Perl script for the
|
|
"copy_short_narrow" case. This is the simple case where the pixels
|
|
for the pattern being displayed can be stored entirely within one
|
|
32-bit long (for example, solid white or solid black).
|
|
|
|
begin_mode copy_short_narrow max_unwrap
|
|
repeat @dst@ = v;
|
|
mask @dst@ = (@dst@ & ~arg) | (v & arg);
|
|
end_mode
|
|
|
|
The "repeat" field tells the Perl script what C code to generate for
|
|
the simple case where all pixels in the destination long are to be
|
|
affected. The "mask" case tells it what to do when it must only
|
|
modify certain bits in the target long and must leave others alone.
|
|
|
|
The generated interpreter takes as input an array of blitter opcode
|
|
structs, which it then proceeds to interpret once for each row to be
|
|
drawn.
|
|
|
|
Here is the section of the (machine-generated) interpreter that
|
|
handles the copy_short_narrow cases. Remember that each "blitter
|
|
opcode" is really just a pointer to one of these C labels. This code
|
|
would get used when filling a rectangle with a solid color.
|
|
|
|
copy_short_narrow_mask:
|
|
*dst = (*dst & ~arg) | (v & arg);
|
|
JUMP_TO_NEXT;
|
|
copy_short_narrow_many_loop:
|
|
dst += 8;
|
|
copy_short_narrow_many_8:
|
|
dst[0] = v;
|
|
copy_short_narrow_many_7:
|
|
dst[1] = v;
|
|
copy_short_narrow_many_6:
|
|
dst[2] = v;
|
|
copy_short_narrow_many_5:
|
|
dst[3] = v;
|
|
copy_short_narrow_many_4:
|
|
dst[4] = v;
|
|
copy_short_narrow_many_3:
|
|
dst[5] = v;
|
|
copy_short_narrow_many_2:
|
|
dst[6] = v;
|
|
copy_short_narrow_many_1:
|
|
dst[7] = v;
|
|
if ((arg -= 8) > 0)
|
|
goto copy_short_narrow_many_loop;
|
|
JUMP_TO_NEXT;
|
|
|
|
Note how the inner blitting loop is "unwrapped" for speed. A blitter
|
|
opcode would specify that 39 longs are to be output by making its
|
|
"arg" field be 39 and the "label" field point to
|
|
"copy_short_narrow_many_3", in the middle of the unwrapped loop. The
|
|
interpreter would jump there and loop until all of the pixels had been
|
|
written out, at 32 bytes per loop iteration. This is very fast,
|
|
especially for portable code.
|
|
|
|
Of course, if any other pixels needed to be drawn, there would be
|
|
additional blitter opcode structs telling the interpreter what to do.
|
|
The interpreter dispatches to the next opcode by executing the
|
|
"JUMP_TO_NEXT" macro, which automatically does a "goto" to the C label
|
|
that handles the next opcode.
|