mac-rom/OS/MemoryMgr/BlockMove.a
Elliot Nunn 0ba83392d4 Bring in CubeE sources
Resource forks are included only for .rsrc files. These are DeRezzed into their data fork. 'ckid' resources, from the Projector VCS, are not included.

The Tools directory, containing mostly junk, is also excluded.
2017-09-20 18:04:16 +08:00

696 lines
33 KiB
Plaintext

;
; File: BlockMove.a
;
; Contains: Here is the optimized Mac block move routine. It handles overlapping
; blocks by moving left or right when appropriate. It uses a MOVE.L
; loop when possible and uses a 12 register MOVEM.L loop for blocks
; longer than 124 bytes.
;
; Written by: Andy Hertzfeld
; Re-Written by: Gary Davidian
;
; Copyright: © 1982-1993 by Apple Computer, Inc., all rights reserved.
;
; Change History (most recent first):
;
; <SM5> 5/18/93 kc Roll in Gary's clean up and large overlap bug fix.
; <SM4> 4/23/93 kc Add "IF Supports24Bit" arround "and.l MaskBC,d1" in Block Move.
; <SM3> 11/10/92 CSS Update from Reality:
; <12> 10/23/92 pvh Reset d1 from d2 for byte copies less than 12 bytes (veryShort
; needs it)
; <11> 10/15/92 DTY Use D2 for the 12 byte or less check so we can preserve the trap
; word in D1, which is now checked to see if weÕre doing
; BlockMove, or BlockMoveData. BlockMoveData, which doesnÕt flush
; the cache, is signaled by having bit 9 (the immediate bit) set
; in the trap word.
; <SM2> 10/16/92 RB Removed the jCacheFlush call from the 68040 BlockMove, that code
; is never executed. (Horror has the same code, so don't bring it
; over!)
; <10> 2/12/92 JSM Moving to MemoryMgr directory, keeping all revisions.
; <9> 2/6/92 RB Fixed bug in 68040 version of BlockMove. The bug was introduced
; (by me) while moving the code from Terror into Reality.
; <8> 1/3/92 RB Rolled in 68040 version of BlockMove from Terror. It gets
; installed from StartInit.a when an 040 is present.
; <7> 10/18/91 JSM Remove 68000 versions.
; <6> 8/29/91 JSM Cleanup header.
; <5> 9/18/90 BG Removed <2>, <4>. 040s are behaving more reliably now.
; <4> 8/3/90 BG Added some EclipseNOPs for flakey 040s. Currently 040s require
; ANY instruction to separate two adjacent MOVEMs.
; <3> 7/17/90 dba change name of BlockMove routine so it does not conflict with
; the BlockMove glue
; <2> 6/18/90 CCH Added NOPs for flaky 68040's.
; <1.6> 7/15/89 GGD Added code alignment for better burst performance.
; <1.5> 2/22/89 GGD Made the 68020 version work in 32 bit mode as well as 24 bit
; mode, and work with move counts greater than 2**24.
; <1.4> 2/20/89 rwh re-spelled conditional in comment, won't show up in searches.
; <1.3> 12/6/88 GGD Fixed a incorrect register bug which in the 68000 decrementing
; path.
; <1.2> 11/17/88 GGD Re-written, and optimized, although algorithms still basicly the
; same for 68000 machines. Now the decrementing copy loops are
; only used if there is overlap between the source and dest, since
; the incrementing address modes are faster on the 68000, also
; optimized the MOVEM loop. 68000 version also includes the 68020
; version, and the correct version is chosen at start time based
; upon CpuFlag. This way accelerated machines will get faster
; moves and correct cache flushing. For 68020 got rid of single
; byte at a time move loop because the 020 can read words from odd
; addresses. Also longword aligned the destination to reduce bus
; cycles. Special cased very short (up to 12 byte) copies to read
; the entire source, and then write it out so that overlap need
; not be checked, and misalignment doesn't cost too much. It now
; only flushes the instruction cache when the length of the move
; is greater than 12 bytes.
; <1.1> 11/10/88 CCH Fixed Header.
; <1.0> 11/9/88 CCH Adding to EASE.
; <¥1.1> 9/23/88 CCH Got rid of inc.sum.d and empty nFiles
; <1.0> 2/10/88 BBM Adding file for the first time into EASEÉ
; <Cxxx> 10/16/87 rwh Port to Modern Victorian (onMvMac)
; <C690> 1/24/87 JTC Improvements for 020. With new longword alignment in 020
; <C668> 1/22/87 bbm made the code which flushed the cache a external vector.
; <C482> 12/4/86 bbm The code to flush the cache in blockmove needed to be set in
; conditionals for NuMac. <1.4>
; <C456> 11/22/86 bbm moved the code to flush the cache into blockmove, loadseg,
; unloadseg, and read. this might improve performance.
; <C206> 10/9/86 bbm Made file use mpw aincludes.
; 2/19/86 BBM Made some modifications to work under MPW
; 4/23/85 SC Insure D0=0 upon exit(I told you there was an bug)
; 4/20/85 JTC Added .DEF for routine name!
; 4/16/85 SC Rewrote with no space consideration and new blockmove
; statistics. On the average, this is 30% faster than old one.
; 1/29/85 EHB Check for negative lengths too!!
; 1/23/85 LAK Adapted for new equate files.
; 8/18/83 JTC Hacked for space by JTC
; 3/9/83 AJH Fixed bug by making it add long for moving right
; 10/31/82 AJH Integrated for ROM
; 8/26/82 AJH Modified it to support blocks > 64K
; 5/19/82 AJH fixed bug Malloy found in cleaning up after MOVEM loop
; 5/12/82 AJH re-organized things to make it cleaner
;
;
;_______________________________________________________________________
;
; BlockMove(SrcPtr,DstPtr: Ptr; nBytes: INTEGER);
;
; Here is the optimized Mac block move routine. It handles overlapping
; blocks by moving left or right when appropriate. It uses a MOVE.L
; loop when possible and uses a 12 register MOVEM.L loop for blocks
; longer than 124 bytes.
;
; It uses a register interface. A0 = source, A1 = destination, D0 = count.
; The addresses are firsted masked with $00FFFFFF.
;
; Register mask during computation:
; D0 = count
; D1 = destination pointer - source pointer
; A0 = source pointer
; A1 = destination pointer
;
; written by Andy Hertzfeld May 10, 1982
; re-written by Gary Davidian Nov 13, 1988
;
; Copyright Apple Computer, Inc. 1982-1989
; All Rights Reserved
;
; Ancient Modification History: (for historical purposes only, does not correspond to this code)
;
; 12-May-82 AJH re-organized things to make it cleaner
; 19-May-82 AJH fixed bug Malloy found in cleaning up after MOVEM loop
; 26-Aug-82 AJH Modified it to support blocks > 64K
; 31-Oct-82 AJH Integrated for ROM
; 09-Mar-83 AJH Fixed bug by making it add long for moving right
; 18-Aug-83 JTC Hacked for space by JTC
;
;_______________________________________________________________________
; 23 Jan 85 LAK Adapted for new equate files.
; 29-Jan-85 EHB Check for negative lengths too!!
; 16-Apr-85 SC Rewrote with no space consideration and new blockmove
; statistics. On the average, this is 30% faster than
; old one.
; 20 Apr 85 JTC Added .DEF for routine name!
; 23 Apr 85 SC Insure D0=0 upon exit(I told you there was an bug)
;_______________________________________________________________________
;
; Post Lonely Hearts
;_______________________________________________________________________
;
; <19feb86> BBM Made some modifications to work under MPW
;<C206/09oct86> bbm Made file use mpw aincludes.
;<C456/22nov86> bbm moved the code to flush the cache into blockmove, loadseg,
; unloadseg, and read. this might improve performance.
;<C482/04dec86> bbm The code to flush the cache in blockmove needed to be set
; in conditionals for NuMac. <1.4>
;<C668/22jan87> bbm made the code which flushed the cache a external vector.
;<C690/24jan87> JTC Improvements for 020. With new longword alignment in 020
; memory managers, itÕs worthwhile to take advantage of the fastest possible
; move, an unrolled dbra loop of MOVE.Ls. We arbitrarily choose 16 moves,
; since the dbra overhead looks like 6 cycles based on work with Ron H.
; The 16 cases of interest are moves from (4N+K) to (4M+J) where M and
; N are nonnegative and 0 ² K,J ² 3. As before, lump all even/odd cases
; into one big dbra thrash by bytes.
;<Cxxx/16oct87> rwh Port to Modern Victorian (onMvMac)
;_______________________________________________________________________
; Interesting numbers from psuedo-random sampling:
; 80%+ calls are for 1-31 bytes
; 95%+ calls are for less than 256 bytes
; On a 512K Mac, 20% of the calls come from memory manager
; On a 128K Mac, 40% of the calls come from memory manager
; => this probably should be JSR'ed to from Memory Manager
print off
LOAD 'StandardEqu.d'
print on
print nomdir
machine mc68040
BlockMoves proc
export __BlockMove ; Default version
export BlockMove68020 ; 68020 version (flushes cache too)
EXPORT BlockMove68040 ; 68040 version <8> rb
eject
align alignment
Loop16 move16 (a0)+,(a1)+ ; move 32 bytes, 16 at a time
sub.l d2,d0 ; adjust for the 32 bytes just moved
move16 (a0)+,(a1)+
bge.s Loop16 ; loop until count is -32É-1
jmp CopyTailInc(d0.w*2) ; copy the remaining bytes
Title 'BlockMove - Copy Tail Incrementing'
;_______________________________________________________________________
;
; Routine: CopyTailInc
; Inputs: A0 - source address
; A1 - destination address
; Outputs: D0 - error code (noErr)
; Destroys: A0, A1
;
; Function: Copy up to 31 bytes in incrementing address order using a direct
; sequence of moves. This routine returns to the BlockMove caller
; with D0=noErr.
;
; Calling Convention:
; D0 is setup with size-32, so that moving 0É31 bytes => d0 = -32É-1
; The trick is to double D0 and use it as an index into a table of
; branches to the appropriate code. Thanks to Steve Capps for all this.
;
; 68000 add.w d0,d0 68020 jmp CopyTailInc(d0.w*2)
; jmp CopyTailInc(d0.w)
;
;_______________________________________________________________________
TailInc30 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 30 bytes Incrementing
TailInc26 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 26 bytes Incrementing
TailInc22 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 22 bytes Incrementing
TailInc18 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 18 bytes Incrementing
TailInc14 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 14 bytes Incrementing
TailInc10 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 10 bytes Incrementing
TailInc06 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 6 bytes Incrementing
TailInc02 move.w (a0)+,(a1)+ ; 12 1 1 1 copy final 2 bytes Incrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
bra.s TailInc00 ; 10 2 0 0 copy final 0 bytes Incrementing
bra.s TailInc01 ; 10 2 0 0 copy final 1 byte Incrementing
bra.s TailInc02 ; 10 2 0 0 copy final 2 bytes Incrementing
bra.s TailInc03 ; 10 2 0 0 copy final 3 bytes Incrementing
bra.s TailInc04 ; 10 2 0 0 copy final 4 bytes Incrementing
bra.s TailInc05 ; 10 2 0 0 copy final 5 bytes Incrementing
bra.s TailInc06 ; 10 2 0 0 copy final 6 bytes Incrementing
bra.s TailInc07 ; 10 2 0 0 copy final 7 bytes Incrementing
bra.s TailInc08 ; 10 2 0 0 copy final 8 bytes Incrementing
bra.s TailInc09 ; 10 2 0 0 copy final 9 bytes Incrementing
bra.s TailInc10 ; 10 2 0 0 copy final 10 bytes Incrementing
bra.s TailInc11 ; 10 2 0 0 copy final 11 bytes Incrementing
bra.s TailInc12 ; 10 2 0 0 copy final 12 bytes Incrementing
bra.s TailInc13 ; 10 2 0 0 copy final 13 bytes Incrementing
bra.s TailInc14 ; 10 2 0 0 copy final 14 bytes Incrementing
bra.s TailInc15 ; 10 2 0 0 copy final 15 bytes Incrementing
bra.s TailInc16 ; 10 2 0 0 copy final 16 bytes Incrementing
bra.s TailInc17 ; 10 2 0 0 copy final 17 bytes Incrementing
bra.s TailInc18 ; 10 2 0 0 copy final 18 bytes Incrementing
bra.s TailInc19 ; 10 2 0 0 copy final 19 bytes Incrementing
bra.s TailInc20 ; 10 2 0 0 copy final 20 bytes Incrementing
bra.s TailInc21 ; 10 2 0 0 copy final 21 bytes Incrementing
bra.s TailInc22 ; 10 2 0 0 copy final 22 bytes Incrementing
bra.s TailInc23 ; 10 2 0 0 copy final 23 bytes Incrementing
bra.s TailInc24 ; 10 2 0 0 copy final 24 bytes Incrementing
bra.s TailInc25 ; 10 2 0 0 copy final 25 bytes Incrementing
bra.s TailInc26 ; 10 2 0 0 copy final 26 bytes Incrementing
bra.s TailInc27 ; 10 2 0 0 copy final 27 bytes Incrementing
bra.s TailInc28 ; 10 2 0 0 copy final 28 bytes Incrementing
bra.s TailInc29 ; 10 2 0 0 copy final 29 bytes Incrementing
bra.s TailInc30 ; 10 2 0 0 copy final 30 bytes Incrementing
TailInc31 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 31 bytes Incrementing
CopyTailInc ; copy final 0É31 bytes Incrementing
TailInc27 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 27 bytes Incrementing
TailInc23 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 23 bytes Incrementing
TailInc19 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 19 bytes Incrementing
TailInc15 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 15 bytes Incrementing
TailInc11 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 11 bytes Incrementing
TailInc07 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 7 bytes Incrementing
TailInc03 move.w (a0)+,(a1)+ ; 12 1 1 1 copy final 3 bytes Incrementing
TailInc01 move.b (a0)+,(a1)+ ; 12 1 1 1 copy final 1 byte Incrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
TailInc28 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 28 bytes Incrementing
TailInc24 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 24 bytes Incrementing
TailInc20 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 20 bytes Incrementing
TailInc16 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 16 bytes Incrementing
TailInc12 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 12 bytes Incrementing
TailInc08 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 8 bytes Incrementing
TailInc04 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 4 bytes Incrementing
TailInc00 moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
TailInc29 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 29 bytes Incrementing
TailInc25 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 25 bytes Incrementing
TailInc21 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 21 bytes Incrementing
TailInc17 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 17 bytes Incrementing
TailInc13 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 13 bytes Incrementing
TailInc09 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 9 bytes Incrementing
TailInc05 move.l (a0)+,(a1)+ ; 20 1 2 2 copy final 5 bytes Incrementing
move.b (a0)+,(a1)+ ; 12 1 1 1 copy final 1 byte Incrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
Title 'BlockMove - Copy Incrementing 68020 / 68030'
align alignment ; <1.6>
CopyInc68020
; Check to see if the source and destination overlap such that a copy using
; incrementing addresses would cause modification of the source, in which case
; we must copy starting at the end of the fields, using decrementing addresses.
move.l jCacheFlush,-(sp) ; flush the instruction cache when we exit
move.l a1,d1 ; get the destination address
moveq.l #-4,d2 ; setup mask for longword alignment
or.l d1,d2 ; d2= -1É-4, number of bytes to align
sub.l a0,d1 ; d1 := dest - src
IF Supports24Bit THEN
; Mask the address difference to 24 bits before comparing to length, this is needed for
; 24 bit mode where the upper byte of the addresses may have flags. This even works in
; 32 bit mode as long as the byte count doesn't exceed 24 bits, although in a 32 bit only
; system, this masking can be eliminated.
andi.l #$00FFFFFF,d1 ; strip off the high byte for 24 bit mode
cmp.l d0,d1 ; see if dest is before end of src
blo.s overlap ; if so, fields overlap, must copy decrementing addrs
ELSE
cmp.l d0,d1 ; see if dest is before end of src
blo.w CopyDec68020 ; if so, fields overlap, must copy decrementing addrs
ENDIF
; Align the destination to a longword boundary to reduce bus cycles. On a 68030
; with the data cache, the unaligned reads will cache, so that the same long word
; will not need to be read from RAM more than once. At this point we also know that
; we are moving more than 12 bytes, so that the 0É3 bytes of alignment will not
; exceed the length.
moveLongs ; <8> rb
jmp align(d2.w*2) ; jump to the alignment routine <8> rb
bra.s aligned ; -4, already longword aligned <8> rb
move.b (a0)+,(a1)+ ; -3, move 3 bytes to align
move.b (a0)+,(a1)+ ; -2, move 2 bytes to align
move.b (a0)+,(a1)+ ; -1, move 1 byte to align
align add.l d2,d0 ; adjust the byte count after alignment <8> rb
aligned moveq.l #32,d2 ; byte count adjustment, 32 bytes at a time <8> rb
sub.l d2,d0 ; setup for CopyTailInc
bge.s longsLoop ; if 32 or more bytes left, use longword loop <8> rb
jmp CopyTailInc(d0.w*2) ; copy the remaining bytes
longsLoop move.l (a0)+,(a1)+ ; move 32 bytes, 4 at a time <8> rb
move.l (a0)+,(a1)+ ; (using a DBRA in this loop would save
move.l (a0)+,(a1)+ ; 2 clocks per loop, but would incur
move.l (a0)+,(a1)+ ; overhead outside of the loop that would
move.l (a0)+,(a1)+ ; slow down the shorter and more frequent
move.l (a0)+,(a1)+ ; cases)
move.l (a0)+,(a1)+ ;
move.l (a0)+,(a1)+ ;
sub.l d2,d0 ; adjust for the 32 bytes just moved
bge.s longsLoop ; loop until count is -32É-1 <8> rb
jmp CopyTailInc(d0.w*2) ; copy the remaining bytes
IF Supports24Bit THEN
overlap cmpi.l #$007FFFFF,d0 ; see if length > 23 bits
bls.w CopyDec68020 ; if small len, don't care about address mode
; If length > 2**23 bits, we may be in 32 bit mode, and it is possible that <1.5>
; the source and destination differ by more than 2**23. In which case the <1.5>
; masked address difference above would not be accurate for detecting overlap.<1.5>
move.l a1,-(sp) ; push dst address
move.l a0,-(sp) ; push src address
btst.b #Systemis24bit,SystemInfo ; are we in 24 bit mode?
beq.s @32bit ; nope
clr.b 4(sp) ; strip dst address
clr.b (sp) ; strip src address
@32bit cmpm.l (sp)+,(sp)+ ; see if source <= dest (dec move needed)
bhs.w CopyDec68020 ; if so, start copy decrementing addrs <1.5>
jmp align(d2.w*2) ; go to the incr. alignment routine <1.5> <8> rb
ENDIF
Title 'BlockMove - 68020 / 68030 Block Move'
;_______________________________________________________________________
;
; Routine: BlockMove68020
; Inputs: A0 - source address
; A1 - destination address
; D0 - byte count
; Outputs: D0 - error code (noErr)
; Destroys: A0, A1, D1, D2
;
; Function: 68020/68030 block move routine, checks source, destination, and
; length to determine if the fields overlap in such a way that
; the direction of the copy should be changed to start at the
; end of the field use decrementing addresses. The default
; and preferred order is to copy from the beginning of the
; fields and use incrementing addresses. There is also a special
; case for moves of up to 12 bytes, in which case we read the entire
; source field into registers before writing it to the destination
; so that we don't need to check for overlap.
;
;_______________________________________________________________________
align alignment ; <1.6>
__BlockMove ; ENTER HERE
BlockMove68020 ; ENTER HERE
moveq.l #-12,d2 ; special case length is 1É12
add.l d0,d2 ; see if length <= 12
bgt.s CopyInc68020 ; if not, normal case, try incrementing first
veryShort tst.l d0 ; check byte count <8> rb
ble.s @done ; if count was negative or zero, we're done
; if count was 1É12, we can read the entire source into registers and then write it
; out to the destination without having to worry about overlap, since all reads are
; completed before any writes start. Since the count is so small, it's unlikely that
; we are moving code, so to improve performance, don't flush the instruction cache.
lsl.w #3,d0 ; convert byte count to bit count
addq.l #4,d2 ; see if 1É8 / 9É12
ble.s @short1to8 ; if count <= 8, check for 1É4 / 5É8
@short9to12 addq.l #8,a0 ; point to final 1É4 bytes from source
bfextu (a0){0:d0},d1 ; read final 1É4 bytes from source
move.l -(a0),d2 ; read middle 4 bytes from source
move.l -(a0),(a1)+ ; copy first 4 bytes from source to dest
@done5to8 move.l d2,(a1)+ ; write middle (or first) 4 bytes to destination
@done1to4 bfins d1,(a1){0:d0} ; write final 1É4 bytes to destination
@done moveq.l #noErr,d0 ; return success status
rts ; _BlockMove complete (don't flush cache)
@short1to8 addq.l #4,d2 ; see if 1É4 / 5É8
ble.s @short1to4 ; if count <= 4, go move it
@short5to8 move.l (a0)+,d2 ; read first 4 bytes from source
bfextu (a0){0:d0},d1 ; read final 1É4 bytes from source
bra.s @done5to8 ; write to destination and exit
@short1to4 bfextu (a0){0:d0},d1 ; read 1É4 bytes from source
bra.s @done1to4 ; write to destination and exit
Title 'BlockMove - 68040 MOVE16 optimizations'
; <8> rb
;========================== 68040 BlockMove =========================
; Terror History of 68040 BlockMove:
;
;
; <7> 5/10/91 RP Rolled in GGDs MOVE16 BlockMove patch.
; <6> 4/25/91 CCH Removed NOPs, since the bug was the unsetup DFC, not the 68040.
; <5> 4/25/91 CCH Set DFC register before doing a PTEST.
; <4> 4/24/91 CCH Added a NOP in front of the CPUSHL instructions since they
; currently don't always work in the D43B mask set of 68040's.
; <3> 4/21/91 CCH Fixed BlockMove patch to convert addresses from logical to
; physical when using CPUSHL to flush.
; <2> 4/2/91 CCH Don't optimize if VM is on.
; <1> 4/2/91 CCH first checked in
;_______________________________________________________________________
;
; Routine: BlockMove68040
; Inputs: A0 - source address
; A1 - destination address
; D0 - byte count
; D1 - trap word: DonÕt flush the cache if immediate bit is set.
; Outputs: D0 - error code (noErr)
; Destroys: A0, A1, D1, D2
;
; Function: 68040 block move routine. If length > 12 bytes, sets up
; cache flushing tail routine. Otherwise, tries to use MOVE16
; if possible. If MOVE16 not possible, uses standard 68030
; MOVE.L loops.
;
;_______________________________________________________________________
align alignment
IMPORT FlushCRangeForBM
BlockMove68040 ; ENTER HERE
moveq.l #-12,d2 ; special case length is 1É12 <11> Use D2 to preserve trap word in D1
add.l d0,d2 ; see if length <= 12 <11> Use D2 to preserve trap word in D1
ble.s veryShort ; use fast path if length <= 12
CopyInc68040
btst #noQueueBit,d1 ; <11> Check to see if the immediate bit is set.
bnz.s @checkAlignment ; <11> If itÕs set, donÕt flush the cache on exit
move.l d0,-(sp) ; Count parameter for FlushCRange
move.l a1,-(sp) ; Address parameter for FlushCRange
bsr.s @checkAlignment ; move the data
bra.l FlushCRangeForBM ; flush the caches
@checkAlignment
; Check to see if the source and destination overlap such that a copy using
; incrementing addresses would cause modification of the source, in which case
; we must copy starting at the end of the fields, using decrementing addresses.
move.l a1,d1 ; get the destination address
moveq.l #-4,d2 ; setup mask for longword alignment
or.l d1,d2 ; d2= -1É-4, number of bytes to align
sub.l a0,d1 ; d1 := dest - src
IF Supports24Bit THEN
; Mask the address difference to 24 bits before comparing to length, this is needed for
; 24 bit mode where the upper byte of the addresses may have flags. This even works in
; 32 bit mode as long as the byte count doesn't exceed 24 bits, although in a 32 bit only
; system, this masking can be eliminated.
andi.l #$00FFFFFF,d1 ; strip off the high byte for 24 bit mode
cmp.l d0,d1 ; see if dest is before end of src
blo.w overlap ; if so, fields overlap, must copy decrementing addrs
ELSE
cmp.l d0,d1 ; see if dest is before end of src
blo.w CopyDec68020 ; if so, fields overlap, must copy decrementing addrs
ENDIF
; Align the destination to a longword boundary to reduce bus cycles. On a 68030
; with the data cache, the unaligned reads will cache, so that the same long word
; will not need to be read from RAM more than once. At this point we also know that
; we are moving more than 12 bytes, so that the 0É3 bytes of alignment will not
; exceed the length.
cmpi.l #47,d0 ; see if long enough
blo.w moveLongs ; if not, don't even think of Move16s
andi.b #$0F,d1 ; see if 16 byte relative alignment
bne.w moveLongs ; if not, can't use Move16
; Align the source / destination to a 16 byte boundary to use MOVE16. At this
; point we also know that we are moving at least 47 bytes, so that the 0É15 bytes
; of alignment will not exceed the length.
@UseMove16 move.l a1,d1 ; get the destination address
neg.l d1 ; convert to byte count
moveq.l #15,d2 ; setup mask for longword alignment
and.l d1,d2 ; d2= 0É15, number of bytes to align
beq.s @Aligned16 ; exit if no alignment needed
sub.l d2,d0 ; update byte count
lsr.l #1,d2 ; test bit 0 of alignment count
bcc.s @Aligned2 ; skip if already byte aligned
move.b (a0)+,(a1)+ ; move 1 byte to force word alignment
@Aligned2 lsr.l #1,d2 ; test bit 1 of alignment count
bcc.s @Aligned4 ; skip if already long aligned
move.w (a0)+,(a1)+ ; move 1 word to force long alignment
@Aligned4 lsr.l #1,d2 ; test bit 2 of alignment count
bcc.s @Aligned8 ; skip if already long aligned
move.l (a0)+,(a1)+ ; move 1 long to force double alignment
tst.l d2 ; test bit 3 of alignment count
@Aligned8 beq.s @Aligned16 ; skip if already long aligned
move.l (a0)+,(a1)+ ; move 1 double to force quad alignment
move.l (a0)+,(a1)+
@Aligned16 moveq.l #32,d2 ; byte count adjustment, 32 bytes at a time
sub.l d2,d0 ; setup for CopyTailInc
bclr.l #4,d0 ; see if tail >= 16 bytes
nop ; sync the pipeline for defective 68040s
beq.w Loop16 ; if not, start copy
move16 (a0)+,(a1)+ ; extra move16 to reduce tail size
bra.w Loop16 ; align the loop on a cache line
Title 'BlockMove - Copy Tail Decrementing'
;_______________________________________________________________________
;
; Routine: CopyTailDec
; Inputs: A0 - source address+1
; A1 - destination address+1
; Outputs: D0 - error code (noErr)
; Destroys: A0, A1
;
; Function: Copy up to 31 bytes in decrementing address order using a direct
; sequence of moves. This routine returns to the BlockMove caller
; with D0=noErr.
;
; Calling Convention:
; D0 is setup with size-32, so that moving 0É31 bytes => d0 = -32É-1
; The trick is to double D0 and use it as an index into a table of
; branches to the appropriate code. Thanks to Steve Capps for all this.
;
; 68000 add.w d0,d0 68020 jmp CopyTailDec(d0.w*2)
; jmp CopyTailDec(d0.w)
;
;_______________________________________________________________________
align alignment ; <1.6>
TailDec30 move.l -(a0),-(a1) ; 22 1 2 2 copy final 30 bytes Decrementing
TailDec26 move.l -(a0),-(a1) ; 22 1 2 2 copy final 26 bytes Decrementing
TailDec22 move.l -(a0),-(a1) ; 22 1 2 2 copy final 22 bytes Decrementing
TailDec18 move.l -(a0),-(a1) ; 22 1 2 2 copy final 18 bytes Decrementing
TailDec14 move.l -(a0),-(a1) ; 22 1 2 2 copy final 14 bytes Decrementing
TailDec10 move.l -(a0),-(a1) ; 22 1 2 2 copy final 10 bytes Decrementing
TailDec06 move.l -(a0),-(a1) ; 22 1 2 2 copy final 6 bytes Decrementing
TailDec02 move.w -(a0),-(a1) ; 14 1 1 1 copy final 2 bytes Decrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
bra.s TailDec00 ; 10 2 0 0 copy final 0 bytes Decrementing
bra.s TailDec01 ; 10 2 0 0 copy final 1 byte Decrementing
bra.s TailDec02 ; 10 2 0 0 copy final 2 bytes Decrementing
bra.s TailDec03 ; 10 2 0 0 copy final 3 bytes Decrementing
bra.s TailDec04 ; 10 2 0 0 copy final 4 bytes Decrementing
bra.s TailDec05 ; 10 2 0 0 copy final 5 bytes Decrementing
bra.s TailDec06 ; 10 2 0 0 copy final 6 bytes Decrementing
bra.s TailDec07 ; 10 2 0 0 copy final 7 bytes Decrementing
bra.s TailDec08 ; 10 2 0 0 copy final 8 bytes Decrementing
bra.s TailDec09 ; 10 2 0 0 copy final 9 bytes Decrementing
bra.s TailDec10 ; 10 2 0 0 copy final 10 bytes Decrementing
bra.s TailDec11 ; 10 2 0 0 copy final 11 bytes Decrementing
bra.s TailDec12 ; 10 2 0 0 copy final 12 bytes Decrementing
bra.s TailDec13 ; 10 2 0 0 copy final 13 bytes Decrementing
bra.s TailDec14 ; 10 2 0 0 copy final 14 bytes Decrementing
bra.s TailDec15 ; 10 2 0 0 copy final 15 bytes Decrementing
bra.s TailDec16 ; 10 2 0 0 copy final 16 bytes Decrementing
bra.s TailDec17 ; 10 2 0 0 copy final 17 bytes Decrementing
bra.s TailDec18 ; 10 2 0 0 copy final 18 bytes Decrementing
bra.s TailDec19 ; 10 2 0 0 copy final 19 bytes Decrementing
bra.s TailDec20 ; 10 2 0 0 copy final 20 bytes Decrementing
bra.s TailDec21 ; 10 2 0 0 copy final 21 bytes Decrementing
bra.s TailDec22 ; 10 2 0 0 copy final 22 bytes Decrementing
bra.s TailDec23 ; 10 2 0 0 copy final 23 bytes Decrementing
bra.s TailDec24 ; 10 2 0 0 copy final 24 bytes Decrementing
bra.s TailDec25 ; 10 2 0 0 copy final 25 bytes Decrementing
bra.s TailDec26 ; 10 2 0 0 copy final 26 bytes Decrementing
bra.s TailDec27 ; 10 2 0 0 copy final 27 bytes Decrementing
bra.s TailDec28 ; 10 2 0 0 copy final 28 bytes Decrementing
bra.s TailDec29 ; 10 2 0 0 copy final 29 bytes Decrementing
bra.s TailDec30 ; 10 2 0 0 copy final 30 bytes Decrementing
TailDec31 move.l -(a0),-(a1) ; 22 1 2 2 copy final 31 bytes Decrementing
CopyTailDec ; copy final 0É31 bytes Decrementing
TailDec27 move.l -(a0),-(a1) ; 22 1 2 2 copy final 27 bytes Decrementing
TailDec23 move.l -(a0),-(a1) ; 22 1 2 2 copy final 23 bytes Decrementing
TailDec19 move.l -(a0),-(a1) ; 22 1 2 2 copy final 19 bytes Decrementing
TailDec15 move.l -(a0),-(a1) ; 22 1 2 2 copy final 15 bytes Decrementing
TailDec11 move.l -(a0),-(a1) ; 22 1 2 2 copy final 11 bytes Decrementing
TailDec07 move.l -(a0),-(a1) ; 22 1 2 2 copy final 7 bytes Decrementing
TailDec03 move.w -(a0),-(a1) ; 14 1 1 1 copy final 3 bytes Decrementing
TailDec01 move.b -(a0),-(a1) ; 14 1 1 1 copy final 1 byte Decrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
TailDec28 move.l -(a0),-(a1) ; 22 1 2 2 copy final 28 bytes Decrementing
TailDec24 move.l -(a0),-(a1) ; 22 1 2 2 copy final 24 bytes Decrementing
TailDec20 move.l -(a0),-(a1) ; 22 1 2 2 copy final 20 bytes Decrementing
TailDec16 move.l -(a0),-(a1) ; 22 1 2 2 copy final 16 bytes Decrementing
TailDec12 move.l -(a0),-(a1) ; 22 1 2 2 copy final 12 bytes Decrementing
TailDec08 move.l -(a0),-(a1) ; 22 1 2 2 copy final 8 bytes Decrementing
TailDec04 move.l -(a0),-(a1) ; 22 1 2 2 copy final 4 bytes Decrementing
TailDec00 moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
TailDec29 move.l -(a0),-(a1) ; 22 1 2 2 copy final 29 bytes Decrementing
TailDec25 move.l -(a0),-(a1) ; 22 1 2 2 copy final 25 bytes Decrementing
TailDec21 move.l -(a0),-(a1) ; 22 1 2 2 copy final 21 bytes Decrementing
TailDec17 move.l -(a0),-(a1) ; 22 1 2 2 copy final 17 bytes Decrementing
TailDec13 move.l -(a0),-(a1) ; 22 1 2 2 copy final 13 bytes Decrementing
TailDec09 move.l -(a0),-(a1) ; 22 1 2 2 copy final 9 bytes Decrementing
TailDec05 move.l -(a0),-(a1) ; 22 1 2 2 copy final 5 bytes Decrementing
move.b -(a0),-(a1) ; 14 1 1 1 copy final 1 byte Decrementing
moveq.l #noErr,d0 ; 4 1 0 0 return success status
rts ; 16 2 2 0 _BlockMove complete
Title 'BlockMove - Copy Decrementing 68020 / 68030'
; Align the destination to a longword boundary to reduce bus cycles. On a 68030
; with the data cache, the unaligned reads will cache, so that the same long word
; will not need to be read from RAM more than once. At this point we also know that
; we are moving more than 12 bytes, so that the 0É3 bytes of alignment will not
; exceed the length.
align alignment ; <1.6>
CopyDec68020
cmpa.l a0,a1 ; see if source=dest (no move needed)
beq.s TailDec00 ; if exactly the same, don't do the copy
adda.l d0,a0 ; point past end of source
adda.l d0,a1 ; point past end of destination
move.l a1,d1 ; get the destination address
moveq.l #3,d2 ; setup mask for longword alignment
and.l d1,d2 ; d2= 0É3, number of bytes to align
beq.s @aligned ; if already aligned, skip alignment
neg.l d2 ; negate to index backwards
jmp @align(d2.w*2) ; jump to the alignment routine
move.b -(a0),-(a1) ; -3, move 3 bytes to align
move.b -(a0),-(a1) ; -2, move 2 bytes to align
move.b -(a0),-(a1) ; -1, move 1 byte to align
@align add.l d2,d0 ; adjust the byte count after alignment
@aligned moveq.l #32,d2 ; byte count adjustment, 32 bytes at a time
sub.l d2,d0 ; setup for CopyTailDec
bge.s @longsLoop ; if 32 or more bytes left, use longword loop
jmp CopyTailDec(d0.w*2) ; copy the remaining bytes
@longsLoop move.l -(a0),-(a1) ; move 32 bytes, 4 at a time
move.l -(a0),-(a1) ; (using a DBRA in this loop would save
move.l -(a0),-(a1) ; 2 clocks per loop, but would incur
move.l -(a0),-(a1) ; overhead outside of the loop that would
move.l -(a0),-(a1) ; slow down the shorter and more frequent
move.l -(a0),-(a1) ; cases)
move.l -(a0),-(a1) ;
move.l -(a0),-(a1) ;
sub.l d2,d0 ; adjust for the 32 bytes just moved
bge.s @longsLoop ; loop until count is -32É-1
jmp CopyTailDec(d0.w*2) ; copy the remaining bytes
ENDP
END