90 Commits
1.0.5 ... 1.1.2

Author SHA1 Message Date
f4cf97f176 Merge pull request #34 from specke/master
Added option for unrolled copying of long matches
2019-10-22 21:52:48 +02:00
d5d788946e Added an option for unrolling long match copying
Usually useless and costing +57 bytes, this option can bring dramatic performance improvements on very compressible data dominated by long matches
2019-10-22 20:11:46 +01:00
e1e1276c96 Merge pull request #4 from emmanuel-marty/master
Re-sync with the main
2019-10-22 20:09:00 +01:00
16ac8c75af Add link to PDP-11 depackers by Ivan Gorodetsky 2019-10-22 17:13:05 +02:00
05d77095ca Bump version 2019-10-22 12:39:27 +02:00
b84fe7c332 Further increase LZSA2 ratio by ~0.1% on average 2019-10-22 12:37:46 +02:00
7dd039a152 Delete shrink_context.h 2019-10-22 12:37:16 +02:00
9f6ca2c25f Delete shrink_block_v2.c 2019-10-22 12:37:04 +02:00
dbaa3fa921 Further increase LZSA2 ratio by ~0.1% on average 2019-10-22 12:36:41 +02:00
2926ad8436 Remove unused #includes 2019-10-21 12:29:38 +02:00
d9156d3d2b Reduce LZSA1 token count by 2.5% on average 2019-10-19 13:10:41 +02:00
6adf92fc88 Merge pull request #33 from specke/master
-1 byte
2019-10-11 10:18:05 +02:00
96df02c532 Remove unused code 2019-10-11 09:20:36 +02:00
89f1664ae6 Remove unused code 2019-10-11 09:14:19 +02:00
c363ecf527 Remove unused code 2019-10-11 09:11:49 +02:00
5141ed7c59 Remove unused code 2019-10-11 09:11:41 +02:00
c77c666568 Remove unused code 2019-10-11 09:10:07 +02:00
115a81cb71 Remove unused code 2019-10-11 09:09:42 +02:00
4436f216ce Bump version 2019-10-11 09:06:50 +02:00
baa53f6889 Newly compressed LZSA2 files depack 0.7% faster 2019-10-11 09:05:58 +02:00
495a12216f -1 byte
Very slightly faster too
2019-10-11 00:23:43 +01:00
b5117c3dfe Fixes for -stats 2019-10-11 00:25:46 +02:00
f5ef6bf868 Merge pull request #32 from specke/master
Slightly faster unlzsa2_fast.asm for Z80
2019-10-11 00:22:12 +02:00
566e3a94e8 +0.2% speed
also, added an option to unroll LDIR for longer matches (which adds 38 bytes, but can be significantly faster for files with many long matches)
2019-10-10 22:50:23 +01:00
e3d7ec9c40 Merge pull request #3 from emmanuel-marty/master
Sync with E.Marty's branch
2019-10-10 22:46:53 +01:00
d209b73a30 Fix small bug 2019-10-10 14:42:08 +02:00
c1b18fb9fd Implement -stats 2019-10-09 18:20:22 +02:00
6ce846ff24 Speed up LZSA2 compression 2019-10-09 16:07:29 +02:00
b09dadb1c1 Small LZSA2 token count reduction 2019-10-09 13:16:29 +02:00
03f841d04f Speed up LZSA2 compression 2019-10-08 20:26:21 +02:00
44df8f3d2d Add early-out, speed LZSA2 compression up further 2019-10-08 16:23:33 +02:00
bfb383befd Speed up LZSA2 compression 2019-10-08 09:39:18 +02:00
39e2a90f81 Prevent small matchfinder inefficiency 2019-10-04 11:54:54 +02:00
33327201f7 Fix small LZSA2 token reduction inefficiency 2019-10-03 16:58:34 +02:00
29c6f3b2a3 Remove erroneous else statement 2019-09-26 19:13:09 +02:00
6a62f7d795 Update Z80 depackers changes history 2019-09-26 11:42:52 +02:00
681f78d1e8 Rename 2019-09-26 07:48:59 +02:00
8015ab8650 Rename 2019-09-26 07:48:44 +02:00
2f15298343 Rename 2019-09-26 07:48:33 +02:00
648a308d87 Rename 2019-09-26 07:48:19 +02:00
587a92f4ab Rename Z80 depackers, add version history to LZSA1 2019-09-26 07:47:43 +02:00
7d9135c548 Update Z80 decompressors 2019-09-25 08:09:18 +02:00
ac9de3795c Update Pareto frontier graph from spke 2019-09-25 07:56:47 +02:00
b4b4d39eff Fix newly added external link 2019-09-24 18:03:20 +02:00
cb46987628 Update stats and links 2019-09-24 18:02:24 +02:00
e55c80a475 Clean up use of MODESWITCH_PENALTY; bump version 2019-09-24 14:43:17 +02:00
de0ff5d3b0 Reduce memory used for compression 2019-09-24 00:21:17 +02:00
249b8a4c46 Increase LZSA2 ratio and use forward parser for -m 2019-09-23 20:24:50 +02:00
74040890fc Speed up LZSA2 compression (same binary output) 2019-09-23 16:58:03 +02:00
81e15d10f0 Add extra safety checks to LZSA2 token reducer 2019-09-22 20:41:09 +02:00
1869d85c1f Simplify LZSA1 token reducer (same binary output) 2019-09-22 20:34:08 +02:00
1a4f662360 Bump version 2019-09-20 12:26:16 +02:00
c12e20b7fb Improve LZSA2 compression ratio 2019-09-20 12:24:27 +02:00
51644ad2f9 Speed LZSA2 compression up further; fix typo 2019-09-19 17:18:37 +02:00
1495b27f69 Speed up LZSA1 compression with forward arrivals 2019-09-19 12:57:39 +02:00
c052a188f2 Reduce LZSA2 forward arrivals memory use 2019-09-19 11:46:03 +02:00
e4076e4090 Speed LZSA2 compression up; tiny ratio increase 2019-09-19 00:11:26 +02:00
8b7d0ab04d Increase LZSA2 ratio. Decrease token count 2019-09-17 08:10:52 +02:00
b1da9c1aee Add extra bound checks in C decompressors 2019-09-12 16:19:14 +02:00
b92a003338 Merge pull request #29 from francois-berder/master
Various improvements -- thank you!
2019-08-28 13:50:00 +02:00
4f2d7da136 Fix main return value if compressing
Signed-off-by: Francois Berder <18538310+francois-berder@users.noreply.github.com>
2019-08-28 09:41:54 +01:00
a318ac2f83 Fix memory leak in comparestream_open
Signed-off-by: Francois Berder <18538310+francois-berder@users.noreply.github.com>
2019-08-28 09:40:49 +01:00
da67938978 Set dictionnary to NULL in lzsa_dictionary_free
Signed-off-by: Francois Berder <18538310+francois-berder@users.noreply.github.com>
2019-08-28 09:39:07 +01:00
2d213bcff1 Bump version number 2019-08-27 13:18:23 +02:00
9de7e930e9 Faster LZSA1 z80 decompression 2019-08-27 13:16:20 +02:00
ef259e6867 Implement forward arrivals optimal parsers 2019-08-27 00:51:34 +02:00
90b4da64d1 Merge pull request #27 from uniabis/twobytesshorter
2bytes shorter
2019-08-26 23:49:27 +02:00
a807344343 2bytes shorter 2019-08-22 12:55:55 +09:00
27d0fe4e83 Merge pull request #26 from arm-in/patch-1
Update README.md
2019-08-06 20:54:24 +02:00
f8e445a98a Update README.md
Now 67 bytes with commit be30cae636
2019-08-06 20:15:59 +02:00
0e567bde47 Merge pull request #25 from specke/master
-1 byte
2019-08-06 20:03:52 +02:00
be30cae636 -1 byte
slightly slower, but this is the size-optimized branch
2019-08-06 12:36:27 +01:00
1b368e71ad Fix comments, header single inclusion defines 2019-08-04 16:42:30 +02:00
d98220ff42 Merge pull request #24 from specke/master
New Pareto frontier graph
2019-08-01 16:51:29 +02:00
d412433df4 New Pareto frontier graph
Shows improved performance of the new Z80 decompressors, esp. due to the improvements by uniabis
2019-08-01 15:26:53 +01:00
77c1492310 Merge pull request #23 from specke/master
New faster and shorter decompressors
2019-08-01 16:19:24 +02:00
44bff39de3 New faster and shorter decompressors
This update is mostly about better integration of improvements by uniabis, with spke contributing several smaller size optimizations.
2019-08-01 15:07:14 +01:00
3c690b04f5 Merge pull request #22 from specke/master
incorporated improvements by uniabis
2019-08-01 01:34:46 +02:00
e7bb1faece Merge branch 'master' into master 2019-07-31 23:24:30 +01:00
e48d2dafde Merge pull request #21 from uniabis/hd64180 - up to 3% speedup on Z80!
hd64180 support on z80 unpacker
2019-07-31 23:57:23 +02:00
51ef92cdab incorporated improvements by uniabis
also, slightly faster decompression for fast packer in backwards mode
2019-07-31 20:42:47 +01:00
8d0528fddc hd64180 support
a bit faster, a bit smaller
2019-07-31 01:39:27 +09:00
b3aae36ecc Bump version 2019-07-28 00:25:51 +02:00
8787b1c3d8 Merge pull request #20 from specke/master (should be already fixed now..)
fix a bug in the backward version of unlzsa2_fast_v1.asm
2019-07-27 15:50:38 +02:00
0a04796b19 Fix for z80 LZSA2 fast backward depacker 2019-07-27 15:39:44 +02:00
ac3bf78273 fix a bug in the backward version of unlzsa2_fast_v1.asm
an INC HL slipped through
2019-07-27 14:14:54 +01:00
82edcb8bb5 Fix literal runs that are multiple of 256 bytes 2019-07-27 01:35:46 +02:00
b613d01565 Test incompressible data with raw blocks 2019-07-26 13:30:41 +02:00
ae4cc12aed Use ACME syntax 2019-07-26 12:31:26 +02:00
316dfdcdce Fix comments, remove unused vars 2019-07-26 01:12:17 +02:00
31 changed files with 1374 additions and 1330 deletions

View File

@ -18,7 +18,6 @@ OBJS += $(OBJDIR)/src/expand_context.o
OBJS += $(OBJDIR)/src/expand_inmem.o
OBJS += $(OBJDIR)/src/expand_streaming.o
OBJS += $(OBJDIR)/src/frame.o
OBJS += $(OBJDIR)/src/hashmap.o
OBJS += $(OBJDIR)/src/matchfinder.o
OBJS += $(OBJDIR)/src/shrink_block_v1.o
OBJS += $(OBJDIR)/src/shrink_block_v2.o

View File

@ -12,10 +12,11 @@ The compression formats give the user choices that range from decompressing fast
Compression ratio comparison between LZSA and other optimal packers, for a workload composed of ZX Spectrum and C64 files:
Bytes Ratio Decompression speed vs. LZ4
LZSA2 685610 53,18% <------ 75%
LZSA2 676681 52,49% <------ 75%
MegaLZ 4.89 679041 52,68% Not measured
ZX7 687133 53,30% 47,73%
LZ5 1.4.1 727107 56,40% 75%
LZSA1 736169 57,11% <------ 90%
LZSA1 735785 57,08% <------ 90%
Lizard -29 776122 60,21% Not measured
LZ4_HC -19 -B4 -BD 781049 60,59% 100%
Uncompressed 1289127 100% N/A
@ -23,13 +24,13 @@ Compression ratio comparison between LZSA and other optimal packers, for a workl
Performance over well-known compression corpus files:
Uncompressed LZ4_HC -19 -B4 -BD LZSA1 LZSA2
Canterbury 2810784 935827 (33,29%) 855044 (30,42%) 789075 (28,07%)
Silesia 211938580 77299725 (36,47%) 73707039 (34,78%) 69983184 (33,02%)
Calgary 3251493 1248780 (38,40%) 1196448 (36,80%) 1125462 (34,61%)
Large 11159482 3771025 (33,79%) 3648420 (32,69%) 3528725 (31,62%)
enwik9 1000000000 371841591 (37,18%) 355360717 (35,54%) 337063553 (33,71%)
Canterbury 2810784 935827 (33,29%) 850792 (30,27%) 770877 (27,43%)
Silesia 211938580 77299725 (36,47%) 73706340 (34,78%) 68928564 (32,52%)
Calgary 3251493 1248780 (38,40%) 1192123 (36,67%) 1110290 (34,15%)
Large 11159482 3771025 (33,79%) 3648393 (32,69%) 3519480 (31,54%)
enwik9 1000000000 371841591 (37,18%) 355360043 (35,54%) 334900611 (33,49%)
As an example of LZSA1's simplicity, a size-optimized decompressor on Z80 has been implemented in 69 bytes.
As an example of LZSA1's simplicity, a size-optimized decompressor on Z80 has been implemented in 67 bytes.
The compressor is approximately 2X slower than LZ4_HC but compresses better while maintaining similar decompression speeds and decompressor simplicity.
@ -41,6 +42,7 @@ The main differences between LZSA1 and the LZ4 compression format are:
As for LZSA2:
* 5-bit, 9-bit, 13-bit and 16-bit match offsets, using nibble encoding
* Rep-matches
* Shorter encoding of lengths, also using nibbles
* A minmatch of 2 bytes
* No (slow) bit-packing. LZSA2 uses byte alignment in the hot path, and nibbles.
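The nibble encoding mentioned in the list above can be illustrated with a small sketch. This is a hypothetical reader, not the reference implementation: the order in which the two halves of a byte are consumed is an assumption here, and real decompressors interleave nibble fetches with the rest of the stream.

```python
# Hypothetical nibble reader: 4-bit fields are consumed one at a time, a
# byte is fetched from the stream only on every other request, and the
# spare nibble is cached until the next call. (High nibble first is an
# assumption for illustration.)
class NibbleReader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.cached = None  # spare nibble, or None if none is buffered

    def next_nibble(self) -> int:
        if self.cached is not None:
            n, self.cached = self.cached, None
            return n
        b = self.data[self.pos]
        self.pos += 1
        self.cached = b & 0x0F   # low nibble kept for the next call
        return b >> 4            # high nibble returned first

r = NibbleReader(bytes([0xAB, 0xCD]))
assert [r.next_nibble() for _ in range(4)] == [0xA, 0xB, 0xC, 0xD]
```

Packing two 4-bit fields per byte is why the format can use shorter length and offset encodings without resorting to general bit-packing.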
@ -51,6 +53,8 @@ Inspirations:
* [LZ5/Lizard](https://github.com/inikep/lizard) by Przemyslaw Skibinski and Yann Collet.
* The suffix array intervals in [Wimlib](https://wimlib.net/git/?p=wimlib;a=tree) by Eric Biggers.
* ZX7 by Einar Saukas
* [apc](https://github.com/svendahl/cap) by Sven-Åke Dahl
* [Charles Bloom](http://cbloomrants.blogspot.com/)'s compression blog
License:
@ -63,6 +67,12 @@ License:
* 6502 and 8088 size-optimized improvements by [Peter Ferrie](https://github.com/peterferrie)
* 8088 speed-optimized decompressor by [Jim Leonard](https://github.com/mobygamer)
External links:
* [i8080 decompressors](https://gitlab.com/ivagor/lzsa8080/tree/master) by Ivan Gorodetsky
* [PDP-11 decompressors](https://gitlab.com/ivagor/lzsa8080/tree/master/PDP11) also by Ivan Gorodetsky
* LZSA's page on [Pouet](https://www.pouet.net/prod.php?which=81573)
# Compressed format
Decompression code is provided for common 8-bit CPUs such as the Z80 and 6502. However, if you would like to write your own, or to understand the encoding, LZSA compresses data to a format that is fast and simple to decompress on 8-bit CPUs. It is encoded either as a stream of blocks or as a single raw block, depending on command-line settings. The encoding is deliberately designed to avoid operations that are complicated on 8-bit CPUs (such as 16-bit math).
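As a rough illustration of that simplicity, the LZSA1 token byte `O|LLL|MMMM` described in the Z80 decompressor comments elsewhere in this changeset can be split with plain 8-bit masks and shifts. This is a hedged sketch of field extraction only (the extended-count codes for `LLL`=7 and `MMMM`=15 are not handled), not the reference decoder.

```python
# Split an LZSA1 token byte "O|LLL|MMMM" into its fields, per the
# decompressor comments: O selects a 1- or 2-byte offset, LLL is the
# literal count (7 = extended count follows), MMMM encodes the match
# length with a minmatch of 3 (15 = extended length follows).
def split_token(token: int):
    two_byte_offset = bool(token & 0x80)  # O bit
    literals = (token >> 4) & 0x07        # LLL field
    match_len = (token & 0x0F) + 3        # MMMM field + minmatch
    return two_byte_offset, literals, match_len

assert split_token(0x83) == (True, 0, 6)
```

Every step is a mask, shift, or small add on a single byte, which is exactly the kind of work an 8-bit CPU does cheaply.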

View File

@ -185,7 +185,6 @@
<ClInclude Include="..\src\format.h" />
<ClInclude Include="..\src\frame.h" />
<ClInclude Include="..\src\expand_inmem.h" />
<ClInclude Include="..\src\hashmap.h" />
<ClInclude Include="..\src\lib.h" />
<ClInclude Include="..\src\libdivsufsort\include\divsufsort_config.h" />
<ClInclude Include="..\src\libdivsufsort\include\divsufsort.h" />
@ -207,7 +206,6 @@
<ClCompile Include="..\src\expand_block_v2.c" />
<ClCompile Include="..\src\frame.c" />
<ClCompile Include="..\src\expand_inmem.c" />
<ClCompile Include="..\src\hashmap.c" />
<ClCompile Include="..\src\libdivsufsort\lib\divsufsort.c" />
<ClCompile Include="..\src\libdivsufsort\lib\sssort.c" />
<ClCompile Include="..\src\libdivsufsort\lib\trsort.c" />

View File

@ -84,9 +84,6 @@
<ClInclude Include="..\src\libdivsufsort\include\divsufsort_config.h">
<Filter>Fichiers sources\libdivsufsort\include</Filter>
</ClInclude>
<ClInclude Include="..\src\hashmap.h">
<Filter>Fichiers sources</Filter>
</ClInclude>
</ItemGroup>
<ItemGroup>
<ClCompile Include="..\src\libdivsufsort\lib\divsufsort.c">
@ -146,8 +143,5 @@
<ClCompile Include="..\src\libdivsufsort\lib\divsufsort_utils.c">
<Filter>Fichiers sources\libdivsufsort\lib</Filter>
</ClCompile>
<ClCompile Include="..\src\hashmap.c">
<Filter>Fichiers sources</Filter>
</ClCompile>
</ItemGroup>
</Project>

View File

@ -26,7 +26,6 @@
0CADC64722AAD8EB003E9821 /* expand_context.c in Sources */ = {isa = PBXBuildFile; fileRef = 0CADC62F22AAD8EB003E9821 /* expand_context.c */; };
0CADC64822AAD8EB003E9821 /* shrink_block_v2.c in Sources */ = {isa = PBXBuildFile; fileRef = 0CADC63022AAD8EB003E9821 /* shrink_block_v2.c */; };
0CADC64A22AB8DAD003E9821 /* divsufsort_utils.c in Sources */ = {isa = PBXBuildFile; fileRef = 0CADC64922AB8DAD003E9821 /* divsufsort_utils.c */; };
0CADC69622C8A420003E9821 /* hashmap.c in Sources */ = {isa = PBXBuildFile; fileRef = 0CADC69522C8A41F003E9821 /* hashmap.c */; };
/* End PBXBuildFile section */
/* Begin PBXCopyFilesBuildPhase section */
@ -81,8 +80,6 @@
0CADC63022AAD8EB003E9821 /* shrink_block_v2.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; name = shrink_block_v2.c; path = ../../src/shrink_block_v2.c; sourceTree = "<group>"; };
0CADC64922AB8DAD003E9821 /* divsufsort_utils.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; path = divsufsort_utils.c; sourceTree = "<group>"; };
0CADC64B22AB8DC3003E9821 /* divsufsort_config.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = divsufsort_config.h; sourceTree = "<group>"; };
0CADC69422C8A41F003E9821 /* hashmap.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = hashmap.h; path = ../../src/hashmap.h; sourceTree = "<group>"; };
0CADC69522C8A41F003E9821 /* hashmap.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; name = hashmap.c; path = ../../src/hashmap.c; sourceTree = "<group>"; };
/* End PBXFileReference section */
/* Begin PBXFrameworksBuildPhase section */
@ -130,8 +127,6 @@
0CADC62422AAD8EB003E9821 /* format.h */,
0CADC5F322AAD8EB003E9821 /* frame.c */,
0CADC62C22AAD8EB003E9821 /* frame.h */,
0CADC69522C8A41F003E9821 /* hashmap.c */,
0CADC69422C8A41F003E9821 /* hashmap.h */,
0CADC5F222AAD8EB003E9821 /* lib.h */,
0CADC5FC22AAD8EB003E9821 /* libdivsufsort */,
0CADC62222AAD8EB003E9821 /* lzsa.c */,
@ -240,7 +235,6 @@
isa = PBXSourcesBuildPhase;
buildActionMask = 2147483647;
files = (
0CADC69622C8A420003E9821 /* hashmap.c in Sources */,
0CADC64822AAD8EB003E9821 /* shrink_block_v2.c in Sources */,
0CADC63D22AAD8EB003E9821 /* sssort.c in Sources */,
0CADC64322AAD8EB003E9821 /* expand_block_v2.c in Sources */,

View File

@ -49,10 +49,10 @@ DECODE_TOKEN
AND #$70 ; isolate literals count
BEQ NO_LITERALS ; skip if no literals to copy
LSR A ; shift literals count into place
LSR A
LSR A
LSR A
LSR ; shift literals count into place
LSR
LSR
LSR
CMP #$07 ; LITERALS_RUN_LEN?
BCC PREPARE_COPY_LITERALS ; if not, count is directly embedded in token
@ -71,7 +71,8 @@ LARGE_VARLEN_LITERALS ; handle 16 bits literals count
; literals count = directly these 16 bits
JSR GETLARGESRC ; grab low 8 bits in X, high 8 bits in A
TAY ; put high 8 bits in Y
BYTE $A9 ; mask TAX (faster than BCS)
TXA
PREPARE_COPY_LITERALS
TAX
BEQ COPY_LITERALS
@ -91,7 +92,7 @@ NO_LITERALS
JSR GETSRC ; get 8 bit offset from stream in A
TAX ; save for later
LDA #$0FF ; high 8 bits
LDA #$FF ; high 8 bits
BNE GOT_OFFSET ; go prepare match
; (*like JMP GOT_OFFSET but shorter)
@ -110,7 +111,7 @@ COPY_MATCH_LOOP
LDA $AAAA ; get one byte of backreference
JSR PUTDST ; copy to destination
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression -- put backreference bytes backward
@ -120,7 +121,7 @@ ifdef BACKWARD_DECOMPRESS
GETMATCH_DONE
DEC COPY_MATCH_LOOP+1
else
} else {
; Forward decompression -- put backreference bytes forward
@ -129,7 +130,7 @@ else
INC COPY_MATCH_LOOP+2
GETMATCH_DONE
endif
}
DEX
BNE COPY_MATCH_LOOP
@ -142,7 +143,7 @@ GET_LONG_OFFSET ; handle 16 bit offset:
GOT_OFFSET
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression - subtract match offset
@ -160,7 +161,7 @@ OFFSHI = *+1
STA COPY_MATCH_LOOP+2 ; store high 8 bits of address
SEC
else
} else {
; Forward decompression - add match offset
@ -176,7 +177,7 @@ OFFSHI = *+1
ADC PUTDST+2
STA COPY_MATCH_LOOP+2 ; store high 8 bits of address
endif
}
PLA ; retrieve token from stack again
AND #$0F ; isolate match len (MMMM)
@ -200,7 +201,7 @@ endif
DECOMPRESSION_DONE
RTS
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression -- get and put bytes backward
@ -235,7 +236,7 @@ GETSRC_DONE
PLA
RTS
else
} else {
; Forward decompression -- get and put bytes forward
@ -266,4 +267,4 @@ LZSA_SRC_HI = *+2
GETSRC_DONE
RTS
endif
}

View File

@ -53,9 +53,9 @@ DECODE_TOKEN
AND #$18 ; isolate literals count (LL)
BEQ NO_LITERALS ; skip if no literals to copy
LSR A ; shift literals count into place
LSR A
LSR A
LSR ; shift literals count into place
LSR
LSR
CMP #$03 ; LITERALS_RUN_LEN_V2?
BCC PREPARE_COPY_LITERALS ; if less, count is directly embedded in token
@ -102,7 +102,7 @@ NO_LITERALS
; 00Z: 5 bit offset
LDX #$0FF ; set offset bits 15-8 to 1
LDX #$FF ; set offset bits 15-8 to 1
JSR GETCOMBINEDBITS ; rotate Z bit into bit 0, read nibble for bits 4-1
ORA #$E0 ; set bits 7-5 to 1
@ -142,7 +142,7 @@ GOT_OFFSET_LO
STX OFFSHI ; store high byte of match offset
REP_MATCH
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression - subtract match offset
@ -157,7 +157,7 @@ OFFSHI = *+1
STA COPY_MATCH_LOOP+2 ; store high 8 bits of address
SEC
else
} else {
; Forward decompression - add match offset
@ -171,7 +171,7 @@ OFFSHI = *+1
ADC PUTDST+2
STA COPY_MATCH_LOOP+2 ; store high 8 bits of address
endif
}
PLA ; retrieve token from stack again
AND #$07 ; isolate match len (MMM)
@ -208,7 +208,7 @@ COPY_MATCH_LOOP
LDA $AAAA ; get one byte of backreference
JSR PUTDST ; copy to destination
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression -- put backreference bytes backward
@ -218,7 +218,7 @@ ifdef BACKWARD_DECOMPRESS
GETMATCH_DONE
DEC COPY_MATCH_LOOP+1
else
} else {
; Forward decompression -- put backreference bytes forward
@ -227,7 +227,7 @@ else
INC COPY_MATCH_LOOP+2
GETMATCH_DONE
endif
}
DEX
BNE COPY_MATCH_LOOP
@ -266,7 +266,7 @@ HAS_NIBBLES
AND #$0F ; isolate low 4 bits of nibble
RTS
ifdef BACKWARD_DECOMPRESS
!ifdef BACKWARD_DECOMPRESS {
; Backward decompression -- get and put bytes backward
@ -301,7 +301,7 @@ GETSRC_DONE
PLA
RTS
else
} else {
; Forward decompression -- get and put bytes forward
@ -332,4 +332,5 @@ LZSA_SRC_HI = *+2
GETSRC_DONE
RTS
endif
}

View File

@ -1,5 +1,15 @@
;
; Speed-optimized LZSA decompressor by spke (v.1 03-25/04/2019, 110 bytes)
; Speed-optimized LZSA1 decompressor by spke & uniabis (111 bytes)
;
; ver.00 by spke for LZSA 0.5.4 (03-24/04/2019, 134 bytes);
; ver.01 by spke for LZSA 0.5.6 (25/04/2019, 110(-24) bytes, +0.2% speed);
; ver.02 by spke for LZSA 1.0.5 (24/07/2019, added support for backward decompression);
; ver.03 by uniabis (30/07/2019, 109(-1) bytes, +3.5% speed);
; ver.04 by spke (31/07/2019, small re-organization of macros);
; ver.05 by uniabis (22/08/2019, 107(-2) bytes, same speed);
; ver.06 by spke for LZSA 1.0.7 (27/08/2019, 111(+4) bytes, +2.1% speed);
; ver.07 by spke for LZSA 1.1.0 (25/09/2019, added full revision history);
; ver.08 by spke for LZSA 1.1.2 (22/10/2019, re-organized macros and added an option for unrolled copying of long matches)
;
; The data must be compressed using the command line compressor by Emmanuel Marty
; The compression is done as follows:
@ -12,7 +22,7 @@
;
; ld hl,FirstByteOfCompressedData
; ld de,FirstByteOfMemoryForDecompressedData
; call DecompressLZSA
; call DecompressLZSA1
;
; Backward compression is also supported; you can compress files backward using:
;
@ -22,11 +32,11 @@
;
; ld hl,LastByteOfCompressedData
; ld de,LastByteOfMemoryForDecompressedData
; call DecompressLZSA
; call DecompressLZSA1
;
; (do not forget to uncomment the BACKWARD_DECOMPRESS option in the decompressor).
;
; Of course, LZSA compression algorithm is (c) 2019 Emmanuel Marty,
; Of course, LZSA compression algorithms are (c) 2019 Emmanuel Marty,
; see https://github.com/emmanuel-marty/lzsa for more information
;
; Drop me an email if you have any comments/ideas/suggestions: zxintrospec@gmail.com
@ -47,52 +57,63 @@
; misrepresented as being the original software.
; 3. This notice may not be removed or altered from any source distribution.
; DEFINE UNROLL_LONG_MATCHES ; uncomment for faster decompression of very compressible data (+57 bytes)
; DEFINE BACKWARD_DECOMPRESS
IFDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
or a : sbc hl,de
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ELSE
IFNDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
inc hl
ENDM
MACRO ADD_OFFSET
add hl,de
ex de,hl : add hl,de
ENDM
MACRO BLOCKCOPY
MACRO COPY1
ldi
ENDM
MACRO COPYBC
ldir
ENDM
ELSE
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
ex de,hl : ld a,e : sub l : ld l,a
ld a,d : sbc h : ld h,a ; 4*4+3*4 = 28t / 7 bytes
ENDM
MACRO COPY1
ldd
ENDM
MACRO COPYBC
lddr
ENDM
ENDIF
@DecompressLZSA:
@DecompressLZSA1:
ld b,0 : jr ReadToken
NoLiterals: xor (hl) : NEXT_HL
push de : ld e,(hl) : NEXT_HL : jp m,LongOffset
NoLiterals: xor (hl)
push de : NEXT_HL : ld e,(hl) : jp m,LongOffset
; short matches have length 0+3..14+3
ShortOffset: ld d,#FF : add 3 : cp 15+3 : jr nc,LongerMatch
; placed here this saves a JP per iteration
CopyMatch: ld c,a
.UseC ex (sp),hl : push hl ; BC = len, DE = offset, HL = dest, SP ->[dest,src]
ADD_OFFSET : pop de ; BC = len, DE = dest, HL = dest-offset, SP->[src]
BLOCKCOPY : pop hl ; BC = 0, DE = dest, HL = src
.UseC NEXT_HL : ex (sp),hl ; BC = len, DE = offset, HL = dest, SP ->[dest,src]
ADD_OFFSET ; BC = len, DE = dest, HL = dest-offset, SP->[src]
COPY1 : COPY1 : COPYBC ; BC = 0, DE = dest
.popSrc pop hl ; HL = src
ReadToken: ; first a byte token "O|LLL|MMMM" is read from the stream,
; where LLL is the number of literals and MMMM is
@ -102,44 +123,80 @@ ReadToken: ; first a byte token "O|LLL|MMMM" is read from the stream,
cp #70 : jr z,MoreLiterals ; LLL=7 means 7+ literals...
rrca : rrca : rrca : rrca ; LLL<7 means 0..6 literals...
ld c,a : ld a,(hl) : NEXT_HL
BLOCKCOPY
ld c,a : ld a,(hl)
NEXT_HL : COPYBC
; next we read the first byte of the offset
push de : ld e,(hl) : NEXT_HL
push de : ld e,(hl)
; the top bit of token is set if the offset contains two bytes
and #8F : jp p,ShortOffset
LongOffset: ; read second byte of the offset
ld d,(hl) : NEXT_HL
NEXT_HL : ld d,(hl)
add -128+3 : cp 15+3 : jp c,CopyMatch
IFNDEF UNROLL_LONG_MATCHES
; MMMM=15 indicates a multi-byte match length
LongerMatch: add (hl) : NEXT_HL : jr nc,CopyMatch
LongerMatch: NEXT_HL : add (hl) : jr nc,CopyMatch
; the codes are designed to overflow;
; the overflow value 1 means read 1 extra byte
; and overflow value 0 means read 2 extra bytes
.code1 ld b,a : ld c,(hl) : NEXT_HL : jr nz,CopyMatch.UseC
.code0 ld b,(hl) : NEXT_HL
.code1 ld b,a : NEXT_HL : ld c,(hl) : jr nz,CopyMatch.UseC
.code0 NEXT_HL : ld b,(hl)
; the two-byte match length equal to zero
; designates the end-of-data marker
ld a,b : or c : jr nz,CopyMatch.UseC
pop de : ret
ELSE
; MMMM=15 indicates a multi-byte match length
LongerMatch: NEXT_HL : add (hl) : jr c,VeryLongMatch
ld c,a
.UseC NEXT_HL : ex (sp),hl
ADD_OFFSET
COPY1 : COPY1
; this is an unrolled equivalent of LDIR
xor a : sub c
and 16-1 : add a
ld (.jrOffset),a : jr nz,$+2
.jrOffset EQU $-1
.fastLDIR DUP 16
COPY1
EDUP
jp pe,.fastLDIR
jp CopyMatch.popSrc
VeryLongMatch: ; the codes are designed to overflow;
; the overflow value 1 means read 1 extra byte
; and overflow value 0 means read 2 extra bytes
.code1 ld b,a : NEXT_HL : ld c,(hl) : jr nz,LongerMatch.UseC
.code0 NEXT_HL : ld b,(hl)
; the two-byte match length equal to zero
; designates the end-of-data marker
ld a,b : or c : jr nz,LongerMatch.UseC
pop de : ret
ENDIF
MoreLiterals: ; there are three possible situations here
xor (hl) : NEXT_HL : exa
ld a,7 : add (hl) : NEXT_HL : jr c,ManyLiterals
xor (hl) : exa
ld a,7 : NEXT_HL : add (hl) : jr c,ManyLiterals
CopyLiterals: ld c,a
.UseC BLOCKCOPY
.UseC NEXT_HL : COPYBC
push de : ld e,(hl) : NEXT_HL
push de : ld e,(hl)
exa : jp p,ShortOffset : jr LongOffset
ManyLiterals:
.code1 ld b,a : ld c,(hl) : NEXT_HL : jr nz,CopyLiterals.UseC
.code0 ld b,(hl) : NEXT_HL : jr CopyLiterals.UseC
.code1 ld b,a : NEXT_HL : ld c,(hl) : jr nz,CopyLiterals.UseC
.code0 NEXT_HL : ld b,(hl) : jr CopyLiterals.UseC

View File

@ -1,5 +1,12 @@
;
; Size-optimized LZSA decompressor by spke (v.1 23/04/2019, 69 bytes)
; Size-optimized LZSA1 decompressor by spke & uniabis (67 bytes)
;
; ver.00 by spke for LZSA 0.5.4 (23/04/2019, 69 bytes);
; ver.01 by spke for LZSA 1.0.5 (24/07/2019, added support for backward decompression);
; ver.02 by uniabis (30/07/2019, 68(-1) bytes, +3.2% speed);
; ver.03 by spke for LZSA 1.0.7 (31/07/2019, small re-organization of macros);
; ver.04 by spke (06/08/2019, 67(-1) bytes, -1.2% speed);
; ver.05 by spke for LZSA 1.1.0 (25/09/2019, added full revision history)
;
; The data must be compressed using the command line compressor by Emmanuel Marty
; The compression is done as follows:
@ -12,7 +19,7 @@
;
; ld hl,FirstByteOfCompressedData
; ld de,FirstByteOfMemoryForDecompressedData
; call DecompressLZSA
; call DecompressLZSA1
;
; Backward compression is also supported; you can compress files backward using:
;
@ -22,11 +29,11 @@
;
; ld hl,LastByteOfCompressedData
; ld de,LastByteOfMemoryForDecompressedData
; call DecompressLZSA
; call DecompressLZSA1
;
; (do not forget to uncomment the BACKWARD_DECOMPRESS option in the decompressor).
;
; Of course, LZSA compression algorithm is (c) 2019 Emmanuel Marty,
; Of course, LZSA compression algorithms are (c) 2019 Emmanuel Marty,
; see https://github.com/emmanuel-marty/lzsa for more information
;
; Drop me an email if you have any comments/ideas/suggestions: zxintrospec@gmail.com
@ -49,43 +56,43 @@
; DEFINE BACKWARD_DECOMPRESS
IFDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
or a : sbc hl,de
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ELSE
IFNDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
inc hl
ENDM
MACRO ADD_OFFSET
add hl,de
ex de,hl : add hl,de
ENDM
MACRO BLOCKCOPY
ldir
ENDM
ELSE
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
push hl : or a : sbc hl,de : pop de ; 11+4+15+10 = 40t / 5 bytes
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ENDIF
@DecompressLZSA:
@DecompressLZSA1:
ld b,0
; first a byte token "O|LLL|MMMM" is read from the stream,
; where LLL is the number of literals and MMMM is
; a length of the match that follows after the literals
ReadToken: ld a,(hl) : exa : ld a,(hl) : NEXT_HL
ReadToken: ld a,(hl) : NEXT_HL : push af
and #70 : jr z,NoLiterals
rrca : rrca : rrca : rrca ; LLL<7 means 0..6 literals...
@ -94,10 +101,10 @@ ReadToken: ld a,(hl) : exa : ld a,(hl) : NEXT_HL
ld c,a : BLOCKCOPY
; next we read the low byte of the -offset
NoLiterals: push de : ld e,(hl) : NEXT_HL : ld d,#FF
NoLiterals: pop af : push de : ld e,(hl) : NEXT_HL : ld d,#FF
; the top bit of token is set if
; the offset contains the high byte as well
exa : or a : jp p,ShortOffset
or a : jp p,ShortOffset
LongOffset: ld d,(hl) : NEXT_HL
@ -106,10 +113,10 @@ ShortOffset: and #0F : add 3 ; MMMM<15 means match lengths 0+3..14+3
cp 15+3 : call z,ReadLongBA ; MMMM=15 means lengths 14+3+
ld c,a
ex (sp),hl : push hl ; BC = len, DE = -offset, HL = dest, SP ->[dest,src]
ADD_OFFSET : pop de ; BC = len, DE = dest, HL = dest+(-offset), SP->[src]
BLOCKCOPY : pop hl ; BC = 0, DE = dest, HL = src
jr ReadToken
ex (sp),hl ; BC = len, DE = -offset, HL = dest, SP -> [src]
ADD_OFFSET ; BC = len, DE = dest, HL = dest+(-offset), SP -> [src]
BLOCKCOPY ; BC = 0, DE = dest
pop hl : jr ReadToken ; HL = src
; a standard routine to read extended codes
; into registers B (higher byte) and A (lower byte).

View File

@ -1,5 +1,14 @@
;
; Speed-optimized LZSA2 decompressor by spke (v.1 02-07/06/2019, 218 bytes)
; Speed-optimized LZSA2 decompressor by spke & uniabis (216 bytes)
;
; ver.00 by spke for LZSA 1.0.0 (02-07/06/2019, 218 bytes);
; ver.01 by spke for LZSA 1.0.5 (24/07/2019, added support for backward decompression);
; ver.02 by spke for LZSA 1.0.6 (27/07/2019, fixed a bug in the backward decompressor);
; ver.03 by uniabis (30/07/2019, 213(-5) bytes, +3.8% speed and support for Hitachi HD64180);
; ver.04 by spke for LZSA 1.0.7 (01/08/2019, 214(+1) bytes, +0.2% speed and small re-organization of macros);
; ver.05 by spke (27/08/2019, 216(+2) bytes, +1.1% speed);
; ver.06 by spke for LZSA 1.1.0 (26/09/2019, added full revision history);
; ver.07 by spke for LZSA 1.1.1 (10/10/2019, +0.2% speed and an option for unrolled copying of long matches)
;
; The data must be compressed using the command line compressor by Emmanuel Marty
; The compression is done as follows:
@ -26,7 +35,7 @@
;
; (do not forget to uncomment the BACKWARD_DECOMPRESS option in the decompressor).
;
; Of course, LZSA2 compression algorithm is (c) 2019 Emmanuel Marty,
; Of course, LZSA2 compression algorithms are (c) 2019 Emmanuel Marty,
; see https://github.com/emmanuel-marty/lzsa for more information
;
; Drop me an email if you have any comments/ideas/suggestions: zxintrospec@gmail.com
@ -47,77 +56,89 @@
; misrepresented as being the original software.
; 3. This notice may not be removed or altered from any source distribution.
; DEFINE BACKWARD_DECOMPRESS
; DEFINE UNROLL_LONG_MATCHES ; uncomment for faster decompression of very compressible data (+38 bytes)
; DEFINE BACKWARD_DECOMPRESS ; uncomment for data compressed with option -b
; DEFINE HD64180 ; uncomment for systems using Hitachi HD64180
IFDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
or a : sbc hl,de
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ELSE
IFNDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
inc hl
ENDM
MACRO ADD_OFFSET
add hl,de
ex de,hl : add hl,de
ENDM
MACRO BLOCKCOPY
MACRO COPY1
ldi
ENDM
MACRO COPYBC
ldir
ENDM
ELSE
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
ex de,hl : ld a,e : sub l : ld l,a
ld a,d : sbc h : ld h,a ; 4*4+3*4 = 28t / 7 bytes
ENDM
MACRO COPY1
ldd
ENDM
MACRO COPYBC
lddr
ENDM
ENDIF
IFNDEF HD64180
MACRO LD_IX_DE
ld ixl,e : ld ixh,d
ENDM
MACRO LD_DE_IX
ld e,ixl : ld d,ixh
ENDM
ELSE
MACRO LD_IX_DE
push de : pop ix
ENDM
MACRO LD_DE_IX
push ix : pop de
ENDM
ENDIF
@DecompressLZSA2:
; A' stores next nibble as %1111.... or assumed to contain trash
; B is assumed to be 0
xor a : ld b,a : exa : jr ReadToken
ld b,0 : scf : exa : jr ReadToken
LongerMatch: exa : jp m,.noUpdate
ld a,(hl) : or #F0 : exa
ld a,(hl) : NEXT_HL : or #0F
rrca : rrca : rrca : rrca
.noUpdate sub #F0-9 : cp 15+9 : jr c,CopyMatch
;inc a : jr z,LongMatch : sub #F0-9+1 : jp CopyMatch
LongMatch: ;ld a,24 :
add (hl) : NEXT_HL : jr nc,CopyMatch
ManyLiterals: ld a,18 : add (hl) : NEXT_HL : jr nc,CopyLiterals
ld c,(hl) : NEXT_HL
ld b,(hl) : NEXT_HL
jr nz,CopyMatch.useC
pop de : ret
ManyLiterals: ld a,18 :
add (hl) : NEXT_HL : jr nc,CopyLiterals
ld c,(hl) : NEXT_HL
ld a,b : ld b,(hl) : inc hl
jr CopyLiterals.useBC
ld a,b : ld b,(hl)
jr ReadToken.NEXTHLuseBC
MoreLiterals: ld b,(hl) : NEXT_HL
exa : jp m,.noUpdate
scf : exa : jr nc,.noUpdate
ld a,(hl) : or #F0 : exa
ld a,(hl) : NEXT_HL : or #0F
@@ -126,18 +147,25 @@ MoreLiterals: ld b,(hl) : NEXT_HL
.noUpdate ;sub #F0-3 : cp 15+3 : jr z,ManyLiterals
inc a : jr z,ManyLiterals : sub #F0-3+1
CopyLiterals: ld c,a
.useC ld a,b : ld b,0
.useBC BLOCKCOPY
push de : or a : jp p,CASE0xx : jr CASE1xx
CopyLiterals: ld c,a : ld a,b : ld b,0
COPYBC
push de : or a : jp p,CASE0xx ;: jr CASE1xx
cp %11000000 : jr c,CASE10x
CASE11x cp %11100000 : jr c,CASE110
; "111": repeated offset
CASE111: LD_DE_IX : jr MatchLen
Literals0011: jr nz,MoreLiterals
; if "LL" of the byte token is equal to 0,
; there are no literals to copy
NoLiterals: xor (hl) : NEXT_HL
NoLiterals: or (hl) : NEXT_HL
push de : jp m,CASE1xx
; short (5 or 9 bit long) offsets
@@ -146,47 +174,54 @@ CASE0xx ld d,#FF : cp %01000000 : jr c,CASE00x
; "01x": the case of the 9-bit offset
CASE01x: cp %01100000 : rl d
ReadOffsetE: ld e,(hl) : NEXT_HL
ReadOffsetE ld e,(hl) : NEXT_HL
SaveOffset: ld ixl,e : ld ixh,d
SaveOffset: LD_IX_DE
MatchLen: inc a : and %00000111 : jr z,LongerMatch : inc a
CopyMatch: ld c,a
.useC ex (sp),hl : push hl ; BC = len, DE = offset, HL = dest, SP ->[dest,src]
ADD_OFFSET : pop de ; BC = len, DE = dest, HL = dest-offset, SP->[src]
BLOCKCOPY : pop hl
.useC ex (sp),hl ; BC = len, DE = offset, HL = dest, SP ->[dest,src]
ADD_OFFSET ; BC = len, DE = dest, HL = dest-offset, SP->[src]
COPY1
COPYBC
.popSrc pop hl
; compressed data stream contains records
; each record begins with the byte token "XYZ|LL|MMM"
ReadToken: ld a,(hl) : and %00011000 : jr z,NoLiterals
ReadToken: ld a,(hl) : and %00011000 : jp pe,Literals0011 ; process the cases 00 and 11 separately
jp pe,MoreLiterals ; 00 has already been processed; this identifies the case of 11
rrca : rrca : rrca
ld c,a : ld a,(hl) : NEXT_HL ; token is re-read for further processing
BLOCKCOPY
ld c,a : ld a,(hl) ; token is re-read for further processing
.NEXTHLuseBC NEXT_HL
COPYBC
; the token and literals are followed by the offset
push de : or a : jp p,CASE0xx
CASE1xx cp %11000000 : jr nc,CASE11x
; "10x": the case of the 5-bit offset
CASE10x: ld c,a : xor a
exa : jp m,.noUpdate
; "10x": the case of the 13-bit offset
CASE10x: ld c,a : exa : jr nc,.noUpdate
ld a,(hl) : or #F0 : exa
ld a,(hl) : NEXT_HL : or #0F
rrca : rrca : rrca : rrca
.noUpdate ld d,a : ld a,c
cp %10100000 : rl d
dec d : dec d : jr ReadOffsetE
cp %10100000 : dec d : rl d : jr ReadOffsetE
; "110": 16-bit offset
CASE110: ld d,(hl) : NEXT_HL : jr ReadOffsetE
; "00x": the case of the 5-bit offset
CASE00x: ld c,a : xor a
exa : jp m,.noUpdate
CASE00x: ld c,a : exa : jr nc,.noUpdate
ld a,(hl) : or #F0 : exa
ld a,(hl) : NEXT_HL : or #0F
@@ -195,24 +230,52 @@ CASE00x: ld c,a : xor a
.noUpdate ld e,a : ld a,c
cp %00100000 : rl e : jp SaveOffset
; two remaining cases
CASE11x cp %11100000 : jr c,CASE110
; "111": repeated offset
CASE111: ld e,ixl : ld d,ixh : jr MatchLen
; "110": 16-bit offset
CASE110: ld d,(hl) : NEXT_HL : jr ReadOffsetE
LongerMatch: scf : exa : jr nc,.noUpdate
ld a,(hl) : or #F0 : exa
ld a,(hl) : NEXT_HL : or #0F
rrca : rrca : rrca : rrca
.noUpdate sub #F0-9 : cp 15+9 : jr c,CopyMatch
IFNDEF UNROLL_LONG_MATCHES
LongMatch: add (hl) : NEXT_HL : jr nc,CopyMatch
ld c,(hl) : NEXT_HL
ld b,(hl) : NEXT_HL : jr nz,CopyMatch.useC
pop de : ret
ELSE
LongMatch: add (hl) : NEXT_HL : jr c,VeryLongMatch
ld c,a
.useC ex (sp),hl
ADD_OFFSET
COPY1
; this is an unrolled equivalent of LDIR
xor a : sub c
and 8-1 : add a
ld (.jrOffset),a : jr nz,$+2
.jrOffset EQU $-1
.fastLDIR DUP 8
COPY1
EDUP
jp pe,.fastLDIR
jp CopyMatch.popSrc
VeryLongMatch: ld c,(hl) : NEXT_HL
ld b,(hl) : NEXT_HL : jr nz,LongMatch.useC
pop de : ret
ENDIF
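The UNROLL_LONG_MATCHES option above replaces LDIR with eight inlined copy steps and a patched `jr` displacement that jumps into the middle of the unrolled run, so any copy length is handled without a per-byte loop branch. A hedged C analogue of the same idea (a Duff's-device style copy; names are illustrative, and like LDIR it copies strictly left to right so overlapping forward copies replicate bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative C analogue of the unrolled-LDIR trick: enter an 8-way
 * unrolled byte-copy loop at a computed position so any count works.
 * Copies strictly left to right, so an overlapping forward copy
 * (src < dst) replicates bytes exactly like LDIR does. */
static void copy_unrolled(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (n == 0)
        return;
    size_t rounds = (n + 7) / 8;   /* passes through the unrolled body */
    switch (n % 8) {               /* computed entry, like the patched jr */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}
```

The Z80 version patches a `jr` displacement instead of switching, at a cost of +38 to +57 bytes per the commit messages; the payoff comes on highly compressible data dominated by long matches.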


@@ -1,5 +1,12 @@
;
; Size-optimized LZSA2 decompressor by spke (v.1 02-09/06/2019, 145 bytes)
; Size-optimized LZSA2 decompressor by spke & uniabis (139 bytes)
;
; ver.00 by spke for LZSA 1.0.0 (02-09/06/2019, 145 bytes);
; ver.01 by spke for LZSA 1.0.5 (24/07/2019, added support for backward decompression);
; ver.02 by uniabis (30/07/2019, 144(-1) bytes, +3.3% speed and support for Hitachi HD64180);
; ver.03 by spke for LZSA 1.0.7 (01/08/2019, 140(-4) bytes, -1.4% speed and small re-organization of macros);
; ver.04 by spke for LZSA 1.1.0 (26/09/2019, removed usage of IY, added full revision history)
; ver.05 by spke for LZSA 1.1.1 (11/10/2019, 139(-1) bytes, +0.1% speed)
;
; The data must be compressed using the command line compressor by Emmanuel Marty
; The compression is done as follows:
@@ -26,7 +33,7 @@
;
; (do not forget to uncomment the BACKWARD_DECOMPRESS option in the decompressor).
;
; Of course, LZSA2 compression algorithm is (c) 2019 Emmanuel Marty,
; Of course, LZSA2 compression algorithms are (c) 2019 Emmanuel Marty,
; see https://github.com/emmanuel-marty/lzsa for more information
;
; Drop me an email if you have any comments/ideas/suggestions: zxintrospec@gmail.com
@@ -48,57 +55,85 @@
; 3. This notice may not be removed or altered from any source distribution.
;
; DEFINE BACKWARD_DECOMPRESS
; DEFINE BACKWARD_DECOMPRESS ; uncomment for data compressed with option -b
; DEFINE HD64180 ; uncomment for systems using Hitachi HD64180
IFDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
or a : sbc hl,de
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ELSE
IFNDEF BACKWARD_DECOMPRESS
MACRO NEXT_HL
inc hl
ENDM
MACRO ADD_OFFSET
add hl,de
ex de,hl : add hl,de
ENDM
MACRO BLOCKCOPY
ldir
ENDM
ELSE
MACRO NEXT_HL
dec hl
ENDM
MACRO ADD_OFFSET
push hl : or a : sbc hl,de : pop de ; 11+4+15+10 = 40t / 5 bytes
ENDM
MACRO BLOCKCOPY
lddr
ENDM
ENDIF
IFNDEF HD64180
MACRO LD_IX_DE
ld ixl,e : ld ixh,d
ENDM
MACRO LD_DE_IX
ld e,ixl : ld d,ixh
ENDM
ELSE
MACRO LD_IX_DE
push de : pop ix
ENDM
MACRO LD_DE_IX
push ix : pop de
ENDM
ENDIF
@DecompressLZSA2:
xor a : ld b,a : exa : jr ReadToken
CASE00x: call ReadNibble
ld e,a : ld a,c
cp %00100000 : rl e : jr SaveOffset
CASE0xx ld d,#FF : cp %01000000 : jr c,CASE00x
CASE01x: cp %01100000 : rl d
OffsetReadE: ld e,(hl) : NEXT_HL
SaveOffset: ld iyl,e : ld iyh,d
SaveOffset: LD_IX_DE
MatchLen: and %00000111 : add 2 : cp 9 : call z,ExtendedCode
CopyMatch: ld c,a
ex (sp),hl : push hl ; BC = len, DE = offset, HL = dest, SP ->[dest,src]
ADD_OFFSET : pop de ; BC = len, DE = dest, HL = dest-offset, SP->[src]
BLOCKCOPY : pop hl
ex (sp),hl ; BC = len, DE = -offset, HL = dest, SP -> [src]
ADD_OFFSET ; BC = len, DE = dest, HL = dest+(-offset), SP -> [src]
BLOCKCOPY ; BC = 0, DE = dest
pop hl ; HL = src
ReadToken: ld a,(hl) : ld ixl,a : NEXT_HL
ReadToken: ld a,(hl) : NEXT_HL : push af
and %00011000 : jr z,NoLiterals
rrca : rrca : rrca
@@ -107,32 +142,28 @@ ReadToken: ld a,(hl) : ld ixl,a : NEXT_HL
ld c,a
BLOCKCOPY
NoLiterals: push de : ld a,ixl
NoLiterals: pop af : push de
or a : jp p,CASE0xx
CASE1xx cp %11000000 : jr nc,CASE11x
CASE10x: call ReadNibble
ld d,a : ld a,c
cp %10100000 : rl d
dec d : dec d : jr OffsetReadE
cp %10100000 ;: rl d
dec d : rl d : DB #CA ; jr OffsetReadE ; #CA is JP Z,.. to skip all commands in CASE110 before jr OffsetReadE
CASE00x: call ReadNibble
ld e,a : ld a,c
cp %00100000 : rl e : jr SaveOffset
CASE110: ld d,(hl) : NEXT_HL : jr OffsetReadE
CASE11x cp %11100000 : jr c,CASE110
CASE111: ld e,iyl : ld d,iyh : jr MatchLen
CASE110: ld d,(hl) : NEXT_HL : jr OffsetReadE
CASE111: LD_DE_IX : jr MatchLen
ExtendedCode: call ReadNibble : inc a : jr z,ExtraByte
sub #F0+1 : add c : ret
ExtraByte ld a,15 : add c : add (hl) : NEXT_HL : ret nc
ld a,(hl) : NEXT_HL
ld b,(hl) : NEXT_HL : ret nz
pop de : pop de : ret
pop de : pop de ; RET is not needed, because RET from ReadNibble is sufficient
ReadNibble: ld c,a : xor a : exa : ret m
UpdateNibble ld a,(hl) : or #F0 : exa
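Both decompressors rely on the nibble-caching scheme hinted at by the `A'` comments: LZSA2 packs extra length and offset bits as 4-bit nibbles, two per byte, and the reader returns the high nibble first while caching the low one (the Z80 code keeps the cache in `A'`, with the high bits `or #F0` acting as the "nibble present" marker). A minimal C sketch with illustrative names:

```c
#include <assert.h>

/* Sketch of an LZSA2-style nibble reader: the first request consumes a
 * byte and returns its high nibble; the low nibble is cached and
 * returned by the next request without touching the input stream. */
struct nibble_reader {
    const unsigned char *src;
    int cached;                   /* -1 when no nibble is cached */
};

static unsigned int read_nibble(struct nibble_reader *r)
{
    if (r->cached >= 0) {         /* a cached low nibble is pending */
        unsigned int n = (unsigned int)r->cached;
        r->cached = -1;
        return n;
    }
    unsigned char b = *r->src++;  /* fetch a fresh packed byte */
    r->cached = b & 0x0F;         /* save low nibble for the next call */
    return b >> 4;                /* high nibble is consumed first */
}
```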



@@ -96,6 +96,6 @@ int lzsa_dictionary_load(const char *pszDictionaryFilename, void **ppDictionaryD
void lzsa_dictionary_free(void **ppDictionaryData) {
if (*ppDictionaryData) {
free(*ppDictionaryData);
ppDictionaryData = NULL;
*ppDictionaryData = NULL;
}
}
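The one-character fix above matters: `ppDictionaryData = NULL` only reassigns the local parameter, while `*ppDictionaryData = NULL` clears the caller's pointer after `free()`, so it cannot be dereferenced or freed again later. A minimal illustration, with a stand-in payload:

```c
#include <assert.h>
#include <stdlib.h>

/* Buggy: assigning to the parameter only changes the local copy;
 * the caller's pointer is left dangling after free(). */
static void free_buggy(void **pp) {
    if (*pp) { free(*pp); pp = NULL; }
}

/* Fixed (as in the patch): assign through the pointer so the caller's
 * variable is reset to NULL. */
static void free_fixed(void **pp) {
    if (*pp) { free(*pp); *pp = NULL; }
}
```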


@@ -1,5 +1,5 @@
/*
* expand_v1.c - LZSA1 block decompressor implementation
* expand_block_v1.c - LZSA1 block decompressor implementation
*
* Copyright (C) 2019 Emmanuel Marty
*
@@ -166,7 +166,7 @@ int lzsa_decompressor_expand_block_v1(const unsigned char *pInBlock, int nBlockS
const unsigned char *pSrc = pCurOutData - nMatchOffset;
if (pSrc >= pOutData) {
unsigned int nMatchLen = (unsigned int)(token & 0x0f);
if (nMatchLen != MATCH_RUN_LEN_V1 && nMatchOffset >= 8 && pCurOutData < pOutDataFastEnd) {
if (nMatchLen != MATCH_RUN_LEN_V1 && nMatchOffset >= 8 && pCurOutData < pOutDataFastEnd && (pSrc + 18) <= pOutDataEnd) {
memcpy(pCurOutData, pSrc, 8);
memcpy(pCurOutData + 8, pSrc + 8, 8);
memcpy(pCurOutData + 16, pSrc + 16, 2);
@@ -181,27 +181,32 @@ int lzsa_decompressor_expand_block_v1(const unsigned char *pInBlock, int nBlockS
break;
}
if ((pCurOutData + nMatchLen) <= pOutDataEnd) {
/* Do a deterministic, left to right byte copy instead of memcpy() so as to handle overlaps */
if ((pSrc + nMatchLen) <= pOutDataEnd) {
if ((pCurOutData + nMatchLen) <= pOutDataEnd) {
/* Do a deterministic, left to right byte copy instead of memcpy() so as to handle overlaps */
if (nMatchOffset >= 16 && (pCurOutData + nMatchLen) < (pOutDataFastEnd - 15)) {
const unsigned char *pCopySrc = pSrc;
unsigned char *pCopyDst = pCurOutData;
const unsigned char *pCopyEndDst = pCurOutData + nMatchLen;
if (nMatchOffset >= 16 && (pCurOutData + nMatchLen) < (pOutDataFastEnd - 15)) {
const unsigned char *pCopySrc = pSrc;
unsigned char *pCopyDst = pCurOutData;
const unsigned char *pCopyEndDst = pCurOutData + nMatchLen;
do {
memcpy(pCopyDst, pCopySrc, 16);
pCopySrc += 16;
pCopyDst += 16;
} while (pCopyDst < pCopyEndDst);
do {
memcpy(pCopyDst, pCopySrc, 16);
pCopySrc += 16;
pCopyDst += 16;
} while (pCopyDst < pCopyEndDst);
pCurOutData += nMatchLen;
pCurOutData += nMatchLen;
}
else {
while (nMatchLen) {
*pCurOutData++ = *pSrc++;
nMatchLen--;
}
}
}
else {
while (nMatchLen) {
*pCurOutData++ = *pSrc++;
nMatchLen--;
}
return -1;
}
}
else {
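The restructured bounds checks above guard both `pSrc + nMatchLen` and `pCurOutData + nMatchLen` before falling back to the "deterministic, left to right byte copy". The direction is essential: when the match offset is smaller than the match length, source and destination overlap, `memcpy()` is undefined for overlapping regions, and the copy must re-read bytes it has just written. A standalone sketch of that fallback path:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expand one LZ match: copy nLen bytes starting nOffset bytes behind the
 * write position. When nOffset < nLen the regions overlap, and a strict
 * left-to-right byte copy re-reads freshly written bytes; an offset-1
 * match therefore acts as run-length encoding. */
static void lz_copy_match(unsigned char *pCurOutData, size_t nOffset, size_t nLen)
{
    const unsigned char *pSrc = pCurOutData - nOffset;
    while (nLen--)
        *pCurOutData++ = *pSrc++;
}
```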


@@ -1,5 +1,5 @@
/*
* expand_v1.h - LZSA1 block decompressor definitions
* expand_block_v1.h - LZSA1 block decompressor definitions
*
* Copyright (C) 2019 Emmanuel Marty
*
@@ -30,8 +30,8 @@
*
*/
#ifndef _EXPAND_V1_H
#define _EXPAND_V1_H
#ifndef _EXPAND_BLOCK_V1_H
#define _EXPAND_BLOCK_V1_H
/**
* Decompress one LZSA1 data block
@@ -46,4 +46,4 @@
*/
int lzsa_decompressor_expand_block_v1(const unsigned char *pInBlock, int nBlockSize, unsigned char *pOutData, int nOutDataOffset, int nBlockMaxSize);
#endif /* _EXPAND_V1_H */
#endif /* _EXPAND_BLOCK_V1_H */


@@ -1,5 +1,5 @@
/*
* expand_v2.c - LZSA2 block decompressor implementation
* expand_block_v2.c - LZSA2 block decompressor implementation
*
* Copyright (C) 2019 Emmanuel Marty
*
@@ -195,7 +195,7 @@ int lzsa_decompressor_expand_block_v2(const unsigned char *pInBlock, int nBlockS
const unsigned char *pSrc = pCurOutData - nMatchOffset;
if (pSrc >= pOutData) {
unsigned int nMatchLen = (unsigned int)(token & 0x07);
if (nMatchLen != MATCH_RUN_LEN_V2 && nMatchOffset >= 8 && pCurOutData < pOutDataFastEnd) {
if (nMatchLen != MATCH_RUN_LEN_V2 && nMatchOffset >= 8 && pCurOutData < pOutDataFastEnd && (pSrc + 10) <= pOutDataEnd) {
memcpy(pCurOutData, pSrc, 8);
memcpy(pCurOutData + 8, pSrc + 8, 2);
pCurOutData += (MIN_MATCH_SIZE_V2 + nMatchLen);
@@ -209,27 +209,32 @@ int lzsa_decompressor_expand_block_v2(const unsigned char *pInBlock, int nBlockS
break;
}
if ((pCurOutData + nMatchLen) <= pOutDataEnd) {
/* Do a deterministic, left to right byte copy instead of memcpy() so as to handle overlaps */
if ((pSrc + nMatchLen) <= pOutDataEnd) {
if ((pCurOutData + nMatchLen) <= pOutDataEnd) {
/* Do a deterministic, left to right byte copy instead of memcpy() so as to handle overlaps */
if (nMatchOffset >= 16 && (pCurOutData + nMatchLen) < (pOutDataFastEnd - 15)) {
const unsigned char *pCopySrc = pSrc;
unsigned char *pCopyDst = pCurOutData;
const unsigned char *pCopyEndDst = pCurOutData + nMatchLen;
if (nMatchOffset >= 16 && (pCurOutData + nMatchLen) < (pOutDataFastEnd - 15)) {
const unsigned char *pCopySrc = pSrc;
unsigned char *pCopyDst = pCurOutData;
const unsigned char *pCopyEndDst = pCurOutData + nMatchLen;
do {
memcpy(pCopyDst, pCopySrc, 16);
pCopySrc += 16;
pCopyDst += 16;
} while (pCopyDst < pCopyEndDst);
do {
memcpy(pCopyDst, pCopySrc, 16);
pCopySrc += 16;
pCopyDst += 16;
} while (pCopyDst < pCopyEndDst);
pCurOutData += nMatchLen;
pCurOutData += nMatchLen;
}
else {
while (nMatchLen) {
*pCurOutData++ = *pSrc++;
nMatchLen--;
}
}
}
else {
while (nMatchLen) {
*pCurOutData++ = *pSrc++;
nMatchLen--;
}
return -1;
}
}
else {


@@ -1,5 +1,5 @@
/*
* expand_v2.h - LZSA2 block decompressor definitions
* expand_block_v2.h - LZSA2 block decompressor definitions
*
* Copyright (C) 2019 Emmanuel Marty
*
@@ -30,8 +30,8 @@
*
*/
#ifndef _EXPAND_V2_H
#define _EXPAND_V2_H
#ifndef _EXPAND_BLOCK_V2_H
#define _EXPAND_BLOCK_V2_H
/**
* Decompress one LZSA2 data block
@@ -46,4 +46,4 @@
*/
int lzsa_decompressor_expand_block_v2(const unsigned char *pInBlock, int nBlockSize, unsigned char *pOutData, int nOutDataOffset, int nBlockMaxSize);
#endif /* _EXPAND_V2_H */
#endif /* _EXPAND_BLOCK_V2_H */


@@ -1,138 +0,0 @@
/*
* hashmap.c - integer hashmap implementation
*
* Copyright (C) 2019 Emmanuel Marty
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/*
* Uses the libdivsufsort library Copyright (c) 2003-2008 Yuta Mori
*
* Inspired by LZ4 by Yann Collet. https://github.com/lz4/lz4
* With help, ideas, optimizations and speed measurements by spke <zxintrospec@gmail.com>
* With ideas from Lizard by Przemyslaw Skibinski and Yann Collet. https://github.com/inikep/lizard
* Also with ideas from smallz4 by Stephan Brumme. https://create.stephan-brumme.com/smallz4/
*
*/
#include <stdlib.h>
#include <string.h>
#include "hashmap.h"
/**
* Generate key hash by mixing
*
* @param key key to get hash for
*
* @return hash
*/
static unsigned int lzsa_hashmap_get_hash(unsigned long long key) {
key = (~key) + (key << 21);
key = key ^ (key >> 24);
key = (key + (key << 3)) + (key << 8);
key = key ^ (key >> 14);
key = (key + (key << 2)) + (key << 4);
key = key ^ (key >> 28);
key = key + (key << 31);
return key & (LZSA_HASH_NBUCKETS - 1);
}
/**
* Initialize hashmap
*
* @param pHashMap hashmap
*/
void lzsa_hashmap_init(lzsa_hashmap_t *pHashMap) {
pHashMap->pBuffer = NULL;
memset(pHashMap->pBucket, 0, sizeof(lzsa_hashvalue_t *) * LZSA_HASH_NBUCKETS);
}
/**
* Set value for key
*
* @param pHashMap hashmap
* @param key key to set value for
* @param value new value
*/
void lzsa_hashmap_insert(lzsa_hashmap_t *pHashMap, unsigned long long key, unsigned int value) {
unsigned int hash = lzsa_hashmap_get_hash(key);
lzsa_hashvalue_t **pBucket = &pHashMap->pBucket[hash];
while (*pBucket) {
if ((*pBucket)->key == key) {
(*pBucket)->value = value;
return;
}
pBucket = &((*pBucket)->pNext);
}
if (!pHashMap->pBuffer || pHashMap->pBuffer->nFreeEntryIdx >= 255) {
lzsa_hashbuffer_t *pNewBuffer = (lzsa_hashbuffer_t *)malloc(sizeof(lzsa_hashbuffer_t));
if (!pNewBuffer) return;
pNewBuffer->pNext = pHashMap->pBuffer;
pNewBuffer->nFreeEntryIdx = 0;
pHashMap->pBuffer = pNewBuffer;
}
*pBucket = &pHashMap->pBuffer->value[pHashMap->pBuffer->nFreeEntryIdx++];
(*pBucket)->pNext = NULL;
(*pBucket)->key = key;
(*pBucket)->value = value;
}
/**
* Get value for key
*
* @param pHashMap hashmap
* @param key key to get value for
* @param pValue pointer to where to store value if found
*
* @return 0 if found, nonzero if not found
*/
int lzsa_hashmap_find(lzsa_hashmap_t *pHashMap, unsigned long long key, unsigned int *pValue) {
unsigned int hash = lzsa_hashmap_get_hash(key);
lzsa_hashvalue_t **pBucket = &pHashMap->pBucket[hash];
while (*pBucket) {
if ((*pBucket)->key == key) {
*pValue = (*pBucket)->value;
return 0;
}
pBucket = &((*pBucket)->pNext);
}
return -1;
}
/**
* Clear hashmap
*
* @param pHashMap hashmap
*/
void lzsa_hashmap_clear(lzsa_hashmap_t *pHashMap) {
while (pHashMap->pBuffer) {
lzsa_hashbuffer_t *pCurBuffer = pHashMap->pBuffer;
pHashMap->pBuffer = pCurBuffer->pNext;
free(pCurBuffer);
pCurBuffer = NULL;
}
memset(pHashMap->pBucket, 0, sizeof(lzsa_hashvalue_t *) * LZSA_HASH_NBUCKETS);
}
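For reference, the bucket selection in the deleted hashmap.c is a Thomas Wang style 64-bit integer mix, masked down to a power-of-two bucket count. Extracted as a standalone function:

```c
#include <assert.h>

#define LZSA_HASH_NBUCKETS 256   /* power of two, as in the removed hashmap.h */

/* Thomas Wang style 64-bit integer mix, masked down to a bucket index;
 * mirrors lzsa_hashmap_get_hash() from the removed hashmap.c. */
static unsigned int hash_bucket(unsigned long long key)
{
    key = (~key) + (key << 21);
    key = key ^ (key >> 24);
    key = (key + (key << 3)) + (key << 8);
    key = key ^ (key >> 14);
    key = (key + (key << 2)) + (key << 4);
    key = key ^ (key >> 28);
    key = key + (key << 31);
    return (unsigned int)(key & (LZSA_HASH_NBUCKETS - 1));
}
```

The final mask only works because LZSA_HASH_NBUCKETS is a power of two; the mixing steps make every key bit influence the low bits the mask keeps.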


@@ -1,99 +0,0 @@
/*
* hashmap.h - integer hashmap definitions
*
* Copyright (C) 2019 Emmanuel Marty
*
* This software is provided 'as-is', without any express or implied
* warranty. In no event will the authors be held liable for any damages
* arising from the use of this software.
*
* Permission is granted to anyone to use this software for any purpose,
* including commercial applications, and to alter it and redistribute it
* freely, subject to the following restrictions:
*
* 1. The origin of this software must not be misrepresented; you must not
* claim that you wrote the original software. If you use this software
* in a product, an acknowledgment in the product documentation would be
* appreciated but is not required.
* 2. Altered source versions must be plainly marked as such, and must not be
* misrepresented as being the original software.
* 3. This notice may not be removed or altered from any source distribution.
*/
/*
* Uses the libdivsufsort library Copyright (c) 2003-2008 Yuta Mori
*
* Inspired by LZ4 by Yann Collet. https://github.com/lz4/lz4
* With help, ideas, optimizations and speed measurements by spke <zxintrospec@gmail.com>
* With ideas from Lizard by Przemyslaw Skibinski and Yann Collet. https://github.com/inikep/lizard
* Also with ideas from smallz4 by Stephan Brumme. https://create.stephan-brumme.com/smallz4/
*
*/
#ifndef _HASHMAP_H
#define _HASHMAP_H
#include <stdlib.h>
/** Number of hashmap buckets */
#define LZSA_HASH_NBUCKETS 256
/* Forward definitions */
typedef struct _lzsa_hashvalue_t lzsa_hashvalue_t;
typedef struct _lzsa_hashbuffer_t lzsa_hashbuffer_t;
/** One hashmap bucket entry */
typedef struct _lzsa_hashvalue_t {
lzsa_hashvalue_t *pNext;
unsigned long long key;
unsigned int value;
} lzsa_hashvalue_t;
/** One buffer storing hashmap bucket entries */
typedef struct _lzsa_hashbuffer_t {
lzsa_hashbuffer_t *pNext;
int nFreeEntryIdx;
lzsa_hashvalue_t value[255];
} lzsa_hashbuffer_t;
/** Hashmap */
typedef struct {
lzsa_hashbuffer_t *pBuffer;
lzsa_hashvalue_t *pBucket[LZSA_HASH_NBUCKETS];
} lzsa_hashmap_t;
/**
* Initialize hashmap
*
* @param pHashMap hashmap
*/
void lzsa_hashmap_init(lzsa_hashmap_t *pHashMap);
/**
* Set value for key
*
* @param pHashMap hashmap
* @param key key to set value for
* @param value new value
*/
void lzsa_hashmap_insert(lzsa_hashmap_t *pHashMap, unsigned long long key, unsigned int value);
/**
* Get value for key
*
* @param pHashMap hashmap
* @param key key to get value for
* @param pValue pointer to where to store value if found
*
* @return 0 if found, nonzero if not found
*/
int lzsa_hashmap_find(lzsa_hashmap_t *pHashMap, unsigned long long key, unsigned int *pValue);
/**
* Clear hashmap
*
* @param pHashMap hashmap
*/
void lzsa_hashmap_clear(lzsa_hashmap_t *pHashMap);
#endif /* _HASHMAP_H */


@@ -46,8 +46,9 @@
#define OPT_RAW 2
#define OPT_FAVOR_RATIO 4
#define OPT_RAW_BACKWARD 8
#define OPT_STATS 16
#define TOOL_VERSION "1.0.5"
#define TOOL_VERSION "1.1.2"
/*---------------------------------------------------------------------------*/
@@ -104,6 +105,7 @@ static int do_compress(const char *pszInFilename, const char *pszOutFilename, co
int nCommandCount = 0, nSafeDist = 0;
int nFlags;
lzsa_status_t nStatus;
lzsa_stats stats;
nFlags = 0;
if (nOptions & OPT_FAVOR_RATIO)
@@ -117,7 +119,7 @@ static int do_compress(const char *pszInFilename, const char *pszOutFilename, co
nStartTime = do_get_time();
}
nStatus = lzsa_compress_file(pszInFilename, pszOutFilename, pszDictionaryFilename, nFlags, nMinMatchSize, nFormatVersion, compression_progress, &nOriginalSize, &nCompressedSize, &nCommandCount, &nSafeDist);
nStatus = lzsa_compress_file(pszInFilename, pszOutFilename, pszDictionaryFilename, nFlags, nMinMatchSize, nFormatVersion, compression_progress, &nOriginalSize, &nCompressedSize, &nCommandCount, &nSafeDist, &stats);
if ((nOptions & OPT_VERBOSE)) {
nEndTime = do_get_time();
@@ -149,6 +151,32 @@ static int do_compress(const char *pszInFilename, const char *pszOutFilename, co
}
}
if (nOptions & OPT_STATS) {
if (stats.literals_divisor > 0)
fprintf(stdout, "Literals: min: %d avg: %d max: %d count: %d\n", stats.min_literals, stats.total_literals / stats.literals_divisor, stats.max_literals, stats.literals_divisor);
else
fprintf(stdout, "Literals: none\n");
if (stats.match_divisor > 0) {
fprintf(stdout, "Offsets: min: %d avg: %d max: %d reps: %d count: %d\n", stats.min_offset, stats.total_offsets / stats.match_divisor, stats.max_offset, stats.num_rep_offsets, stats.match_divisor);
fprintf(stdout, "Match lens: min: %d avg: %d max: %d count: %d\n", stats.min_match_len, stats.total_match_lens / stats.match_divisor, stats.max_match_len, stats.match_divisor);
}
else {
fprintf(stdout, "Offsets: none\n");
fprintf(stdout, "Match lens: none\n");
}
if (stats.rle1_divisor > 0) {
fprintf(stdout, "RLE1 lens: min: %d avg: %d max: %d count: %d\n", stats.min_rle1_len, stats.total_rle1_lens / stats.rle1_divisor, stats.max_rle1_len, stats.rle1_divisor);
}
else {
fprintf(stdout, "RLE1 lens: none\n");
}
if (stats.rle2_divisor > 0) {
fprintf(stdout, "RLE2 lens: min: %d avg: %d max: %d count: %d\n", stats.min_rle2_len, stats.total_rle2_lens / stats.rle2_divisor, stats.max_rle2_len, stats.rle2_divisor);
}
else {
fprintf(stdout, "RLE2 lens: none\n");
}
}
return 0;
}
@@ -277,8 +305,10 @@ int comparestream_open(lzsa_stream_t *stream, const char *pszCompareFilename, co
stream->close = comparestream_close;
return 0;
}
else
else {
free(pCompareStream);
return -1;
}
}
static int do_compare(const char *pszInFilename, const char *pszOutFilename, const char *pszDictionaryFilename, const unsigned int nOptions, int nFormatVersion) {
@@ -486,7 +516,7 @@ static int do_self_test(const unsigned int nOptions, const int nMinMatchSize, in
float fMatchProbability;
fprintf(stdout, "size %zd", nGeneratedDataSize);
for (fMatchProbability = ((nOptions & OPT_RAW) ? 0.5f : 0); fMatchProbability <= 0.995f; fMatchProbability += fProbabilitySizeStep) {
for (fMatchProbability = 0; fMatchProbability <= 0.995f; fMatchProbability += fProbabilitySizeStep) {
int nNumLiteralValues[12] = { 1, 2, 3, 15, 30, 56, 96, 137, 178, 191, 255, 256 };
float fXorProbability;
@@ -1007,6 +1037,13 @@ int main(int argc, char **argv) {
else
bArgsError = true;
}
else if (!strcmp(argv[i], "-stats")) {
if ((nOptions & OPT_STATS) == 0) {
nOptions |= OPT_STATS;
}
else
bArgsError = true;
}
else {
if (!pszInFilename)
pszInFilename = argv[i];
@@ -1036,6 +1073,7 @@ int main(int argc, char **argv) {
fprintf(stderr, " -cbench: benchmark in-memory compression\n");
fprintf(stderr, " -dbench: benchmark in-memory decompression\n");
fprintf(stderr, " -test: run automated self-tests\n");
fprintf(stderr, " -stats: show compressed data stats\n");
fprintf(stderr, " -v: be verbose\n");
fprintf(stderr, " -f <value>: LZSA compression format (1-2)\n");
fprintf(stderr, " -r: raw block format (max. 64 Kb files)\n");
@@ -1052,7 +1090,9 @@ int main(int argc, char **argv) {
if (cCommand == 'z') {
int nResult = do_compress(pszInFilename, pszOutFilename, pszDictionaryFilename, nOptions, nMinMatchSize, nFormatVersion);
if (nResult == 0 && bVerifyCompression) {
nResult = do_compare(pszOutFilename, pszInFilename, pszDictionaryFilename, nOptions, nFormatVersion);
return do_compare(pszOutFilename, pszInFilename, pszDictionaryFilename, nOptions, nFormatVersion);
} else {
return nResult;
}
}
else if (cCommand == 'd') {


@@ -35,6 +35,17 @@
#include "format.h"
#include "lib.h"
/**
* Hash index into TAG_BITS
*
* @param nIndex index value
*
* @return hash
*/
static inline int lzsa_get_index_tag(unsigned int nIndex) {
return (int)(((unsigned long long)nIndex * 11400714819323198485ULL) >> (64ULL - TAG_BITS));
}
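lzsa_get_index_tag() above is Fibonacci (multiplicative) hashing: 11400714819323198485 is 2^64 divided by the golden ratio, and keeping the top TAG_BITS bits of the 64-bit product gives a well-distributed tag even for sequential indices. A standalone sketch; TAG_BITS is assumed to be 4 here purely for illustration, the real value comes from the compressor's headers:

```c
#include <assert.h>

#define TAG_BITS 4   /* illustrative; the real value comes from the lzsa headers */

/* Fibonacci hashing: multiply by 2^64/phi (the golden ratio) and keep
 * the top TAG_BITS bits of the product. Mirrors lzsa_get_index_tag(). */
static int index_tag(unsigned int nIndex)
{
    return (int)(((unsigned long long)nIndex * 11400714819323198485ULL) >> (64 - TAG_BITS));
}
```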
/**
* Parse input data, build suffix array and overlaid data structures to speed up match finding
*
@@ -78,15 +89,33 @@ int lzsa_build_suffix_array(lzsa_compressor *pCompressor, const unsigned char *p
* and the interval builder below doesn't need it either. */
intervals[0] &= POS_MASK;
int nMinMatchSize = pCompressor->min_match_size;
for (i = 1; i < nInWindowSize - 1; i++) {
int nIndex = (int)(intervals[i] & POS_MASK);
int nLen = PLCP[nIndex];
if (nLen < nMinMatchSize)
nLen = 0;
if (nLen > LCP_MAX)
nLen = LCP_MAX;
intervals[i] = ((unsigned int)nIndex) | (((unsigned int)nLen) << LCP_SHIFT);
if (pCompressor->format_version >= 2) {
for (i = 1; i < nInWindowSize - 1; i++) {
int nIndex = (int)(intervals[i] & POS_MASK);
int nLen = PLCP[nIndex];
if (nLen < nMinMatchSize)
nLen = 0;
if (nLen > LCP_MAX)
nLen = LCP_MAX;
int nTaggedLen = 0;
if (nLen)
nTaggedLen = (nLen << TAG_BITS) | (lzsa_get_index_tag((unsigned int)nIndex) & ((1 << TAG_BITS) - 1));
intervals[i] = ((unsigned int)nIndex) | (((unsigned int)nTaggedLen) << LCP_SHIFT);
}
}
else {
for (i = 1; i < nInWindowSize - 1; i++) {
int nIndex = (int)(intervals[i] & POS_MASK);
int nLen = PLCP[nIndex];
if (nLen < nMinMatchSize)
nLen = 0;
if (nLen > LCP_AND_TAG_MAX)
nLen = LCP_AND_TAG_MAX;
intervals[i] = ((unsigned int)nIndex) | (((unsigned int)nLen) << LCP_SHIFT);
}
}
if (i < nInWindowSize)
intervals[i] &= POS_MASK;
@@ -219,7 +248,12 @@ int lzsa_find_matches_at(lzsa_compressor *pCompressor, const int nOffset, lzsa_m
int nMatchOffset = (int)(nOffset - match_pos);
if (nMatchOffset <= MAX_OFFSET) {
matchptr->length = (unsigned short)(ref >> LCP_SHIFT);
if (pCompressor->format_version >= 2) {
matchptr->length = (unsigned short)(ref >> (LCP_SHIFT + TAG_BITS));
}
else {
matchptr->length = (unsigned short)(ref >> LCP_SHIFT);
}
matchptr->offset = (unsigned short)nMatchOffset;
matchptr++;
}
@@ -253,35 +287,26 @@ void lzsa_skip_matches(lzsa_compressor *pCompressor, const int nStartOffset, con
}
/**
* Find all matches for the data to be compressed. Up to NMATCHES_PER_OFFSET matches are stored for each offset, for
* the optimizer to look at.
* Find all matches for the data to be compressed
*
* @param pCompressor compression context
* @param nMatchesPerOffset maximum number of matches to store for each offset
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
*/
void lzsa_find_all_matches(lzsa_compressor *pCompressor, const int nStartOffset, const int nEndOffset) {
lzsa_match *pMatch = pCompressor->match + (nStartOffset << MATCHES_PER_OFFSET_SHIFT);
void lzsa_find_all_matches(lzsa_compressor *pCompressor, const int nMatchesPerOffset, const int nStartOffset, const int nEndOffset) {
lzsa_match *pMatch = pCompressor->match + (nStartOffset * nMatchesPerOffset);
int i;
for (i = nStartOffset; i < nEndOffset; i++) {
int nMatches = lzsa_find_matches_at(pCompressor, i, pMatch, NMATCHES_PER_OFFSET);
int m;
int nMatches = lzsa_find_matches_at(pCompressor, i, pMatch, nMatchesPerOffset);
for (m = 0; m < NMATCHES_PER_OFFSET; m++) {
if (nMatches <= m || i > (nEndOffset - LAST_MATCH_OFFSET)) {
pMatch->length = 0;
pMatch->offset = 0;
}
else {
int nMaxLen = (nEndOffset - LAST_LITERALS) - i;
if (nMaxLen < 0)
nMaxLen = 0;
if (pMatch->length > nMaxLen)
pMatch->length = (unsigned short)nMaxLen;
}
pMatch++;
while (nMatches < nMatchesPerOffset) {
pMatch[nMatches].length = 0;
pMatch[nMatches].offset = 0;
nMatches++;
}
pMatch += nMatchesPerOffset;
}
}


@@ -74,14 +74,14 @@ int lzsa_find_matches_at(lzsa_compressor *pCompressor, const int nOffset, lzsa_m
void lzsa_skip_matches(lzsa_compressor *pCompressor, const int nStartOffset, const int nEndOffset);
/**
* Find all matches for the data to be compressed. Up to NMATCHES_PER_OFFSET matches are stored for each offset, for
* the optimizer to look at.
* Find all matches for the data to be compressed
*
* @param pCompressor compression context
* @param nMatchesPerOffset maximum number of matches to store for each offset
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
*/
void lzsa_find_all_matches(lzsa_compressor *pCompressor, const int nStartOffset, const int nEndOffset);
void lzsa_find_all_matches(lzsa_compressor *pCompressor, const int nMatchesPerOffset, const int nStartOffset, const int nEndOffset);
#ifdef __cplusplus
}


@@ -1,5 +1,5 @@
/*
* shrink_v1.c - LZSA1 block compressor implementation
* shrink_block_v1.c - LZSA1 block compressor implementation
*
* Copyright (C) 2019 Emmanuel Marty
*
@@ -139,110 +139,143 @@ static inline int lzsa_write_match_varlen_v1(unsigned char *pOutData, int nOutOf
}
/**
* Attempt to pick optimal matches, so as to produce the smallest possible output that decompresses to the same input
* Get offset encoding cost in bits
*
* @param nMatchOffset offset to get cost of
*
* @return cost in bits
*/
static inline int lzsa_get_offset_cost_v1(const unsigned int nMatchOffset) {
return (nMatchOffset <= 256) ? 8 : 16;
}
/**
* Attempt to pick optimal matches using a forward arrivals parser, so as to produce the smallest possible output that decompresses to the same input
*
* @param pCompressor compression context
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
*/
static void lzsa_optimize_matches_v1(lzsa_compressor *pCompressor, const int nStartOffset, const int nEndOffset) {
int *cost = (int*)pCompressor->pos_data; /* Reuse */
int nLastLiteralsOffset;
int nMinMatchSize = pCompressor->min_match_size;
static void lzsa_optimize_forward_v1(lzsa_compressor *pCompressor, lzsa_match *pBestMatch, const int nStartOffset, const int nEndOffset, const int nReduce) {
lzsa_arrival *arrival = pCompressor->arrival;
const int nMinMatchSize = pCompressor->min_match_size;
const int nFavorRatio = (pCompressor->flags & LZSA_FLAG_FAVOR_RATIO) ? 1 : 0;
int i;
const int nDisableScore = nReduce ? 0 : (2 * BLOCK_SIZE);
int i, j, n;
cost[nEndOffset - 1] = 8;
nLastLiteralsOffset = nEndOffset;
memset(arrival + (nStartOffset << MATCHES_PER_OFFSET_SHIFT), 0, sizeof(lzsa_arrival) * ((nEndOffset - nStartOffset) << MATCHES_PER_OFFSET_SHIFT));
for (i = nEndOffset - 2; i != (nStartOffset - 1); i--) {
int nBestCost, nBestMatchLen, nBestMatchOffset;
arrival[nStartOffset << MATCHES_PER_OFFSET_SHIFT].from_slot = -1;
int nLiteralsLen = nLastLiteralsOffset - i;
nBestCost = 8 + cost[i + 1];
if (nLiteralsLen == LITERALS_RUN_LEN_V1 || nLiteralsLen == 256 || nLiteralsLen == 512) {
/* Add to the cost of encoding literals as their number crosses a variable length encoding boundary.
* The cost automatically accumulates down the chain. */
nBestCost += 8;
}
if (pCompressor->match[(i + 1) << MATCHES_PER_OFFSET_SHIFT].length >= MIN_MATCH_SIZE_V1)
nBestCost += MODESWITCH_PENALTY;
nBestMatchLen = 0;
nBestMatchOffset = 0;
lzsa_match *pMatch = pCompressor->match + (i << MATCHES_PER_OFFSET_SHIFT);
for (i = nStartOffset; i != (nEndOffset - 1); i++) {
int m;
for (m = 0; m < NMATCHES_PER_OFFSET && pMatch[m].length >= nMinMatchSize; m++) {
int nMatchOffsetSize = (pMatch[m].offset <= 256) ? 8 : 16;
for (j = 0; j < NMATCHES_PER_OFFSET && arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].from_slot; j++) {
int nPrevCost = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].cost;
int nCodingChoiceCost = nPrevCost + 8 /* literal */;
int nScore = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].score + 1;
int nNumLiterals = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].num_literals + 1;
if (pMatch[m].length >= LEAVE_ALONE_MATCH_SIZE) {
int nCurCost;
int nMatchLen = pMatch[m].length;
if ((i + nMatchLen) > (nEndOffset - LAST_LITERALS))
nMatchLen = nEndOffset - LAST_LITERALS - i;
nCurCost = 8 + nMatchOffsetSize + lzsa_get_match_varlen_size_v1(nMatchLen - MIN_MATCH_SIZE_V1);
nCurCost += cost[i + nMatchLen];
if (pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].length >= MIN_MATCH_SIZE_V1)
nCurCost += MODESWITCH_PENALTY;
if (nBestCost > (nCurCost - nFavorRatio)) {
nBestCost = nCurCost;
nBestMatchLen = nMatchLen;
nBestMatchOffset = pMatch[m].offset;
}
if (nNumLiterals == LITERALS_RUN_LEN_V1 || nNumLiterals == 256 || nNumLiterals == 512) {
nCodingChoiceCost += 8;
}
else {
int nMatchLen = pMatch[m].length;
int k, nMatchRunLen;
if ((i + nMatchLen) > (nEndOffset - LAST_LITERALS))
nMatchLen = nEndOffset - LAST_LITERALS - i;
if (!nFavorRatio && nNumLiterals == 1)
nCodingChoiceCost += MODESWITCH_PENALTY;
nMatchRunLen = nMatchLen;
if (nMatchRunLen > MATCH_RUN_LEN_V1)
nMatchRunLen = MATCH_RUN_LEN_V1;
for (n = 0; n < NMATCHES_PER_OFFSET /* we only need the literals + short match cost + long match cost cases */; n++) {
lzsa_arrival *pDestArrival = &arrival[((i + 1) << MATCHES_PER_OFFSET_SHIFT) + n];
for (k = nMinMatchSize; k < nMatchRunLen; k++) {
int nCurCost;
if (pDestArrival->from_slot == 0 ||
nCodingChoiceCost < pDestArrival->cost ||
(nCodingChoiceCost == pDestArrival->cost && nScore < (pDestArrival->score + nDisableScore))) {
memmove(&arrival[((i + 1) << MATCHES_PER_OFFSET_SHIFT) + n + 1],
&arrival[((i + 1) << MATCHES_PER_OFFSET_SHIFT) + n],
sizeof(lzsa_arrival) * (NMATCHES_PER_OFFSET - n - 1));
nCurCost = 8 + nMatchOffsetSize /* no extra match len bytes */;
nCurCost += cost[i + k];
if (pCompressor->match[(i + k) << MATCHES_PER_OFFSET_SHIFT].length >= MIN_MATCH_SIZE_V1)
nCurCost += MODESWITCH_PENALTY;
if (nBestCost > (nCurCost - nFavorRatio)) {
nBestCost = nCurCost;
nBestMatchLen = k;
nBestMatchOffset = pMatch[m].offset;
}
}
for (; k <= nMatchLen; k++) {
int nCurCost;
nCurCost = 8 + nMatchOffsetSize + lzsa_get_match_varlen_size_v1(k - MIN_MATCH_SIZE_V1);
nCurCost += cost[i + k];
if (pCompressor->match[(i + k) << MATCHES_PER_OFFSET_SHIFT].length >= MIN_MATCH_SIZE_V1)
nCurCost += MODESWITCH_PENALTY;
if (nBestCost > (nCurCost - nFavorRatio)) {
nBestCost = nCurCost;
nBestMatchLen = k;
nBestMatchOffset = pMatch[m].offset;
}
pDestArrival->cost = nCodingChoiceCost;
pDestArrival->from_pos = i;
pDestArrival->from_slot = j + 1;
pDestArrival->match_offset = 0;
pDestArrival->match_len = 0;
pDestArrival->num_literals = nNumLiterals;
pDestArrival->score = nScore;
pDestArrival->rep_offset = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].rep_offset;
break;
}
}
}
if (nBestMatchLen >= MIN_MATCH_SIZE_V1)
nLastLiteralsOffset = i;
const lzsa_match *match = pCompressor->match + (i << 3);
cost[i] = nBestCost;
pMatch->length = nBestMatchLen;
pMatch->offset = nBestMatchOffset;
for (m = 0; m < 8 && match[m].length; m++) {
int nMatchLen = match[m].length;
int nMatchOffsetCost = lzsa_get_offset_cost_v1(match[m].offset);
int nStartingMatchLen, k;
if ((i + nMatchLen) > (nEndOffset - LAST_LITERALS))
nMatchLen = nEndOffset - LAST_LITERALS - i;
if (nMatchLen >= LEAVE_ALONE_MATCH_SIZE)
nStartingMatchLen = nMatchLen;
else
nStartingMatchLen = nMinMatchSize;
for (k = nStartingMatchLen; k <= nMatchLen; k++) {
int nMatchLenCost = lzsa_get_match_varlen_size_v1(k - MIN_MATCH_SIZE_V1);
for (j = 0; j < NMATCHES_PER_OFFSET && arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].from_slot; j++) {
int nPrevCost = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].cost;
int nCodingChoiceCost = nPrevCost + 8 /* token */ /* the actual cost of the literals themselves accumulates up the chain */ + nMatchOffsetCost + nMatchLenCost;
int nScore = arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].score + 5;
int exists = 0;
if (!nFavorRatio && !arrival[(i << MATCHES_PER_OFFSET_SHIFT) + j].num_literals)
nCodingChoiceCost += MODESWITCH_PENALTY;
for (n = 0;
n < NMATCHES_PER_OFFSET && arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n].from_slot && arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n].cost <= nCodingChoiceCost;
n++) {
if (lzsa_get_offset_cost_v1(arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n].rep_offset) == lzsa_get_offset_cost_v1(match[m].offset)) {
exists = 1;
break;
}
}
for (n = 0; !exists && n < NMATCHES_PER_OFFSET /* we only need the literals + short match cost + long match cost cases */; n++) {
lzsa_arrival *pDestArrival = &arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n];
if (pDestArrival->from_slot == 0 ||
nCodingChoiceCost < pDestArrival->cost ||
(nCodingChoiceCost == pDestArrival->cost && nScore < (pDestArrival->score + nDisableScore))) {
memmove(&arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n + 1],
&arrival[((i + k) << MATCHES_PER_OFFSET_SHIFT) + n],
sizeof(lzsa_arrival) * (NMATCHES_PER_OFFSET - n - 1));
pDestArrival->cost = nCodingChoiceCost;
pDestArrival->from_pos = i;
pDestArrival->from_slot = j + 1;
pDestArrival->match_offset = match[m].offset;
pDestArrival->match_len = k;
pDestArrival->num_literals = 0;
pDestArrival->score = nScore;
pDestArrival->rep_offset = match[m].offset;
break;
}
}
}
}
}
}
lzsa_arrival *end_arrival = &arrival[(i << MATCHES_PER_OFFSET_SHIFT) + 0];
pBestMatch[i].length = 0;
pBestMatch[i].offset = 0;
while (end_arrival->from_slot > 0 && end_arrival->from_pos >= 0) {
pBestMatch[end_arrival->from_pos].length = end_arrival->match_len;
pBestMatch[end_arrival->from_pos].offset = end_arrival->match_offset;
end_arrival = &arrival[(end_arrival->from_pos << MATCHES_PER_OFFSET_SHIFT) + (end_arrival->from_slot - 1)];
}
}
@@ -251,80 +284,63 @@ static void lzsa_optimize_matches_v1(lzsa_compressor *pCompressor, const int nSt
* impacting the compression ratio
*
* @param pCompressor compression context
* @param pBestMatch optimal matches to emit
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
*
* @return non-zero if the number of tokens was reduced, 0 if it wasn't
*/
static int lzsa_optimize_command_count_v1(lzsa_compressor *pCompressor, const int nStartOffset, const int nEndOffset) {
static int lzsa_optimize_command_count_v1(lzsa_compressor *pCompressor, lzsa_match *pBestMatch, const int nStartOffset, const int nEndOffset) {
int i;
int nNumLiterals = 0;
int nDidReduce = 0;
for (i = nStartOffset; i < nEndOffset; ) {
lzsa_match *pMatch = pCompressor->match + (i << MATCHES_PER_OFFSET_SHIFT);
lzsa_match *pMatch = pBestMatch + i;
if (pMatch->length >= MIN_MATCH_SIZE_V1) {
int nMatchLen = pMatch->length;
int nReduce = 0;
if (pMatch->length <= 9 /* Don't waste time considering large matches, they will always win over literals */ &&
(i + pMatch->length) < nEndOffset /* Don't consider the last token in the block, we can only reduce a match in between other tokens */) {
int nNextIndex = i + pMatch->length;
int nNextLiterals = 0;
if (nMatchLen <= 9 && (i + nMatchLen) < nEndOffset) /* max reducible command size: <token> <EE> <ll> <ll> <offset> <offset> <EE> <mm> <mm> */ {
int nMatchOffset = pMatch->offset;
int nEncodedMatchLen = nMatchLen - MIN_MATCH_SIZE_V1;
int nCommandSize = 8 /* token */ + lzsa_get_literals_varlen_size_v1(nNumLiterals) + ((nMatchOffset <= 256) ? 8 : 16) /* match offset */ + lzsa_get_match_varlen_size_v1(nEncodedMatchLen);
while (nNextIndex < nEndOffset && pBestMatch[nNextIndex].length < MIN_MATCH_SIZE_V1) {
nNextLiterals++;
nNextIndex++;
}
if (pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].length >= MIN_MATCH_SIZE_V1) {
if (nCommandSize >= ((nMatchLen << 3) + lzsa_get_literals_varlen_size_v1(nNumLiterals + nMatchLen))) {
/* This command is a match; the next command is also a match. The next command currently has no literals; replacing this command by literals will
* make the next command eat the cost of encoding the current number of literals, + nMatchLen extra literals. The size of the current match command is
* at least as much as the number of literal bytes + the extra cost of encoding them in the next match command, so we can safely replace the current
* match command by literals: the output size will not increase, and one command is removed. */
nReduce = 1;
/* This command is a match, is followed by 'nNextLiterals' literals and then by another match, or the end of the input. Calculate this command's current cost (excluding 'nNumLiterals' bytes) */
if ((8 /* token */ + lzsa_get_literals_varlen_size_v1(nNumLiterals) + ((pMatch->offset <= 256) ? 8 : 16) /* match offset */ + lzsa_get_match_varlen_size_v1(pMatch->length - MIN_MATCH_SIZE_V1) +
8 /* token */ + lzsa_get_literals_varlen_size_v1(nNextLiterals)) >=
(8 /* token */ + (pMatch->length << 3) + lzsa_get_literals_varlen_size_v1(nNumLiterals + pMatch->length + nNextLiterals))) {
/* Reduce */
int nMatchLen = pMatch->length;
int j;
for (j = 0; j < nMatchLen; j++) {
pBestMatch[i + j].length = 0;
}
}
else {
int nCurIndex = i + nMatchLen;
int nNextNumLiterals = 0;
do {
nCurIndex++;
nNextNumLiterals++;
} while (nCurIndex < nEndOffset && pCompressor->match[nCurIndex << MATCHES_PER_OFFSET_SHIFT].length < MIN_MATCH_SIZE_V1);
if (nCommandSize >= ((nMatchLen << 3) + lzsa_get_literals_varlen_size_v1(nNumLiterals + nNextNumLiterals + nMatchLen) - lzsa_get_literals_varlen_size_v1(nNextNumLiterals))) {
/* This command is a match, and is followed by literals, and then another match or the end of the input data. If encoding this match as literals doesn't take
* more room than the match, and doesn't grow the next match command's literals encoding, go ahead and remove the command. */
nReduce = 1;
}
}
}
if (nReduce) {
int j;
for (j = 0; j < nMatchLen; j++) {
pCompressor->match[(i + j) << MATCHES_PER_OFFSET_SHIFT].length = 0;
}
nNumLiterals += nMatchLen;
i += nMatchLen;
nDidReduce = 1;
}
else {
if ((i + nMatchLen) < nEndOffset && nMatchLen >= LCP_MAX &&
pMatch->offset && pMatch->offset <= 32 && pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].offset == pMatch->offset && (nMatchLen % pMatch->offset) == 0 &&
(nMatchLen + pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].length) <= MAX_VARLEN) {
/* Join */
pMatch->length += pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].length;
pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].offset = 0;
pCompressor->match[(i + nMatchLen) << MATCHES_PER_OFFSET_SHIFT].length = -1;
nDidReduce = 1;
continue;
}
nNumLiterals = 0;
i += nMatchLen;
}
if ((i + pMatch->length) < nEndOffset && pMatch->length >= LCP_MAX &&
pMatch->offset && pMatch->offset <= 32 && pBestMatch[i + pMatch->length].offset == pMatch->offset && (pMatch->length % pMatch->offset) == 0 &&
(pMatch->length + pBestMatch[i + pMatch->length].length) <= MAX_VARLEN) {
int nMatchLen = pMatch->length;
/* Join */
pMatch->length += pBestMatch[i + nMatchLen].length;
pBestMatch[i + nMatchLen].offset = 0;
pBestMatch[i + nMatchLen].length = -1;
continue;
}
i += pMatch->length;
nNumLiterals = 0;
}
else {
nNumLiterals++;
@@ -335,10 +351,63 @@ static int lzsa_optimize_command_count_v1(lzsa_compressor *pCompressor, const in
return nDidReduce;
}
/**
* Get compressed data block size
*
* @param pCompressor compression context
* @param pBestMatch optimal matches to emit
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
*
* @return size of compressed data that will be written to output buffer
*/
static int lzsa_get_compressed_size_v1(lzsa_compressor *pCompressor, lzsa_match *pBestMatch, const int nStartOffset, const int nEndOffset) {
int i;
int nNumLiterals = 0;
int nCompressedSize = 0;
for (i = nStartOffset; i < nEndOffset; ) {
const lzsa_match *pMatch = pBestMatch + i;
if (pMatch->length >= MIN_MATCH_SIZE_V1) {
int nMatchOffset = pMatch->offset;
int nMatchLen = pMatch->length;
int nEncodedMatchLen = nMatchLen - MIN_MATCH_SIZE_V1;
int nTokenLiteralsLen = (nNumLiterals >= LITERALS_RUN_LEN_V1) ? LITERALS_RUN_LEN_V1 : nNumLiterals;
int nTokenMatchLen = (nEncodedMatchLen >= MATCH_RUN_LEN_V1) ? MATCH_RUN_LEN_V1 : nEncodedMatchLen;
int nTokenLongOffset = (nMatchOffset <= 256) ? 0x00 : 0x80;
int nCommandSize = 8 /* token */ + lzsa_get_literals_varlen_size_v1(nNumLiterals) + (nNumLiterals << 3) + (nTokenLongOffset ? 16 : 8) /* match offset */ + lzsa_get_match_varlen_size_v1(nEncodedMatchLen);
nCompressedSize += nCommandSize;
nNumLiterals = 0;
i += nMatchLen;
}
else {
nNumLiterals++;
i++;
}
}
{
int nTokenLiteralsLen = (nNumLiterals >= LITERALS_RUN_LEN_V1) ? LITERALS_RUN_LEN_V1 : nNumLiterals;
int nCommandSize = 8 /* token */ + lzsa_get_literals_varlen_size_v1(nNumLiterals) + (nNumLiterals << 3);
nCompressedSize += nCommandSize;
nNumLiterals = 0;
}
if (pCompressor->flags & LZSA_FLAG_RAW_BLOCK) {
nCompressedSize += 8 * 4;
}
return nCompressedSize;
}
/**
* Emit block of compressed data
*
* @param pCompressor compression context
* @param pBestMatch optimal matches to emit
* @param pInWindow pointer to input data window (previously compressed bytes + bytes to compress)
* @param nStartOffset current offset in input window (typically the number of previously compressed bytes)
* @param nEndOffset offset to end finding matches at (typically the size of the total input window in bytes)
@@ -347,14 +416,14 @@ static int lzsa_optimize_command_count_v1(lzsa_compressor *pCompressor, const in
*
* @return size of compressed data in output buffer, or -1 if the data is uncompressible
*/
static int lzsa_write_block_v1(lzsa_compressor *pCompressor, const unsigned char *pInWindow, const int nStartOffset, const int nEndOffset, unsigned char *pOutData, const int nMaxOutDataSize) {
static int lzsa_write_block_v1(lzsa_compressor *pCompressor, lzsa_match *pBestMatch, const unsigned char *pInWindow, const int nStartOffset, const int nEndOffset, unsigned char *pOutData, const int nMaxOutDataSize) {
int i;
int nNumLiterals = 0;
int nInFirstLiteralOffset = 0;
int nOutOffset = 0;
for (i = nStartOffset; i < nEndOffset; ) {
lzsa_match *pMatch = pCompressor->match + (i << MATCHES_PER_OFFSET_SHIFT);
const lzsa_match *pMatch = pBestMatch + i;
if (pMatch->length >= MIN_MATCH_SIZE_V1) {
int nMatchOffset = pMatch->offset;
@@ -373,6 +442,13 @@ static int lzsa_write_block_v1(lzsa_compressor *pCompressor, const unsigned char
pOutData[nOutOffset++] = nTokenLongOffset | (nTokenLiteralsLen << 4) | nTokenMatchLen;
nOutOffset = lzsa_write_literals_varlen_v1(pOutData, nOutOffset, nNumLiterals);
if (nNumLiterals < pCompressor->stats.min_literals || pCompressor->stats.min_literals == -1)
pCompressor->stats.min_literals = nNumLiterals;
if (nNumLiterals > pCompressor->stats.max_literals)
pCompressor->stats.max_literals = nNumLiterals;
pCompressor->stats.total_literals += nNumLiterals;
pCompressor->stats.literals_divisor++;
if (nNumLiterals != 0) {
memcpy(pOutData + nOutOffset, pInWindow + nInFirstLiteralOffset, nNumLiterals);
nOutOffset += nNumLiterals;
@@ -384,6 +460,37 @@ static int lzsa_write_block_v1(lzsa_compressor *pCompressor, const unsigned char
pOutData[nOutOffset++] = (-nMatchOffset) >> 8;
}
nOutOffset = lzsa_write_match_varlen_v1(pOutData, nOutOffset, nEncodedMatchLen);
if (nMatchOffset < pCompressor->stats.min_offset || pCompressor->stats.min_offset == -1)
pCompressor->stats.min_offset = nMatchOffset;
if (nMatchOffset > pCompressor->stats.max_offset)
pCompressor->stats.max_offset = nMatchOffset;
pCompressor->stats.total_offsets += nMatchOffset;
if (nMatchLen < pCompressor->stats.min_match_len || pCompressor->stats.min_match_len == -1)
pCompressor->stats.min_match_len = nMatchLen;
if (nMatchLen > pCompressor->stats.max_match_len)
pCompressor->stats.max_match_len = nMatchLen;
pCompressor->stats.total_match_lens += nMatchLen;
pCompressor->stats.match_divisor++;
if (nMatchOffset == 1) {
if (nMatchLen < pCompressor->stats.min_rle1_len || pCompressor->stats.min_rle1_len == -1)
pCompressor->stats.min_rle1_len = nMatchLen;
if (nMatchLen > pCompressor->stats.max_rle1_len)
pCompressor->stats.max_rle1_len = nMatchLen;
pCompressor->stats.total_rle1_lens += nMatchLen;
pCompressor->stats.rle1_divisor++;
}
else if (nMatchOffset == 2) {
if (nMatchLen < pCompressor->stats.min_rle2_len || pCompressor->stats.min_rle2_len == -1)
pCompressor->stats.min_rle2_len = nMatchLen;
if (nMatchLen > pCompressor->stats.max_rle2_len)
pCompressor->stats.max_rle2_len = nMatchLen;
pCompressor->stats.total_rle2_lens += nMatchLen;
pCompressor->stats.rle2_divisor++;
}
i += nMatchLen;
if (pCompressor->flags & LZSA_FLAG_RAW_BLOCK) {
@@ -415,6 +522,13 @@ static int lzsa_write_block_v1(lzsa_compressor *pCompressor, const unsigned char
pOutData[nOutOffset++] = (nTokenLiteralsLen << 4) | 0x00;
nOutOffset = lzsa_write_literals_varlen_v1(pOutData, nOutOffset, nNumLiterals);
if (nNumLiterals < pCompressor->stats.min_literals || pCompressor->stats.min_literals == -1)
pCompressor->stats.min_literals = nNumLiterals;
if (nNumLiterals > pCompressor->stats.max_literals)
pCompressor->stats.max_literals = nNumLiterals;
pCompressor->stats.total_literals += nNumLiterals;
pCompressor->stats.literals_divisor++;
if (nNumLiterals != 0) {
memcpy(pOutData + nOutOffset, pInWindow + nInFirstLiteralOffset, nNumLiterals);
nOutOffset += nNumLiterals;
@@ -502,18 +616,42 @@ static int lzsa_write_raw_uncompressed_block_v1(lzsa_compressor *pCompressor, co
* @return size of compressed data in output buffer, or -1 if the data is uncompressible
*/
int lzsa_optimize_and_write_block_v1(lzsa_compressor *pCompressor, const unsigned char *pInWindow, const int nPreviousBlockSize, const int nInDataSize, unsigned char *pOutData, const int nMaxOutDataSize) {
int nResult;
int nResult, nBaseCompressedSize;
lzsa_optimize_matches_v1(pCompressor, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
/* Compress optimally without breaking ties in favor of fewer tokens */
lzsa_optimize_forward_v1(pCompressor, pCompressor->best_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize, 0 /* reduce */);
int nDidReduce;
int nPasses = 0;
do {
nDidReduce = lzsa_optimize_command_count_v1(pCompressor, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
nDidReduce = lzsa_optimize_command_count_v1(pCompressor, pCompressor->best_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
nPasses++;
} while (nDidReduce && nPasses < 20);
nResult = lzsa_write_block_v1(pCompressor, pInWindow, nPreviousBlockSize, nPreviousBlockSize + nInDataSize, pOutData, nMaxOutDataSize);
nBaseCompressedSize = lzsa_get_compressed_size_v1(pCompressor, pCompressor->best_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
lzsa_match *pBestMatch = pCompressor->best_match;
if (nBaseCompressedSize > 0 && nInDataSize < 65536) {
int nReducedCompressedSize;
/* Compress optimally and do break ties in favor of fewer tokens */
lzsa_optimize_forward_v1(pCompressor, pCompressor->improved_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize, 1 /* reduce */);
nPasses = 0;
do {
nDidReduce = lzsa_optimize_command_count_v1(pCompressor, pCompressor->improved_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
nPasses++;
} while (nDidReduce && nPasses < 20);
nReducedCompressedSize = lzsa_get_compressed_size_v1(pCompressor, pCompressor->improved_match, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
if (nReducedCompressedSize > 0 && nReducedCompressedSize <= nBaseCompressedSize) {
/* Pick the parse with the reduced number of tokens, since it did not increase the output size */
pBestMatch = pCompressor->improved_match;
}
}
nResult = lzsa_write_block_v1(pCompressor, pBestMatch, pInWindow, nPreviousBlockSize, nPreviousBlockSize + nInDataSize, pOutData, nMaxOutDataSize);
if (nResult < 0 && pCompressor->flags & LZSA_FLAG_RAW_BLOCK) {
nResult = lzsa_write_raw_uncompressed_block_v1(pCompressor, pInWindow, nPreviousBlockSize, nPreviousBlockSize + nInDataSize, pOutData, nMaxOutDataSize);
}


@@ -1,5 +1,5 @@
/*
* shrink_v1.h - LZSA1 block compressor definitions
* shrink_block_v1.h - LZSA1 block compressor definitions
*
* Copyright (C) 2019 Emmanuel Marty
*

File diff suppressed because it is too large


@@ -1,5 +1,5 @@
/*
* shrink_v2.h - LZSA2 block compressor definitions
* shrink_block_v2.h - LZSA2 block compressor definitions
*
* Copyright (C) 2019 Emmanuel Marty
*


@@ -59,21 +59,25 @@ int lzsa_compressor_init(lzsa_compressor *pCompressor, const int nMaxWindowSize,
pCompressor->pos_data = NULL;
pCompressor->open_intervals = NULL;
pCompressor->match = NULL;
pCompressor->selected_match = NULL;
pCompressor->best_match = NULL;
pCompressor->improved_match = NULL;
pCompressor->slot_cost = NULL;
pCompressor->repmatch_opt = NULL;
pCompressor->arrival = NULL;
pCompressor->min_match_size = nMinMatchSize;
if (pCompressor->min_match_size < nMinMatchSizeForFormat)
pCompressor->min_match_size = nMinMatchSizeForFormat;
else if (pCompressor->min_match_size > nMaxMinMatchForFormat)
pCompressor->min_match_size = nMaxMinMatchForFormat;
pCompressor->max_forward_depth = 0;
pCompressor->format_version = nFormatVersion;
pCompressor->flags = nFlags;
pCompressor->safe_dist = 0;
pCompressor->num_commands = 0;
memset(&pCompressor->stats, 0, sizeof(pCompressor->stats));
pCompressor->stats.min_literals = -1;
pCompressor->stats.min_match_len = -1;
pCompressor->stats.min_offset = -1;
pCompressor->stats.min_rle1_len = -1;
pCompressor->stats.min_rle2_len = -1;
if (!nResult) {
pCompressor->intervals = (unsigned int *)malloc(nMaxWindowSize * sizeof(unsigned int));
@@ -82,37 +86,26 @@ int lzsa_compressor_init(lzsa_compressor *pCompressor, const int nMaxWindowSize,
pCompressor->pos_data = (unsigned int *)malloc(nMaxWindowSize * sizeof(unsigned int));
if (pCompressor->pos_data) {
pCompressor->open_intervals = (unsigned int *)malloc((LCP_MAX + 1) * sizeof(unsigned int));
pCompressor->open_intervals = (unsigned int *)malloc((LCP_AND_TAG_MAX + 1) * sizeof(unsigned int));
if (pCompressor->open_intervals) {
pCompressor->match = (lzsa_match *)malloc(nMaxWindowSize * NMATCHES_PER_OFFSET * sizeof(lzsa_match));
pCompressor->arrival = (lzsa_arrival *)malloc(nMaxWindowSize * NMATCHES_PER_OFFSET * sizeof(lzsa_arrival));
if (pCompressor->match) {
if (pCompressor->format_version == 2) {
pCompressor->selected_match = (lzsa_match *)malloc(nMaxWindowSize * NMATCHES_PER_OFFSET * sizeof(lzsa_match));
if (pCompressor->arrival) {
pCompressor->best_match = (lzsa_match *)malloc(nMaxWindowSize * sizeof(lzsa_match));
if (pCompressor->selected_match) {
pCompressor->best_match = (lzsa_match *)malloc(nMaxWindowSize * sizeof(lzsa_match));
if (pCompressor->best_match) {
pCompressor->improved_match = (lzsa_match *)malloc(nMaxWindowSize * sizeof(lzsa_match));
if (pCompressor->best_match) {
pCompressor->improved_match = (lzsa_match *)malloc(nMaxWindowSize * sizeof(lzsa_match));
if (pCompressor->improved_match) {
pCompressor->slot_cost = (int *)malloc(nMaxWindowSize * NMATCHES_PER_OFFSET * sizeof(int));
if (pCompressor->slot_cost) {
pCompressor->repmatch_opt = (lzsa_repmatch_opt *)malloc(nMaxWindowSize * sizeof(lzsa_repmatch_opt));
if (pCompressor->repmatch_opt)
return 0;
}
}
}
if (pCompressor->improved_match) {
if (pCompressor->format_version == 2)
pCompressor->match = (lzsa_match *)malloc(nMaxWindowSize * 32 * sizeof(lzsa_match));
else
pCompressor->match = (lzsa_match *)malloc(nMaxWindowSize * 8 * sizeof(lzsa_match));
if (pCompressor->match)
return 0;
}
}
else {
return 0;
}
}
}
}
@@ -131,14 +124,9 @@ int lzsa_compressor_init(lzsa_compressor *pCompressor, const int nMaxWindowSize,
void lzsa_compressor_destroy(lzsa_compressor *pCompressor) {
divsufsort_destroy(&pCompressor->divsufsort_context);
if (pCompressor->repmatch_opt) {
free(pCompressor->repmatch_opt);
pCompressor->repmatch_opt = NULL;
}
if (pCompressor->slot_cost) {
free(pCompressor->slot_cost);
pCompressor->slot_cost = NULL;
if (pCompressor->match) {
free(pCompressor->match);
pCompressor->match = NULL;
}
if (pCompressor->improved_match) {
@@ -146,21 +134,16 @@ void lzsa_compressor_destroy(lzsa_compressor *pCompressor) {
pCompressor->improved_match = NULL;
}
if (pCompressor->arrival) {
free(pCompressor->arrival);
pCompressor->arrival = NULL;
}
if (pCompressor->best_match) {
free(pCompressor->best_match);
pCompressor->best_match = NULL;
}
if (pCompressor->selected_match) {
free(pCompressor->selected_match);
pCompressor->selected_match = NULL;
}
if (pCompressor->match) {
free(pCompressor->match);
pCompressor->match = NULL;
}
if (pCompressor->open_intervals) {
free(pCompressor->open_intervals);
pCompressor->open_intervals = NULL;
@@ -202,7 +185,7 @@ int lzsa_compressor_shrink_block(lzsa_compressor *pCompressor, unsigned char *pI
if (nPreviousBlockSize) {
lzsa_skip_matches(pCompressor, 0, nPreviousBlockSize);
}
lzsa_find_all_matches(pCompressor, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
lzsa_find_all_matches(pCompressor, (pCompressor->format_version == 2) ? 32 : 8, nPreviousBlockSize, nPreviousBlockSize + nInDataSize);
if (pCompressor->format_version == 1) {
nCompressedSize = lzsa_optimize_and_write_block_v1(pCompressor, pInWindow, nPreviousBlockSize, nInDataSize, pOutData, nMaxOutDataSize);


@@ -34,14 +34,15 @@
#define _SHRINK_CONTEXT_H
#include "divsufsort.h"
#include "hashmap.h"
#ifdef __cplusplus
extern "C" {
#endif
#define LCP_BITS 14
#define LCP_MAX (1U<<(LCP_BITS - 1))
#define TAG_BITS 3
#define LCP_MAX (1U<<(LCP_BITS - TAG_BITS - 1))
#define LCP_AND_TAG_MAX (1U<<(LCP_BITS - 1))
#define LCP_SHIFT (31-LCP_BITS)
#define LCP_MASK (((1U<<LCP_BITS) - 1) << LCP_SHIFT)
#define POS_MASK ((1U<<LCP_SHIFT) - 1)
@@ -56,7 +57,7 @@ extern "C" {
#define LAST_MATCH_OFFSET 4
#define LAST_LITERALS 1
#define MODESWITCH_PENALTY 1
#define MODESWITCH_PENALTY 3
/** One match */
typedef struct _lzsa_match {
@@ -64,12 +65,50 @@ typedef struct _lzsa_match {
unsigned short offset;
} lzsa_match;
/** One rep-match slot (for LZSA2) */
typedef struct _lzsa_repmatch_opt {
int incoming_offset;
short best_slot_for_incoming;
short expected_repmatch;
} lzsa_repmatch_opt;
/** Forward arrival slot */
typedef struct {
int cost;
int from_pos;
short from_slot;
unsigned short rep_offset;
unsigned short rep_len;
int rep_pos;
int num_literals;
int score;
unsigned short match_offset;
unsigned short match_len;
} lzsa_arrival;
/** Compression statistics */
typedef struct _lzsa_stats {
int min_literals;
int max_literals;
int total_literals;
int min_offset;
int max_offset;
int num_rep_offsets;
int total_offsets;
int min_match_len;
int max_match_len;
int total_match_lens;
int min_rle1_len;
int max_rle1_len;
int total_rle1_lens;
int min_rle2_len;
int max_rle2_len;
int total_rle2_lens;
int literals_divisor;
int match_divisor;
int rle1_divisor;
int rle2_divisor;
} lzsa_stats;
/** Compression context */
typedef struct _lzsa_compressor {
@@ -78,18 +117,15 @@ typedef struct _lzsa_compressor {
unsigned int *pos_data;
unsigned int *open_intervals;
lzsa_match *match;
lzsa_match *selected_match;
lzsa_match *best_match;
lzsa_match *improved_match;
int *slot_cost;
lzsa_repmatch_opt *repmatch_opt;
lzsa_arrival *arrival;
int min_match_size;
int max_forward_depth;
int format_version;
int flags;
int safe_dist;
int num_commands;
lzsa_hashmap_t cost_map;
lzsa_stats stats;
} lzsa_compressor;
/**


@@ -84,21 +84,6 @@ size_t lzsa_compress_inmem(unsigned char *pInputData, unsigned char *pOutBuffer,
}
}
if ((compressor.flags & LZSA_FLAG_FAVOR_RATIO)) {
if (nInputSize < 16384)
compressor.max_forward_depth = 25;
else {
if (nInputSize < 32768)
compressor.max_forward_depth = 15;
else {
if (nInputSize < BLOCK_SIZE)
compressor.max_forward_depth = 10;
else
compressor.max_forward_depth = 0;
}
}
}
int nPreviousBlockSize = 0;
int nNumBlocks = 0;


@@ -70,11 +70,13 @@ static void lzsa_delete_file(const char *pszInFilename) {
* @param pOriginalSize pointer to returned input(source) size, updated when this function is successful
* @param pCompressedSize pointer to returned output(compressed) size, updated when this function is successful
* @param pCommandCount pointer to returned token(compression commands) count, updated when this function is successful
* @param pSafeDist pointer to returned safe distance for raw blocks, updated when this function is successful
* @param pStats pointer to compression stats that are filled if this function is successful, or NULL
*
* @return LZSA_OK for success, or an error value from lzsa_status_t
*/
lzsa_status_t lzsa_compress_file(const char *pszInFilename, const char *pszOutFilename, const char *pszDictionaryFilename, const unsigned int nFlags, const int nMinMatchSize, const int nFormatVersion,
void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist) {
void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist, lzsa_stats *pStats) {
lzsa_stream_t inStream, outStream;
void *pDictionaryData = NULL;
int nDictionaryDataSize = 0;
@@ -98,7 +100,7 @@ lzsa_status_t lzsa_compress_file(const char *pszInFilename, const char *pszOutFi
return nStatus;
}
nStatus = lzsa_compress_stream(&inStream, &outStream, pDictionaryData, nDictionaryDataSize, nFlags, nMinMatchSize, nFormatVersion, progress, pOriginalSize, pCompressedSize, pCommandCount, pSafeDist);
nStatus = lzsa_compress_stream(&inStream, &outStream, pDictionaryData, nDictionaryDataSize, nFlags, nMinMatchSize, nFormatVersion, progress, pOriginalSize, pCompressedSize, pCommandCount, pSafeDist, pStats);
lzsa_dictionary_free(&pDictionaryData);
outStream.close(&outStream);
@@ -127,12 +129,14 @@ lzsa_status_t lzsa_compress_file(const char *pszInFilename, const char *pszOutFi
* @param pOriginalSize pointer to returned input(source) size, updated when this function is successful
* @param pCompressedSize pointer to returned output(compressed) size, updated when this function is successful
* @param pCommandCount pointer to returned token(compression commands) count, updated when this function is successful
* @param pSafeDist pointer to returned safe distance for raw blocks, updated when this function is successful
* @param pStats pointer to compression stats that are filled if this function is successful, or NULL
*
* @return LZSA_OK for success, or an error value from lzsa_status_t
*/
lzsa_status_t lzsa_compress_stream(lzsa_stream_t *pInStream, lzsa_stream_t *pOutStream, const void *pDictionaryData, int nDictionaryDataSize,
const unsigned int nFlags, const int nMinMatchSize, const int nFormatVersion,
void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist) {
void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist, lzsa_stats *pStats) {
unsigned char *pInData, *pOutData;
lzsa_compressor compressor;
long long nOriginalSize = 0LL, nCompressedSize = 0LL;
@@ -200,21 +204,6 @@ lzsa_status_t lzsa_compress_stream(lzsa_stream_t *pInStream, lzsa_stream_t *pOut
}
nDictionaryDataSize = 0;
-if (nNumBlocks == 0 && (compressor.flags & LZSA_FLAG_FAVOR_RATIO)) {
-if (nInDataSize < 16384)
-compressor.max_forward_depth = 25;
-else {
-if (nInDataSize < 32768)
-compressor.max_forward_depth = 15;
-else {
-if (nInDataSize < BLOCK_SIZE)
-compressor.max_forward_depth = 10;
-else
-compressor.max_forward_depth = 0;
-}
-}
-}
int nOutDataSize;
nOutDataSize = lzsa_compressor_shrink_block(&compressor, pInData + BLOCK_SIZE - nPreviousBlockSize, nPreviousBlockSize, nInDataSize, pOutData, ((nInDataSize + nRawPadding) >= BLOCK_SIZE) ? BLOCK_SIZE : (nInDataSize + nRawPadding));
@@ -302,6 +291,10 @@ lzsa_status_t lzsa_compress_stream(lzsa_stream_t *pInStream, lzsa_stream_t *pOut
int nCommandCount = lzsa_compressor_get_command_count(&compressor);
int nSafeDist = compressor.safe_dist;
+if (pStats)
+*pStats = compressor.stats;
lzsa_compressor_destroy(&compressor);
free(pOutData);


@@ -41,6 +41,7 @@ extern "C" {
/* Forward declaration */
typedef enum _lzsa_status_t lzsa_status_t;
+typedef struct _lzsa_stats lzsa_stats;
/*-------------- File API -------------- */
@@ -57,12 +58,14 @@ typedef enum _lzsa_status_t lzsa_status_t;
* @param pOriginalSize pointer to returned input(source) size, updated when this function is successful
* @param pCompressedSize pointer to returned output(compressed) size, updated when this function is successful
* @param pCommandCount pointer to returned token(compression commands) count, updated when this function is successful
* @param pSafeDist pointer to returned safe distance for raw blocks, updated when this function is successful
+* @param pStats pointer to compression stats that are filled if this function is successful, or NULL
*
* @return LZSA_OK for success, or an error value from lzsa_status_t
*/
lzsa_status_t lzsa_compress_file(const char *pszInFilename, const char *pszOutFilename, const char *pszDictionaryFilename,
const unsigned int nFlags, const int nMinMatchSize, const int nFormatVersion,
-void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist);
+void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist, lzsa_stats *pStats);
/*-------------- Streaming API -------------- */
@@ -80,12 +83,14 @@ lzsa_status_t lzsa_compress_file(const char *pszInFilename, const char *pszOutFi
* @param pOriginalSize pointer to returned input(source) size, updated when this function is successful
* @param pCompressedSize pointer to returned output(compressed) size, updated when this function is successful
* @param pCommandCount pointer to returned token(compression commands) count, updated when this function is successful
* @param pSafeDist pointer to returned safe distance for raw blocks, updated when this function is successful
+* @param pStats pointer to compression stats that are filled if this function is successful, or NULL
*
* @return LZSA_OK for success, or an error value from lzsa_status_t
*/
lzsa_status_t lzsa_compress_stream(lzsa_stream_t *pInStream, lzsa_stream_t *pOutStream, const void *pDictionaryData, int nDictionaryDataSize,
const unsigned int nFlags, const int nMinMatchSize, const int nFormatVersion,
-void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist);
+void(*progress)(long long nOriginalSize, long long nCompressedSize), long long *pOriginalSize, long long *pCompressedSize, int *pCommandCount, int *pSafeDist, lzsa_stats *pStats);
#ifdef __cplusplus
}