Update format spec, stats

2024-11-21 14:31:01 +00:00 · 2019-05-14 18:38:40 +02:00 · 2019-05-14 18:38:40 +02:00 · 635e575992
commit 635e575992
parent a708a02048
4 changed files with 207 additions and 101 deletions
--- a/BlockFormat_LZSA1.md
+++ b/BlockFormat_LZSA1.md
@@ -0,0 +1,65 @@
 # Block data format (LZSA1)
 Blocks encoded as LZSA1 are composed from consecutive commands. Each command follows this format:
 * token: <O|LLL|MMMM>
 * optional extra literal length
 * literal values
 * match offset low
 * optional match offset high
 * optional extra encoded match length
 **token**
 The token byte is broken down into three parts:
    7 6 5 4 3 2 1 0
    O L L L M M M M
 * L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'.
 * M: 4-bit encoded match length (0-14, or 15 if extended). Likewise, if the encoded match length for this command is 0 to 14, it is directly stored, otherwise 15 is stored and extra bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length offset by the minimum, which is 3 bytes. For instance, an actual match length of 10 bytes is encoded as 7.
 * O: set for a 2-byte match offset, clear for a 1-byte match offset.
 **optional extra literal length**
 If the literals length is 7 or more, the 'L' bits in the token form the value 7, and an extra byte follows here, with three possible types of value:
 * 0-248: the value is added to the 7 stored in the token, to compose the final literals length. For instance a length of 206 will be stored as 7 in the token + a single byte with the value of 199, as 7 + 199 = 206.
 * 250: a second byte follows. The final literals length is 256 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value of 250, and a final byte with the value of 243, as 256 + 243 = 499.
 * 249: a second and third byte follow, forming a little-endian 16-bit value. The final literals length is that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 249, 0 and 4, as (4 * 256) = 1024.
 The extension byte values are chosen so that all three cases can be detected on 8-bit CPUs with a simple addition and overflow check.
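The three cases above can be sketched in Python as follows. This is a minimal illustration rather than the reference decoder: the function name `read_literals_length` and its `(token, data, pos)` calling convention are hypothetical, and rejecting the byte values 251-255 is an assumption, since the text does not define them.

```python
def read_literals_length(token, data, pos):
    """Return (literals_length, new_pos) for one LZSA1 command.

    `pos` must point at the byte that follows the token.
    """
    length = (token >> 4) & 0x07           # the three L bits (bits 4-6)
    if length < 7:
        return length, pos                 # short form: no extra bytes
    extra = data[pos]
    pos += 1
    if extra == 250:                       # one more byte: 256 + byte
        return 256 + data[pos], pos + 1
    if extra == 249:                       # two more bytes: little-endian 16-bit
        return data[pos] | (data[pos + 1] << 8), pos + 2
    if extra > 250:                        # assumption: 251-255 are invalid
        raise ValueError("undefined literals length byte: %d" % extra)
    return 7 + extra, pos                  # 0-248: added to the 7 in the token
```

The worked examples from the list (206, 499 and 1024) all decode to the expected lengths with this sketch.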
 **literal values**
 Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.
 Important note: for blocks that are part of a stream, the last command in a block ends here, as it always contains literals only. For raw blocks, the last command does contain the match offset and match length, see the note below for EOD detection.
 **match offset low**
 The low 8 bits of the match offset follow.
 **optional match offset high**
 If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 1. For instance, a short offset of 0x70 is interpreted as 0xff70.
 **important note regarding match offsets: stored as negative values**
 Note that the match offset is negative: it is added to the current decompressed location, not subtracted from it, in order to locate the back-reference to copy.
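As an illustration of the offset rules, here is a hedged Python sketch. The name `read_match_offset` and its calling convention are hypothetical. Converting the 16-bit value to a negative Python integer by subtracting 0x10000 is an assumption that mirrors what 16-bit pointer arithmetic does on an 8-bit CPU.

```python
def read_match_offset(token, data, pos):
    """Return (negative_offset, new_pos) for one LZSA1 command.

    `pos` must point at the 'match offset low' byte.
    """
    low = data[pos]
    pos += 1
    if token & 0x80:                 # O bit set: a high byte is stored
        high = data[pos]
        pos += 1
    else:                            # O bit clear: high bits are all 1
        high = 0xFF
    value = (high << 8) | low        # e.g. 0x70 becomes 0xff70
    return value - 0x10000, pos      # as a negative number: 0xff70 -> -144
```

The source position of the copy is then `dst_pos + offset`, so an offset of -144 refers to the byte 144 positions before the current output location.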
 **optional extra encoded match length**
 If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
 * 0-237: the value is added to the 15 stored in the token. The final value is 3 + 15 + this byte.
 * 239: a second byte follows. The final match length is 256 + the second byte.
 * 238: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value.
 Again, the extension byte values are chosen so that all cases can be detected with a simple addition and overflow check on 8-bit CPUs.
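The match length cases can be sketched the same way. The function name `read_match_length` is hypothetical. Returning the 16-bit value of the 238 case unchanged, and rejecting the byte values 240-255, are assumptions made for this sketch. The text only says that the 16-bit value is the final encoded match length and that zero marks the end of a raw block.

```python
MIN_MATCH = 3  # minimum match size for LZSA1

def read_match_length(token, data, pos):
    """Return (match_length, new_pos) for one LZSA1 command.

    `pos` must point just past the match offset bytes.
    """
    encoded = token & 0x0F                 # the four M bits (bits 0-3)
    if encoded < 15:
        return MIN_MATCH + encoded, pos    # short form
    extra = data[pos]
    pos += 1
    if extra == 239:                       # one more byte: 256 + byte
        return 256 + data[pos], pos + 1
    if extra == 238:                       # two more bytes: little-endian 16-bit
        return data[pos] | (data[pos + 1] << 8), pos + 2
    if extra > 239:                        # assumption: 240-255 are invalid
        raise ValueError("undefined match length byte: %d" % extra)
    return MIN_MATCH + 15 + extra, pos     # 0-237: 3 + 15 + byte
```

A result of 0 can only come from the 16-bit branch, which is what makes it usable as an end-of-data marker in raw blocks.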
 # End Of Data detection for raw blocks
 When the LZSA1 block is part of a stream (see StreamFormat.md), as previously mentioned, the block ends after the literal values of the last command, without a match offset or match length.
 However, in a raw LZSA1 block, the last command does include a 1-byte match offset (set to zero) and a match length. The match length is encoded as a long zero: the 'M' bits in the token form the value 15, then an extra match length byte is present, with the value 238 ("two match length bytes follow"). Finally, a two-byte zero match length follows, indicating the end of the block. EOD is the only time a zero match length (which normally would indicate a copy of 3 bytes) is encoded as a large 2-byte match value. This allows the EOD test to exist in a rarely used code branch.
--- a/BlockFormat_LZSA2.md
+++ b/BlockFormat_LZSA2.md
@@ -0,0 +1,78 @@
 # Block data format (LZSA2)
 Blocks encoded as LZSA2 are composed from consecutive commands. Each command follows this format:
 * token: <XYZ|LL|MMM>
 * optional extra literal length
 * literal values
 * match offset
 * optional extra encoded match length
 **token**
 The token byte is broken down into three parts:
    7 6 5 4 3 2 1 0
    X Y Z L L M M M
 * L: 2-bit literals length (0-2, or 3 if extended). If the number of literals for this command is 0 to 2, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 3 is encoded and extra nibbles or bytes follow as 'optional extra literal length'.
 * M: 3-bit encoded match length (0-6, or 7 if extended). Likewise, if the encoded match length for this command is 0 to 6, it is directly stored, otherwise 7 is stored and extra nibbles or bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length offset by the minimum, which is 2 bytes. For instance, an actual match length of 5 bytes is encoded as 3.
 * XYZ: 3-bit value that indicates how to decode the match offset
 **optional extra literal length**
 If the literals length is 3 or more, the 'L' bits in the token form the value 3, and an extra nibble is read:
 * 0-14: the value is added to the 3 stored in the token, to compose the final literals length.
 * 15: an extra byte follows
 If an extra byte follows, it can have two possible types of value:
 * 3-255: the value is the final literals length. For instance a length of 206 will be stored as 3 in the token + a nibble with the value of 15 + a single byte with the value of 206.
 * 0: a second and third byte follow, forming a little-endian 16-bit value. The final literals length is that 16-bit value. For instance, a literals length of 1027 is stored as 3 in the token, a nibble with the value of 15, then byte values of 0, 3 and 4, as 3 + (4 * 256) = 1027.
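A hedged Python sketch of the LZSA2 literals length rules follows. The function name and the `read_nibble` / `read_byte` callables are hypothetical stand-ins for whatever input routines a decoder uses. Rejecting the byte values 1 and 2 is an assumption, since the text only defines 0 and 3-255.

```python
def read_literals_length_v2(token, read_nibble, read_byte):
    """Return the literals length of one LZSA2 command."""
    length = (token >> 3) & 0x03       # the two L bits (bits 3-4)
    if length < 3:
        return length                  # short form: nothing else to read
    nibble = read_nibble()
    if nibble < 15:
        return 3 + nibble              # 0-14: added to the 3 in the token
    extra = read_byte()
    if extra == 0:                     # two more bytes: little-endian 16-bit
        low = read_byte()
        high = read_byte()
        return low | (high << 8)
    if extra < 3:                      # assumption: 1 and 2 are invalid
        raise ValueError("undefined literals length byte: %d" % extra)
    return extra                       # 3-255: the byte is the final length
```

With the byte sequence 0, 3, 4 from the last example, the 16-bit branch yields 3 + (4 * 256) = 1027.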
 **literal values**
 Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.
 Important note: for blocks that are part of a stream, the last command in a block ends here, as it always contains literals only. For raw blocks, the last command does contain the match offset and match length, see the note below for EOD detection.
 **match offset**
 The match offset is decoded according to the XYZ bits in the token:
    XYZ
    00Z 5-bit offset: read a nibble for offset bits 0-3 and use bit Z of the token as bit 4 of the offset. set bits 5-15 of the offset to 1.
    01Z 9-bit offset: read a byte for offset bits 0-7 and use bit Z for bit 8 of the offset. set bits 9-15 of the offset to 1.
    10Z 13-bit offset: read a byte for offset bits 0-7, read a nibble for offset bits 8-11 and use bit Z for bit 12 of the offset. set bits 13-15 of the offset to 1.
    110 16-bit offset: read a byte for offset bits 0-7, then another byte for offset bits 8-15.
    111 repeat offset: reuse the offset value of the previous match command.
 **important note regarding match offsets: stored as negative values**
 Note that the match offset is negative: it is added to the current decompressed location, not subtracted from it, in order to locate the back-reference to copy. For this reason, as already indicated, unexpressed offset bits are set to 1 instead of 0.
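The XYZ table can be sketched in Python as below. The function name, the `read_nibble` / `read_byte` callables and the `last_offset` parameter are hypothetical. Two assumptions are made: in the 13-bit case the nibble supplies bits 8-11 (a nibble holds four bits, and Z supplies bit 12), and the 16-bit value is turned into a negative Python integer by subtracting 0x10000.

```python
def decode_match_offset(token, read_nibble, read_byte, last_offset):
    """Return the (negative) match offset of one LZSA2 command."""
    xyz = token >> 5                   # bits 5-7 of the token
    z = xyz & 1
    mode = xyz >> 1                    # the XY bits
    if mode == 0:                      # 00Z: 5-bit offset
        value = read_nibble() | (z << 4) | 0xFFE0
    elif mode == 1:                    # 01Z: 9-bit offset
        value = read_byte() | (z << 8) | 0xFE00
    elif mode == 2:                    # 10Z: 13-bit offset
        low = read_byte()
        high = read_nibble()           # assumed to hold bits 8-11
        value = low | (high << 8) | (z << 12) | 0xE000
    elif xyz == 6:                     # 110: 16-bit offset
        low = read_byte()
        high = read_byte()
        value = low | (high << 8)
    else:                              # 111: repeat the previous offset
        return last_offset
    return value - 0x10000             # unexpressed bits were set to 1
```

Setting the unexpressed high bits to 1 before the subtraction is what keeps every short form negative: a 5-bit offset always lands between -32 and -1.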
 **optional extra encoded match length**
 If the encoded match length is 7 or more, the 'M' bits in the token form the value 7, and an extra nibble is read:
 * 0-14: the value is added to the 7 stored in the token, and then the minmatch of 2 is added, to compose the final match length.
 * 15: an extra byte follows
 If an extra byte follows here, it can have two possible types of value:
 * 2-255: the final match length is this byte.
 * 0: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value.
 # End Of Data detection for raw blocks
 When the LZSA2 block is part of a stream (see StreamFormat.md), as previously mentioned, the block ends after the literal values of the last command, without a match offset or match length.
 However, in a raw LZSA2 block, the last command does include a 9-bit match offset and a match length. The match length is encoded as a long zero: the 'M' bits in the token form the value 7, then a nibble with the value of 15 is present, then an extra match length byte with the value of 0 ("two match length bytes follow"). Finally, a two-byte zero match length follows, indicating the end of the block. EOD is the only time a zero match length (which normally would indicate a copy of 2 bytes) is encoded as a large 2-byte match value. This allows the EOD test to exist in a rarely used code branch.
 # Reading nibbles
 When the specification indicates that a nibble (4-bit value) must be read:
 * If there are no nibbles ready, read a byte immediately. Return the high 4 bits (bits 4-7) as the nibble and store the low 4 bits for later. Flag that a nibble is ready for next time.
 * If a nibble is ready, return the previously stored low 4 bits (bits 0-3) and flag that no nibble is ready for next time.
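The two rules above amount to a small piece of state kept next to the input position. A minimal Python sketch, with a hypothetical `NibbleReader` class that is not part of the LZSA sources:

```python
class NibbleReader:
    """Reads whole bytes and 4-bit nibbles from the same input buffer."""

    def __init__(self, data, pos=0):
        self.data = data
        self.pos = pos
        self.pending = None            # stored low nibble, or None if not ready

    def read_byte(self):
        value = self.data[self.pos]
        self.pos += 1
        return value

    def read_nibble(self):
        if self.pending is None:
            # No nibble ready: read a byte, return its high 4 bits
            # and keep the low 4 bits for the next call.
            value = self.read_byte()
            self.pending = value & 0x0F
            return value >> 4
        # A nibble is ready: return it and clear the flag.
        value = self.pending
        self.pending = None
        return value
```

Note that a stored nibble survives intervening byte reads: reading a nibble, then a byte, then another nibble from `0xAB 0xCD` yields `0xA`, `0xCD` and `0xB`.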
--- a/README.md
+++ b/README.md
@@ -1,137 +1,61 @@
-LZSA is a byte-aligned compression format that is specifically engineered for very fast decompression on 8-bit systems. It can compress files of any size by using blocks of a maximum size of 64 Kb with block-interdependent compression and up to 64 Kb of back-references for matches.
+LZSA is a collection of byte-aligned compression formats that are specifically engineered for very fast decompression on 8-bit systems. It can compress files of any size by using blocks of a maximum size of 64 Kb with block-interdependent compression and up to 64 Kb of back-references for matches.
 The LZSA compression tool uses an aggressive optimal packing strategy to try to find the sequence of commands that gives the smallest packed file that decompresses to the original while maintaining the maximum possible decompression speed.
 The compression formats give the user choices that range from decompressing faster than LZ4 on 8-bit systems with better compression, to compressing as well as ZX7 with much better decompression speed. LZSA1 is designed to replace LZ4 and LZSA2 to replace ZX7, in 8-bit scenarios.
 Compression ratio comparison between LZSA and other optimal packers, for a workload composed of ZX Spectrum and C64 files:
                         Bytes            Ratio            Decompression speed vs. LZ4
    LZSA2                685610           53,18% <------   75%                
    ZX7                  687133           53,30%           47,73%
    LZ5 1.4.1            727107           56,40%           75%
-    LZSA                 736169           57,11% <------   90%
+    LZSA1                736169           57,11% <------   90%
    Lizard -29           776122           60,21%           Not measured
    LZ4_HC -19 -B4 -BD   781049           60,59%           100%
    Uncompressed         1289127          100%             N/A
 Performance over well-known compression corpus files:
-                         Uncompressed     LZ4_HC -19 -B4 -BD    LZSA
+                         Uncompressed     LZ4_HC -19 -B4 -BD    LZSA1                LZSA2
-    Canterbury           2810784          935827 (33,29%)       855044 (30,42%)
+    Canterbury           2810784          935827 (33,29%)       855044 (30,42%)      789075 (28,07%)
-    Silesia              211938580        77299725 (36,47%)     73707039 (34,78%)
+    Silesia              211938580        77299725 (36,47%)     73707039 (34,78%)    69983184 (33,02%)
-    Calgary              3251493          1248780 (38,40%)      1196448 (36,80%)
+    Calgary              3251493          1248780 (38,40%)      1196448 (36,80%)     1125462 (34,61%)
-    Large                11159482         3771025 (33,79%)      3648420 (32,69%)
+    Large                11159482         3771025 (33,79%)      3648420 (32,69%)     3528725 (31,62%)
-    enwik9               1000000000       371841591 (37,18%)    355360717 (35,54%)
+    enwik9               1000000000       371841591 (37,18%)    355360717 (35,54%)   337063553 (33,71%)
-As an example of LZSA's simplicity, a size-optimized decompressor on Z80 has been implemented in 69 bytes.
+As an example of LZSA1's simplicity, a size-optimized decompressor on Z80 has been implemented in 69 bytes.
 The compressor is approximately 2X slower than LZ4_HC but compresses better while maintaining similar decompression speeds and decompressor simplicity.
-The main differences with the LZ4 compression format are:
+The main differences between LZSA1 and the LZ4 compression format are:
 * The use of short (8-bit) match offsets where possible. The match-finder and optimizer cooperate to try and use the shortest match offsets possible.
 * Shorter encoding of lengths. As blocks are maximum 64 Kb in size, lengths can only be up to 64 Kb.
 * As a result of the smaller commands due to the possibly shorter match offsets, a minimum match size of 3 bytes instead of 4. The use of small matches is driven by the optimizer, and used where they provide gains.
 As for LZSA2:
 * 5-bit, 9-bit, 13-bit and 16-bit match offsets, using nibble encoding
 * Shorter encoding of lengths, also using nibbles
 * A minmatch of 2 bytes
 * No (slow) bit-packing. LZSA2 uses byte alignment in the hot path, and nibbles.
 Inspirations:
 * [LZ4](https://github.com/lz4/lz4) by Yann Collet.
 * [LZ5/Lizard](https://github.com/inikep/lizard) by Przemyslaw Skibinski and Yann Collet.
 * The suffix array intervals in [Wimlib](https://wimlib.net/git/?p=wimlib;a=tree) by Eric Biggers.
 * ZX7 by Einar Saukas
 License:
 * The LZSA code is available under the Zlib license.
 * The match finder (matchfinder.c) is available under the CC0 license due to using portions of code from Eric Biggers' Wimlib in the suffix array-based matchfinder.
-# Stream format
+# Compressed format
-The stream format is composed of:
+Decompression code is provided for common 8-bit CPUs such as Z80 and 6502. However, if you would like to write your own, or understand the encoding, LZSA compresses data to a format that is fast and simple to decompress on 8-bit CPUs. It is encoded in either a stream of blocks, or as a single raw block, depending on command-line settings. The encoding is deliberately designed to avoid operations that are complicated on 8-bit CPUs (such as 16-bit math).
 * a header
 * one or more frames
 * a footer
-# Header format
+* [Stream format](https://github.com/emmanuel-marty/lzsa/StreamFormat.md)
-
+* [Block encoding for LZSA1](https://github.com/emmanuel-marty/lzsa/BlockFormat_LZSA1.md)
-The 3-bytes header contains a signature and a traits byte:
+* [Block encoding for LZSA2](https://github.com/emmanuel-marty/lzsa/BlockFormat_LZSA2.md)
    0    1                2
    0x7b 0x9e             0x00
    <--- signature --->   <- traits ->
 The traits are set to 0x00 for this version of the format.
 # Frame format
 Each frame contains a 3-bytes length followed by block data that expands to up to 64 Kb of decompressed data.
    0    1    2
    DSZ0 DSZ1 U|DSZ2
 * DSZ0 (length byte 0) contains bits 0-7 of the block data size
 * DSZ1 (length byte 1) contains bits 8-15 of the block data size
 * DSZ2 (bit 0 of length byte 2) contains bit 16 of the block data size
 * U (bit 7 of length byte 2) is set if the block data is uncompressed, and clear if the block data is compressed.
 * Bits 1..6 of length byte 2 are currently undefined and must be set to 0.
 # Block data format
 LZSA blocks are composed from consecutive commands. Each command follows this format:
 * token: <O|LLL|MMMM>
 * optional extra literal length
 * literal values
 * match offset low
 * optional match offset high
 * optional extra encoded match length
 **token**
 The token byte is broken down into three parts:
    7 6 5 4 3 2 1 0
    O L L L M M M M
 * L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'
 * M: 4-bit encoded match length (0-14, or 15 if extended). Likewise, if the encoded match length for this command is 0 to 14, it is directly stored, otherwise 15 is stored and extra bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length offset by the minimum, which is 3 bytes. For instance, an actual match length of 10 bytes to be copied, is encoded as 7.
 * O: set for a 2-bytes match offset, clear for a 1-byte match offset
 **optional extra literal length**
 If the literals length is 7 or more, the 'L' bits in the token form the value 7, and an extra byte follows here, with three possible types of value:
 * 0-248: the value is added to the 7 stored in the token, to compose the final literals length. For instance a length of 206 will be stored as 7 in the token + a single byte with the value of 199, as 7 + 199 = 206.
 * 250: a second byte follows. The final literals value is 256 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value of 250, and a final byte with the value of 243, as 256 + 243 = 499.
 * 249: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 249, 0 and 4, as (4 * 256) = 1024.
 The extension byte values are chosen so that all three cases can be detected on 8-bit CPUs with a simple addition and overflow check.
 **literal values**
 Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.
 Important note: the last command in a block ends here, as it always contains literals only.
 **match offset low**
 The low 8 bits of the match offset follow.
 **optional match offset high**
 If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 1. For instance, a short offset of 0x70 is interpreted as 0xff70.
 **important note regarding match offsets: stored as negative values**
 Note that the match offset is negative: it is added to the current decompressed location, not subtracted from it, in order to locate the back-reference to copy.
 **optional extra encoded match length**
 If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
 * 0-237: the value is added to the 15 stored in the token. The final value is 3 + 15 + this byte.
 * 239: a second byte follows. The final match length is 256 + the second byte.
 * 238: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value.
 Again, the extension byte values are chosen so that all cases can be detected with a simple addition and overflow check on 8-bit CPUs.
 # Footer format
 The stream ends with the EOD frame: the 3 length bytes are set to 0x00, 0x00, 0x00, and no block data follows.
--- a/StreamFormat.md
+++ b/StreamFormat.md
@@ -0,0 +1,39 @@
 # Stream format
 The stream format is composed of:
 * a header
 * one or more frames
 * a footer
 # Header format
 The 3-byte LZSA header contains a signature and a traits byte:
    0    1                2
    0x7b 0x9e             7 6 5 4 3 2 1 0
                          V V V Z Z Z Z Z
    <--- signature --->   <- traits ->
 Trait bits:
 * V: 3-bit code that indicates which block data encoding is used. 0 is LZSA1 and 2 is LZSA2.
 * Z: these bits in the traits are set to 0 for LZSA1 and LZSA2.
 # Frame format
 Each frame contains a 3-byte length followed by block data that expands to up to 64 Kb of decompressed data. The block data is encoded either as LZSA1 or LZSA2 depending on the V bits of the traits byte in the header.
    0    1    2
    DSZ0 DSZ1 U|DSZ2
 * DSZ0 (length byte 0) contains bits 0-7 of the block data size
 * DSZ1 (length byte 1) contains bits 8-15 of the block data size
 * DSZ2 (bit 0 of length byte 2) contains bit 16 of the block data size
 * U (bit 7 of length byte 2) is set if the block data is uncompressed, and clear if the block data is compressed.
 * Bits 1..6 of length byte 2 are currently undefined and must be set to 0.
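The frame length fields can be unpacked as in the following Python sketch. The function name `parse_frame_header` is hypothetical, and treating non-zero undefined bits as an error is an assumption. The text only says they must be set to 0.

```python
def parse_frame_header(header):
    """Return (block_data_size, is_uncompressed) from a 3-byte frame header."""
    if len(header) != 3:
        raise ValueError("a frame header is exactly 3 bytes")
    dsz0, dsz1, byte2 = header
    if byte2 & 0x7E:                       # bits 1..6 must be 0
        raise ValueError("undefined bits set in length byte 2")
    size = dsz0 | (dsz1 << 8) | ((byte2 & 0x01) << 16)   # 17-bit size
    is_uncompressed = bool(byte2 & 0x80)   # the U bit
    return size, is_uncompressed
```

A header of `0x00 0x00 0x00` parses to a size of 0, which is how the footer described below is recognised.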
 # Footer format
 The stream ends with the EOD frame: the 3 length bytes are set to 0x00, 0x00, 0x00, and no block data follows.