Update README

This commit is contained in:
emmanuel-marty 2019-04-24 10:02:35 +02:00
parent 2b9780bd65
commit b7967c3aa1

View File

@ -1,4 +1,4 @@
LZSA is a byte-aligned compression format that is specifically engineered for very fast decompression on 8-bit systems. It can compress files of any size by using blocks of a maximum size of 64 Kb with block-interdependent compression and up to 64 Kb of back-references for matches.
LZSA is a byte-aligned compression format that is specifically engineered for very fast decompression on 8-bit systems. It can compress files of any size by using blocks of a maximum size of 64 Kb with block-interdependent compression and up to 64 Kb of back-references for matches.
The LZSA compression tool uses an aggressive optimal packing strategy to try to find the sequence of commands that gives the smallest packed file that decompresses to the original while maintaining the maximum possible decompression speed.
@ -7,7 +7,7 @@ Compression ratio comparison between LZSA and other optimal packers, for a workl
Bytes Ratio Decompression speed vs. LZ4
ZX7 687133 53,30% 47,73%
LZ5 1.4.1 727107 56,40% 75%
LZSA 736165 57,11% <------ 90%
LZSA 736171 57,11% <------ 90%
Lizard -29 776122 60,21% Not measured
LZ4_HC -19 -B4 -BD 781049 60,59% 100%
Uncompressed 1289127 100% N/A
@ -20,8 +20,8 @@ Performance over well-known compression corpus files:
Calgary 3251493 1248780 (38,40%) 1196448 (36,80%)
Large 11159482 3771025 (33,79%) 3648420 (32,69%)
enwik9 1000000000 371841591 (37,18%) 355360717 (35,54%)
As an example of LZSA's simplicity, a size-optimized decompressor on 8088 has been implemented in 92 bytes.
As an example of LZSA's simplicity, a size-optimized decompressor on Z80 has been implemented in 69 bytes.
The compressor is approximately 2X slower than LZ4_HC but compresses better while maintaining similar decompression speeds and decompressor simplicity.
@ -31,15 +31,15 @@ The main differences with the LZ4 compression format are:
* Shorter encoding of lengths. As blocks are maximum 64 Kb in size, lengths can only be up to 64 Kb.
* As a result of the smaller commands due to the possibly shorter match offsets, a minimum match size of 3 bytes instead of 4. The use of small matches is driven by the optimizer, and used where they provide gains.
Inspirations:
Inspirations:
* [LZ4](https://github.com/lz4/lz4) by Yann Collet.
* [LZ5/Lizard](https://github.com/inikep/lizard) by Przemyslaw Skibinski and Yann Collet.
* [LZ4](https://github.com/lz4/lz4) by Yann Collet.
* [LZ5/Lizard](https://github.com/inikep/lizard) by Przemyslaw Skibinski and Yann Collet.
* The suffix array intervals in [Wimlib](https://wimlib.net/git/?p=wimlib;a=tree) by Eric Biggers.
License:
* The LZSA code is available under the Zlib license.
* The LZSA code is available under the Zlib license.
* The compressor (shrink.c) is available under the CC0 license due to using portions of code from Eric Bigger's Wimlib in the suffix array-based matchfinder.
# Stream format
@ -98,9 +98,11 @@ The token byte is broken down into three parts:
If the literals length is 7 or more, the 'L' bits in the token form the value 7, and an extra byte follows here, with three possible types of value:
* 0-253: the value is added to the 7 stored in the token, to compose the final literals length. For instance a length of 206 will be stored as 7 in the token + a single byte with the value of 199, as 7 + 199 = 206.
* 254: a second byte follows. The final literals value is 7 + 254 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value of 254, and a final byte with teh value of 238, as 7 + 254 + 238 = 499.
* 255: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 255, 0 and 4, as (4 * 256) = 1024.
* 0-248: the value is added to the 7 stored in the token, to compose the final literals length. For instance a length of 206 will be stored as 7 in the token + a single byte with the value of 199, as 7 + 199 = 206.
* 250: a second byte follows. The final literals value is 256 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value of 250, and a final byte with the value of 243, as 256 + 243 = 499.
* 249: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 249, 0 and 4, as (4 * 256) = 1024.
The extension byte values are chosen so that all three cases can be detected on 8-bit CPUs with a simple addition and overflow check.
**literal values**
@ -110,23 +112,25 @@ Important note: the last command in a block ends here, as it always contains lit
**match offset low**
The low 8 bits of the match offset follows.
The low 8 bits of the match offset follows.
**optional match offset high**
If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 0.
If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 1. For instance, a short offset of 0x70 is interpreted as 0xff70.
**important note regarding match offsets: off by 1**
**important note regarding match offsets: stored as negative values**
Note that the match offset is *off by 1*: a value of 0 refers to the byte preceding the current output index (N-1). A value of 1 refers to two bytes before the current output index (N-2) and so on. This is so that match offsets up to 256 can be encoded as a single byte, for extra compression.
Note that the match offset is negative: it is added to the current decompressed location and not substracted, in order to locate the back-reference to copy.
**optional extra encoded match length**
If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
* 0-253: the value is added to the 15 stored in the token. The final value is 3 + 15 + this byte.
* 254: a second byte follows. The final encoded match length is 15 + 254 + the second byte, which gives an actual match length of 3 + 15 + 254 + the second byte.
* 255: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is 3 + that 16-bit value.
* 0-237: the value is added to the 15 stored in the token. The final value is 3 + 15 + this byte.
* 239: a second byte follows. The final match length is 256 + the second byte.
* 238: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value.
Again, the extension byte values are chosen so that all cases can be detected with a simple addition and overflow check on 8-bit CPUs.
# Footer format