mirror of
https://github.com/emmanuel-marty/lzsa.git
synced 2024-12-22 16:30:01 +00:00
# Block data format (LZSA1)


Blocks encoded as LZSA1 are composed of consecutive commands. Each command follows this format:


* token: <O|LLL|MMMM>
* optional extra literal length
* literal values
* match offset low
* optional match offset high
* optional extra encoded match length


**token**


The token byte is broken down into three parts:


    7 6 5 4 3 2 1 0
    O L L L M M M M


* L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'.
* M: 4-bit encoded match length (0-14, or 15 if extended). Likewise, if the encoded match length for this command is 0 to 14, it is stored directly in the token; otherwise 15 is stored and extra bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length minus the minimum match length of 3 bytes. For instance, a match that copies 10 bytes is encoded as 7.
* O: set for a 2-byte match offset, clear for a 1-byte match offset.
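
As a rough illustration of the token layout described above, the following Python sketch splits a token byte into its three fields. The names `split_token` and `MIN_MATCH` are hypothetical and chosen for this example only; they are not taken from the LZSA sources.

```python
MIN_MATCH = 3  # minimum match length given in the description of the M field


def split_token(token: int) -> tuple:
    """Split an LZSA1 token byte into its (O, L, M) fields."""
    if not 0 <= token <= 0xFF:
        raise ValueError("token must be a single byte")
    long_offset = (token >> 7) & 0x01  # O: 1 = 2-byte offset, 0 = 1-byte offset
    literals = (token >> 4) & 0x07     # L: 0-6 stored directly, 7 = extended
    encoded_match = token & 0x0F       # M: 0-14 stored directly, 15 = extended
    return long_offset, literals, encoded_match


# Token 1|010|0111: 2-byte offset, 2 literals, encoded match length 7.
o, l, m = split_token(0b1_010_0111)
print(o, l, m, m + MIN_MATCH)  # prints: 1 2 7 10
```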


**optional extra literal length**


If the literals length is 7 or more, the 'L' bits in the token form the value 7, and an extra byte follows here, with three possible types of value:


* 0-248: the value is added to the 7 stored in the token to compose the final literals length. For instance, a length of 206 is stored as 7 in the token plus a single byte with the value 199, as 7 + 199 = 206.
* 250: a second byte follows. The final literals length is 256 + the second byte. For instance, a literals length of 499 is encoded as 7 in the token, a byte with the value 250, and a final byte with the value 243, as 256 + 243 = 499.
* 249: a second and third byte follow, forming a little-endian 16-bit value. The final literals length is that 16-bit value. For instance, a literals length of 1024 is stored as 7 in the token, then byte values of 249, 0 and 4, as 0 + (4 * 256) = 1024.


The extension byte values are chosen so that all three cases can be detected on 8-bit CPUs with a simple addition and overflow check.
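
One way to read the remark about addition and overflow is that a decoder adds 7 to the extra byte in an 8-bit register: values 0-248 produce no carry, 249 wraps to 0, and 250 wraps to 1. The Python sketch below emulates that reading; it is an interpretation, not a transcription of any actual decompressor. `read_literals_length`, `data` and `pos` are illustrative names, and rejecting the bytes 251-255 is a choice made here because the text does not define them.

```python
def read_literals_length(token_l: int, data: bytes, pos: int):
    """Return (literals_length, new_pos) for a command whose token L field is token_l."""
    if token_l < 7:
        return token_l, pos          # length fits in the token, no extra bytes
    total = data[pos] + 7            # emulate an 8-bit addition of 7
    pos += 1
    if total <= 0xFF:                # no carry: the extra byte was 0-248
        return total, pos
    wrapped = total & 0xFF           # carry set: what an 8-bit register would hold
    if wrapped == 0:                 # extra byte was 249: 16-bit little-endian length
        return data[pos] | (data[pos + 1] << 8), pos + 2
    if wrapped == 1:                 # extra byte was 250: 256 + next byte
        return 256 + data[pos], pos + 1
    # Assumption: 251-255 are treated as invalid since the format text omits them.
    raise ValueError("undefined extra literal length byte")


# The three worked examples from the list above.
print(read_literals_length(7, bytes([199]), 0))        # prints: (206, 1)
print(read_literals_length(7, bytes([250, 243]), 0))   # prints: (499, 2)
print(read_literals_length(7, bytes([249, 0, 4]), 0))  # prints: (1024, 3)
```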


**literal values**


Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.


Important note: for blocks that are part of a stream, the last command in a block ends here, as it always contains literals only. For raw blocks, the last command does contain a match offset and a match length; see the End Of Data detection section below.


**match offset low**


The low 8 bits of the match offset follow.


**optional match offset high**


If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset follow, otherwise they are understood to be all set to 1. For instance, a short offset of 0x70 is interpreted as 0xff70.


**important note regarding match offsets: stored as negative values**


Note that the match offset is negative: to locate the back-reference to copy, it is added to the current decompressed location rather than subtracted from it.
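
The offset rules can be sketched in Python as follows. The sketch assumes the stored 16-bit value is always interpreted as a negative two's-complement number, so a decoder working with unbounded integers subtracts 0x10000 to obtain the signed offset. `read_match_offset`, `data` and `pos` are hypothetical names used only for this example.

```python
def read_match_offset(long_offset: int, data: bytes, pos: int):
    """Return (negative_offset, new_pos) for the match offset starting at data[pos]."""
    low = data[pos]                  # match offset low, always present
    pos += 1
    if long_offset:                  # O bit set: an explicit high byte follows
        high = data[pos]
        pos += 1
    else:                            # O bit clear: high bits are all ones
        high = 0xFF
    raw = (high << 8) | low          # 16-bit value as stored
    # Assumption: the value is always negative, so sign-extend it unconditionally.
    return raw - 0x10000, pos


# The short offset 0x70 from the text is read as 0xff70, i.e. 144 bytes back.
offset, _ = read_match_offset(0, bytes([0x70]), 0)
print(offset, hex(offset & 0xFFFF))  # prints: -144 0xff70

# The copy source is found by adding the offset to the current output position.
current_position = 1000
print(current_position + offset)     # prints: 856
```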


**optional extra encoded match length**


If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value:


* 0-237: the value is added to the 15 stored in the token. The final match length is 3 + 15 + this byte.
* 239: a second byte follows. The final match length is 256 + the second byte.
* 238: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value.


Again, the extension byte values are chosen so that all cases can be detected with a simple addition and overflow check on 8-bit CPUs.
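
A minimal Python sketch of the match length rules follows. The text calls the 16-bit form an 'encoded' length; this sketch assumes the 16-bit value is used directly as the number of bytes to copy, which matches the 239 case and lets a 16-bit zero act as a marker, but that reading is an assumption. `read_match_length`, `data` and `pos` are illustrative names, and the bytes 240-255 are rejected here because the text does not define them.

```python
MIN_MATCH = 3  # minimum match length


def read_match_length(token_m: int, data: bytes, pos: int):
    """Return (match_length, new_pos) for a command whose token M field is token_m."""
    if token_m < 15:
        return token_m + MIN_MATCH, pos          # stored entirely in the token
    extra = data[pos]
    pos += 1
    if extra <= 237:                             # 3 + 15 + extra byte
        return MIN_MATCH + 15 + extra, pos
    if extra == 239:                             # 256 + second byte
        return 256 + data[pos], pos + 1
    if extra == 238:                             # 16-bit little-endian value
        # Assumption: the value is taken as-is, without adding MIN_MATCH.
        return data[pos] | (data[pos + 1] << 8), pos + 2
    raise ValueError("undefined extra match length byte")


print(read_match_length(7, b"", 0))                   # prints: (10, 0)
print(read_match_length(15, bytes([237]), 0))         # prints: (255, 1)
print(read_match_length(15, bytes([239, 10]), 0))     # prints: (266, 2)
print(read_match_length(15, bytes([238, 0, 4]), 0))   # prints: (1024, 3)
```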


# End Of Data detection for raw blocks


When the LZSA1 block is part of a stream (see StreamFormat.md), as previously mentioned, the block ends after the literal values of the last command, without a match offset or match length.


However, in a raw LZSA1 block, the last command does include a 1-byte match offset (set to zero) and a match length. The match length is encoded as a long zero: the 'M' bits in the token form the value 15, then an extra match length byte is present, with the value 238 ("two match length bytes follow"). Finally, a two-byte zero match length follows, indicating the end of the block. EOD is the only time a zero match length (which normally would indicate a copy of 3 bytes) is encoded as a large 2-byte match value. This allows the EOD test to exist in a rarely used code branch.
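
To show where the EOD test sits, here is a compact Python sketch of a raw block decoder written from the description in this document. It is not the reference implementation: `decompress_raw_block` is a hypothetical name, the 16-bit match length is assumed to be used as-is, undefined extension bytes are rejected, and truncated input simply raises `IndexError`. The EOD check appears only in the 16-bit match length branch.

```python
def decompress_raw_block(data: bytes) -> bytes:
    """Decode one raw LZSA1 block that ends with the EOD marker."""
    out = bytearray()
    pos = 0
    while True:
        token = data[pos]
        pos += 1

        # Literals: 3-bit length in the token, optionally extended.
        literals = (token >> 4) & 0x07
        if literals == 7:
            extra = data[pos]
            pos += 1
            if extra <= 248:
                literals = 7 + extra
            elif extra == 250:
                literals = 256 + data[pos]
                pos += 1
            elif extra == 249:
                literals = data[pos] | (data[pos + 1] << 8)
                pos += 2
            else:
                raise ValueError("undefined extra literal length byte")
        out += data[pos:pos + literals]
        pos += literals

        # Match offset: low byte, then an optional high byte (0xFF if absent).
        low = data[pos]
        pos += 1
        high = 0xFF
        if token & 0x80:
            high = data[pos]
            pos += 1
        offset = ((high << 8) | low) - 0x10000  # stored as a negative value

        # Match length: 4-bit encoded length in the token, optionally extended.
        length = (token & 0x0F) + 3
        if length == 18:                        # M bits were 15
            extra = data[pos]
            pos += 1
            if extra <= 237:
                length = 18 + extra
            elif extra == 239:
                length = 256 + data[pos]
                pos += 1
            elif extra == 238:
                length = data[pos] | (data[pos + 1] << 8)
                pos += 2
                if length == 0:                 # EOD: only tested in this rare branch
                    return bytes(out)
            else:
                raise ValueError("undefined extra match length byte")

        # Copy byte by byte so that overlapping matches repeat correctly.
        src = len(out) + offset
        if src < 0:
            raise ValueError("match offset points before the start of the output")
        for i in range(length):
            out.append(out[src + i])


# "abc" as literals, a 6-byte match at offset -3, then the EOD command:
# token 0x0F (no literals, M = 15), offset byte 0, 238, and a 16-bit zero.
block = bytes([0x33]) + b"abc" + bytes([0xFD]) + bytes([0x0F, 0x00, 238, 0x00, 0x00])
print(decompress_raw_block(block))  # prints: b'abcabcabc'
```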