From a8e3f7bbcc2ffdab3b54a25e4fe772aa00fef70d Mon Sep 17 00:00:00 2001 From: Emmanuel Marty Date: Mon, 10 Jun 2019 14:08:53 +0200 Subject: [PATCH] Update to final LZSA2 block format --- BlockFormat_LZSA2.md | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/BlockFormat_LZSA2.md b/BlockFormat_LZSA2.md index ad1a9a3..0c40dd4 100644 --- a/BlockFormat_LZSA2.md +++ b/BlockFormat_LZSA2.md @@ -28,8 +28,8 @@ If the literals length is 3 or more, the 'L' bits in the token form the value 3, If an extra byte follows, it can have two possible types of value: -* 3-255: the value is the final literals length. For instance a length of 206 will be stored as 3 in the token + a nibble with the value of 15 + a single byte with the value of 206. -* 0: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is that 16-bit value. For instance, a literals length of 1027 is stored as 3 in the token, a nibble with the value of 15, then byte values of 0, 3 and 4, as 3 + (4 * 256) = 1027. +* 0-237: 18 is added to the value (3 from the token + 15 from the nibble), to compose the final literals length. For instance a length of 206 will be stored as 3 in the token + a nibble with the value of 15 + a single byte with the value of 188. +* 239: a second and third byte follow, forming a little-endian 16-bit value. The final literals value is that 16-bit value. For instance, a literals length of 1027 is stored as 3 in the token, a nibble with the value of 15, then byte values of 239, 3 and 4, as 3 + (4 * 256) = 1027. **literal values** @@ -42,12 +42,14 @@ Important note: for blocks that are part of a stream, the last command in a bloc The match offset is decoded according to the XYZ bits in the token XYZ - 00Z 5-bit offset: read a nibble for offset bits 0-3 and use bit Z of the token as bit 4 of the offset. set bits 5-15 of the offset to 1. - 01Z 9-bit offset: read a byte for offset bits 0-7 and use bit Z for bit 8 of the offset. set bits 9-15 of the offset to 1. - 10Z 13-bit offset: read a byte for offset bits 0-7, read a nibble for offset bits 8-12 and use bit Z for bit 12 of the offset. set bits 13-15 of the offset to 1. - 110 16-bit offset: read a byte for offset bits 0-7, then another byte for offset bits 8-15. + 00Z 5-bit offset: read a nibble for offset bits 1-4 and use the inverted bit Z of the token as bit 0 of the offset. set bits 5-15 of the offset to 1. + 01Z 9-bit offset: read a byte for offset bits 0-7 and use the inverted bit Z for bit 8 of the offset. set bits 9-15 of the offset to 1. + 10Z 13-bit offset: read a nibble for offset bits 9-12 and use the inverted bit Z for bit 8 of the offset, then read a byte for offset bits 0-7. set bits 13-15 of the offset to 1. + 110 16-bit offset: read a byte for offset bits 8-15, then another byte for offset bits 0-7. 111 repeat offset: reuse the offset value of the previous match command. +The bit ordering and inversion helps optimize the decoder for size and speed on 8-bit CPUs. + **important note regarding match offsets: stored as negative values** Note that the match offset is negative: it is added to the current decompressed location and not substracted, in order to locate the back-reference to copy. For this reason, as already indicated, unexpressed offset bits are set to 1 instead of 0. @@ -62,13 +64,23 @@ If the encoded match length is 7 or more, the 'M' bits in the token form the val If an extra byte follows here, it can have two possible types of value: * 2-255: the final match length is this byte. -* 0: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value. +* 0-231: 24 is added to the value (7 from the token + 15 from the nibble + minmatch of 2), to compose the final match length. For instance a length of 150 will be stored as 7 in the token + a nibble with the value of 15 + a single byte with the value of 126. +* 233: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is that 16-bit value. # End Of Data detection for raw blocks When the LZSA2 block is part of a stream (see StreamFormat.md), as previously mentioned, the block ends after the literal values of the last command, without a match offset or match length. -However, in a raw LZSA2 block, the last command does include a 9-bit match offset and a match length. The match length is encoded as a long zero: the 'M' bits in the token form the value 7, then a nibble with the value of 15 is present, then an extra match length byte with the value of 0 ("two match length bytes follow"). Finally, a two-byte zero match length follows, indicating the end of the block. EOD is the only time a zero match length (which normally would indicate a copy of 2 bytes) is encoded as a large 2-byte match value. This allows the EOD test to exist in a rarely used code branch. +However, in a raw LZSA2 block, the last command does include a 9-bit match offset (set to zero, to be ignored) and a EOD marker as the match length. The EOD match length marker is encoded as such: the 'M' bits in the token form the value 7, then a nibble with the value of 15 is present, then a single extra match length byte with the value of 232, indicating the end of the block. This allows the EOD test to exist in a rarely used code branch. + +The EOD condition can be easily checked as part of the tri-state condition when handling long matches. When 24 is added to the match byte value: +- If the byte doesn't overflow, the final match length is ready +- If the byte overflows and equals zero, the EOD marker has been hit +- Otherwise, if the overflows and doesn't equal zero, a 16-bit match length must be read. + +This tri-state test translates to only an addition and two branches on 8-bit CPUs. + +The equivalent EOD condition in literal lengths (which would be byte 238, that would overflow to exactly 0 when adding 18) is never emitted, so for size-optimized decompressors, the same code can be used to read both types of lengths. # Reading nibbles