Clarify the encoding of matches, fix some broken formatting.

Emmanuel Marty 2019-04-01 21:00:07 +02:00 committed by GitHub
parent e216b0c544
commit fde853e095


@ -66,12 +66,12 @@ Each frame contains a 3-byte length followed by block data that expands to up to
LZSA blocks are composed from consecutive commands. Each command follows this format:
-* <token: O|LLL|MMMM>
-* <optional extra literal length>
-* <literal values>
-* <match offset low>
-* <optional match offset high>
-* <optional extra match length>
+* token: <O|LLL|MMMM>
+* optional extra literal length
+* literal values
+* match offset low
+* optional match offset high
+* optional extra encoded match length
**token**
@ -82,7 +82,7 @@ The token byte is broken down into three parts:
* O: set for a 2-byte match offset, clear for a 1-byte match offset
* L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'
-* M: 4-bit match length (0-14, or 15 if extended). Likewise, if the match length for this command is 0 to 14, it is directly encoded, otherwise 15 is stored and extra bytes follow as 'optional extra match length'.
+* M: 4-bit encoded match length (0-14, or 15 if extended). Likewise, if the encoded match length for this command is 0 to 14, it is stored directly; otherwise 15 is stored and extra bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length minus the minimum match length of 3 bytes. For instance, an actual match length of 10 bytes is encoded as 7.
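
As a rough illustration of the token layout described above, the following Python sketch splits a token byte into its three fields. The function names and the dictionary keys are hypothetical, and the bit positions of the L and M fields are inferred from the `O|LLL|MMMM` layout (O in bit 7, L in bits 6-4, M in bits 3-0) rather than stated explicitly in the text.

```python
MIN_MATCH = 3  # minimum match length given in the text


def parse_token(token: int) -> dict:
    """Split a token byte laid out as O|LLL|MMMM (assumed bit positions)."""
    return {
        # O: bit 7, set for a 2-byte match offset
        "two_byte_offset": bool(token & 0x80),
        # L: bits 6-4, literals length; 7 means extra length bytes follow
        "literals_field": (token >> 4) & 0x07,
        # M: bits 3-0, encoded match length; 15 means extra length bytes follow
        "match_field": token & 0x0F,
    }


def actual_match_length(encoded: int) -> int:
    """Convert an encoded match length back to the number of bytes to copy."""
    return encoded + MIN_MATCH
```

For example, the token `0xA7` (binary `1010 0111`) has the O bit set, a literals length of 2, and an encoded match length of 7, which corresponds to an actual match length of 10 bytes.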
**optional extra literal length**
@ -96,6 +96,8 @@ If the literals length is 7 or more, the 'L' bits in the token form the value 7,
Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.
+Important note: the last command in a block ends here, as it always contains literals only.
**match offset low**
The low 8 bits of the match offset follow.
@ -108,13 +110,13 @@ If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset
Note that the match offset is *off by 1*: a value of 0 refers to the byte preceding the current output index (N-1). A value of 1 refers to two bytes before the current output index (N-2), and so on. This is so that match offsets up to 256 can be encoded as a single byte, for extra compression.
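
The off-by-1 rule can be sketched in Python as follows. This is a minimal sketch, assuming the compressed data is available as a `bytes` object; `decode_match_offset`, `data`, and `pos` are hypothetical names, not part of the format.

```python
def decode_match_offset(data: bytes, pos: int, two_byte_offset: bool) -> tuple:
    """Read a match offset starting at data[pos].

    Returns (distance, new_pos), where distance is how many bytes back
    from the current output index the match starts.
    """
    value = data[pos]            # match offset low: always present
    pos += 1
    if two_byte_offset:          # 'O' bit set in the token
        value |= data[pos] << 8  # match offset high
        pos += 1
    # Stored value 0 means N-1, 1 means N-2, and so on.
    return value + 1, pos
```

With this reading, a single stored byte of `0xFF` yields a distance of 256, which is why offsets up to 256 fit in one byte.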
-**optional extra match length**
+**optional extra encoded match length**
-If the match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
+If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
* 0-253: the value is added to the 15 stored in the token.
-* 254: a second byte follows. The final match length is 15 + 254 + the second byte.
-* 255: a second and third byte follow, forming a little-endian 16-bit value. The final match length is 15 + 255 + that 16-bit value.
+* 254: a second byte follows. The final encoded match length is 15 + 254 + the second byte, which gives an actual match length of 3 + 15 + 254 + the second byte.
+* 255: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is 15 + 255 + that 16-bit value, which gives an actual match length of 3 + 15 + 255 + that 16-bit value.
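
The three cases above could be decoded along these lines. This is a sketch under stated assumptions: extra bytes 254 and 255 are treated as the escape values, so a plain extra byte covers 0 to 253; `decode_match_length`, `data`, and `pos` are hypothetical names.

```python
MIN_MATCH = 3  # minimum match length given in the text


def decode_match_length(match_field: int, data: bytes, pos: int) -> tuple:
    """Return (actual_match_length, new_pos) for a token's 4-bit M field.

    data[pos] is where the optional extra encoded match length would start.
    """
    encoded = match_field
    if match_field == 15:
        extra = data[pos]
        pos += 1
        if extra == 254:
            # One more byte follows.
            encoded = 15 + 254 + data[pos]
            pos += 1
        elif extra == 255:
            # A little-endian 16-bit value follows.
            encoded = 15 + 255 + int.from_bytes(data[pos:pos + 2], "little")
            pos += 2
        else:
            # Assumed range 0-253: added directly to the 15 in the token.
            encoded = 15 + extra
    return encoded + MIN_MATCH, pos
```

For instance, an M field of 15 followed by the bytes `254, 1` gives an encoded match length of 270 and an actual match length of 273.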
# Footer format