Clarify the encoding of matches, fix some broken formatting.

2025-04-20 04:37:09 +00:00 · 2019-04-01 21:00:07 +02:00 · 2019-04-01 21:00:07 +02:00 · fde853e095
commit fde853e095
parent e216b0c544
1 changed file with 13 additions and 11 deletions
--- a/README.md
+++ b/README.md
@@ -66,12 +66,12 @@ Each frame contains a 3-byte length followed by block data that expands to up to

 LZSA blocks are composed from consecutive commands. Each command follows this format:

-* <token: O|LLL|MMMM>
-* <optional extra literal length>
-* <literal values> 
-* <match offset low>
-* <optional match offset high>
-* <optional extra match length>
+* token: <O|LLL|MMMM>
+* optional extra literal length
+* literal values
+* match offset low
+* optional match offset high
+* optional extra encoded match length

 **token**

@@ -82,7 +82,7 @@ The token byte is broken down into three parts:

 * O: set for a 2-byte match offset, clear for a 1-byte match offset
 * L: 3-bit literals length (0-6, or 7 if extended). If the number of literals for this command is 0 to 6, the length is encoded in the token and no extra bytes are required. Otherwise, a value of 7 is encoded and extra bytes follow as 'optional extra literal length'
-* M: 4-bit match length (0-14, or 15 if extended). Likewise, if the match length for this command is 0 to 14, it is directly encoded, otherwise 15 is stored and extra bytes follow as 'optional extra match length'.
+* M: 4-bit encoded match length (0-14, or 15 if extended). Likewise, if the encoded match length for this command is 0 to 14, it is stored directly; otherwise 15 is stored and extra bytes follow as 'optional extra encoded match length'. Except for the last command in a block, a command always contains a match, so the encoded match length is the actual match length minus the minimum match length of 3 bytes. For instance, an actual match length of 10 bytes is encoded as 7.
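
The token layout described above can be sketched in a few lines of Python. This is only an illustration, not code from the LZSA sources: `split_token` and `MIN_MATCH` are hypothetical names, and the sketch assumes that `O|LLL|MMMM` is written most significant bit first, so that O is bit 7, L occupies bits 4-6 and M occupies bits 0-3.

```python
MIN_MATCH = 3  # minimum match length stated in the text; the name is hypothetical


def split_token(token):
    """Split a token byte laid out as O|LLL|MMMM into its three fields."""
    o_bit = (token >> 7) & 0x01      # 1 = 2-byte match offset, 0 = 1-byte
    literals = (token >> 4) & 0x07   # 0-6 stored directly, 7 = extra bytes follow
    encoded_match = token & 0x0F     # 0-14 stored directly, 15 = extra bytes follow
    return o_bit, literals, encoded_match


# Example token: O=1, L=2, M=7
o_bit, literals, encoded_match = split_token(0b1_010_0111)
# An encoded match length of 7 corresponds to an actual match of 10 bytes.
actual = encoded_match + MIN_MATCH
```

The last command of a block carries literals only, so a real decoder would skip the `MIN_MATCH` adjustment there.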

 **optional extra literal length**

@@ -96,6 +96,8 @@ If the literals length is 7 or more, the 'L' bits in the token form the value 7,

 Literal bytes, whose number is specified by the literals length, follow here. There can be zero literals in a command.

+Important note: the last command in a block ends here, as it always contains literals only.
+
 **match offset low**

 The low 8 bits of the match offset follow.
@@ -108,13 +110,13 @@ If the 'O' bit (bit 7) is set in the token, the high 8 bits of the match offset

 Note that the match offset is *off by 1*: a value of 0 refers to the byte preceding the current output index (N-1). A value of 1 refers to two bytes before the current output index (N-2) and so on. This is so that match offsets up to 256 can be encoded as a single byte, for extra compression.

-**optional extra match length**
+**optional extra encoded match length**

-If the match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.
+If the encoded match length is 15 or more, the 'M' bits in the token form the value 15, and an extra byte follows here, with three possible types of value.

 * 0-253: the value is added to the 15 stored in the token.
-* 254: a second byte follows. The final match length is 15 + 254 + the second byte.
-* 255: a second and third byte follow, forming a little-endian 16-bit value. The final match length is 15 + 255 + that 16-bit value.
+* 254: a second byte follows. The final encoded match length is 15 + 254 + the second byte, which gives an actual match length of 3 + 15 + 254 + the second byte.
+* 255: a second and third byte follow, forming a little-endian 16-bit value. The final encoded match length is 15 + 255 + that 16-bit value, which gives an actual match length of 3 + 15 + 255 + that 16-bit value.
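
As a rough illustration of the rules above, the following Python sketch turns the token's 'M' bits and any extra bytes into an actual match length. It is not taken from the LZSA sources: `decode_match_length`, `MIN_MATCH`, and the `(data, pos)` calling convention are hypothetical, and it assumes that extra byte values below 254 are added directly, since 254 and 255 are the escape values.

```python
MIN_MATCH = 3  # minimum match length from the token description; hypothetical name


def decode_match_length(m_bits, data, pos):
    """Return (actual_match_length, new_pos) for a command that contains a match.

    m_bits is the 4-bit 'M' field of the token; data[pos:] holds any
    'optional extra encoded match length' bytes.
    """
    if m_bits < 15:
        # Encoded length fits in the token: no extra bytes.
        return MIN_MATCH + m_bits, pos
    extra = data[pos]
    pos += 1
    if extra < 254:
        # Assumption: values below the two escape codes are added directly.
        return MIN_MATCH + 15 + extra, pos
    if extra == 254:
        # A second byte follows.
        return MIN_MATCH + 15 + 254 + data[pos], pos + 1
    # extra == 255: a little-endian 16-bit value follows.
    value = data[pos] | (data[pos + 1] << 8)
    return MIN_MATCH + 15 + 255 + value, pos + 2
```

For instance, `decode_match_length(7, b"", 0)` yields an actual match length of 10, matching the example given for the token.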

 # Footer format