millfork/docs/lang/custom-encoding.md

[< back to index](../doc_index.md)

### Defining custom encodings

Every encoding is defined in a `.tbl` file with an appropriate name.
The file is looked up in the directories on the include path, first directly, then in the `encoding` subdirectory.
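
The lookup order described above can be sketched as follows. This is only an illustration, not Millfork's actual implementation: the function name `find_encoding_file` is hypothetical, the file is assumed to be called `<name>.tbl`, and the sketch assumes that each include directory is tried directly and then in its `encoding` subdirectory before moving on to the next directory.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_encoding_file(name: str, include_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first matching .tbl file, or None if nothing is found.

    Assumption: the file name is `<name>.tbl`, and directories are searched
    in the order given.
    """
    for directory in include_dirs:
        base = Path(directory)
        # First the directory itself, then its `encoding` subdirectory.
        for candidate in (base / f"{name}.tbl", base / "encoding" / f"{name}.tbl"):
            if candidate.is_file():
                return candidate
    return None
```

Under these assumptions, a file placed directly in an include directory shadows a file of the same name in that directory's `encoding` subdirectory.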

The file is a UTF-8 text file, with each line having a specific meaning.
In the specifications below, the angle brackets `<>` mark placeholders and are not meant literally:

* lines starting with `#`, `;` or `//` are comments.

* `ALIAS=<another encoding name>` defines this encoding to be an alias for another encoding.
No other lines are allowed in the file.

* `NAME=<name>` defines the name for this encoding. Required.

* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
If this directive is present, the only other allowed directive in the file is the `NAME` directive.

* `EOT=<xx>`, where `<xx>` are two hex digits, defines the string terminator byte.
Required, unless `BUILTIN` is present.
Exactly two digits are required; `EOT=0` is invalid.

* lines like `<xx>=<c>` where `<xx>` are two hex digits
and `<c>` is either a **non-whitespace** character or a **BMP** Unicode codepoint written as `U+xxxx`,
define the byte `<xx>` to correspond to character `<c>`.
Exactly two digits are required; `0=@` is invalid.

* lines like `<xx>-<xx>=<c><c><c><c>`, where `<c>` is repeated once for each byte value in the range,
define characters for multiple byte values.
In lines of this kind, characters cannot be written as Unicode codepoints.

* lines like `<c>=<xx>`, `<c>=<xx><xx>` etc.
define secondary or alternate characters that are going to be represented as one or more bytes.
Each byte has to be written with exactly two digits; `@=0` is invalid.
Problematic characters (space, `=`, `#`, `;`) can be written as Unicode codepoints `U+xxxx`.

* a line like `a-z=<xx>` is equivalent to the lines `a=<xx>`, `b=<xx+$01>`, and so on up to `z=<xx+$19>`.

* a line like `KATAKANA=>DECOMPOSE` means that katakana characters with dakuten or handakuten
should be split into the base character and the standalone dakuten/handakuten.

* similarly with `HIRAGANA=>DECOMPOSE`.

* lines like `{<escape code>}=<xx>`, `{<escape code>}=<xx><xx>` etc.
define escape codes. It is good practice to define the following when possible:
`{q}`, `{apos}`, `{n}`, `{lbrace}`, `{rbrace}`,
`{yen}`, `{pound}`, `{cent}`, `{euro}`, `{copy}`, `{pi}`,
`{nbsp}`, `{shy}`.
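
To make the line formats above concrete, here is a minimal sketch of a reader for a subset of them. It is not Millfork's own parser. The names `Encoding`, `parse_tbl` and `_char` are hypothetical, and several behaviours are assumptions: blank lines are skipped, later lines overwrite earlier ones for the same character, and `ALIAS`, `BUILTIN` and the `=>DECOMPOSE` lines are not handled and are simply rejected.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

BYTE_TO_CHAR = re.compile(r"([0-9A-Fa-f]{2})=(U\+[0-9A-Fa-f]{4}|\S)")
BYTE_RANGE = re.compile(r"([0-9A-Fa-f]{2})-([0-9A-Fa-f]{2})=(.+)")
CHAR_TO_BYTES = re.compile(r"(U\+[0-9A-Fa-f]{4}|\S)=((?:[0-9A-Fa-f]{2})+)")
CHAR_RANGE = re.compile(r"(\S)-(\S)=([0-9A-Fa-f]{2})")
ESCAPE = re.compile(r"\{(\w+)\}=((?:[0-9A-Fa-f]{2})+)")


@dataclass
class Encoding:
    name: Optional[str] = None
    eot: Optional[int] = None
    encode: Dict[str, bytes] = field(default_factory=dict)   # character -> byte(s)
    escapes: Dict[str, bytes] = field(default_factory=dict)  # escape code -> byte(s)


def _char(token: str) -> str:
    """Turn `U+xxxx` into the character it names; return other tokens unchanged."""
    if len(token) == 6 and token.startswith("U+"):
        return chr(int(token[2:], 16))
    return token


def parse_tbl(text: str) -> Encoding:
    enc = Encoding()
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: blank lines are ignored, like comments.
        if not line.strip() or line.startswith(("#", ";", "//")):
            continue
        if line.startswith("NAME="):
            enc.name = line[len("NAME="):]
        elif m := re.fullmatch(r"EOT=([0-9A-Fa-f]{2})", line):
            enc.eot = int(m.group(1), 16)
        elif m := BYTE_TO_CHAR.fullmatch(line):            # <xx>=<c>
            enc.encode[_char(m.group(2))] = bytes([int(m.group(1), 16)])
        elif m := BYTE_RANGE.fullmatch(line):              # <xx>-<xx>=<c><c>...
            first, last, chars = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
            if last - first + 1 != len(chars):
                raise ValueError(f"line {number}: wrong number of characters for range")
            for offset, ch in enumerate(chars):
                enc.encode[ch] = bytes([first + offset])
        elif m := CHAR_TO_BYTES.fullmatch(line):           # <c>=<xx>, <c>=<xx><xx>, ...
            enc.encode[_char(m.group(1))] = bytes.fromhex(m.group(2))
        elif m := CHAR_RANGE.fullmatch(line):              # a-z=<xx>
            base = int(m.group(3), 16)
            for offset, code in enumerate(range(ord(m.group(1)), ord(m.group(2)) + 1)):
                enc.encode[chr(code)] = bytes([base + offset])
        elif m := ESCAPE.fullmatch(line):                  # {<escape code>}=<xx>...
            enc.escapes[m.group(1)] = bytes.fromhex(m.group(2))
        else:
            raise ValueError(f"line {number}: unsupported or malformed line: {line!r}")
    if enc.name is None or enc.eot is None:
        # BUILTIN is not handled in this sketch, so EOT is always required here.
        raise ValueError("NAME and EOT are both required")
    return enc
```

Because every pattern insists on exactly two hex digits per byte, lines such as `EOT=0`, `0=@` and `@=0` fall through to the final branch and are rejected, which matches the rules stated in the list.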
