mirror of https://github.com/KarolS/millfork.git synced 2025-01-08 22:30:34 +00:00
[< back to index](../doc_index.md)
### Defining custom encodings
Every encoding is defined in a `.tbl` file with an appropriate name.
The file is looked up in the directories on the include path, first directly, then in the `encoding` subdirectory.
The file is a UTF-8 text file, with each line having a specific meaning.
In the specifications below, the angle brackets `<>` are placeholders and are not meant literally:
* lines starting with `#`, `;` or `//` are comments.
* `ALIAS=<another encoding name>` defines this encoding to be an alias for another encoding.
If this directive is present, no other lines are allowed in the file.
* `NAME=<name>` defines the name for this encoding. Required.
* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
If this directive is present, the only other allowed directive in the file is the `NAME` directive.
* `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
Required, unless `BUILTIN` is present.
There must be exactly two digits; `EOT=0` is invalid.
* lines like `<xx>=<c>` where `<xx>` are two hex digits
and `<c>` is either a **non-whitespace** character or a **BMP** Unicode codepoint written as `U+xxxx`,
define the byte `<xx>` to correspond to character `<c>`.
There must be exactly two digits; `0=@` is invalid.
* lines like `<xx>-<xx>=<c><c><c><c>` where `<c>` is repeated an appropriate number of times
define characters for multiple byte values.
In lines of this kind, characters cannot be written as Unicode codepoints.
* lines like `<c>=<xx>`, `<c>=<xx><xx>` etc.
define secondary or alternate characters that are going to be represented as one or more bytes.
Each byte must be written with exactly two digits; `@=0` is invalid.
Problematic characters (space, `=`, `#`, `;`) can be written as Unicode codepoints `U+xxxx`.
* a line like `a-z=<xx>` is equivalent to the lines `a=<xx>`, `b=<xx+$01>`, and so on up to `z=<xx+$19>`.
* a line like `KATAKANA=>DECOMPOSE` means that katakana characters with dakuten or handakuten
should be split into the base character and the standalone dakuten/handakuten.
* similarly with `HIRAGANA=>DECOMPOSE`.
* lines like `{<escape code>}=<xx>`, `{<escape code>}=<xx><xx>` etc.
define escape codes. It is good practice to define these when possible:
`{q}`, `{apos}`, `{n}`, `{lbrace}`, `{rbrace}`,
`{yen}`, `{pound}`, `{cent}`, `{euro}`, `{copy}`, `{pi}`,
`{nbsp}`, `{shy}`.
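
As an illustration of the rules above, here is a minimal sketch of a complete encoding file. The encoding name `EXAMPLE` and all byte values are hypothetical, chosen only to show the syntax; they do not describe any encoding shipped with Millfork.

```
; hypothetical example encoding, for illustration only
NAME=EXAMPLE
EOT=00
; single bytes: space has to be written as a codepoint
20=U+0020
40=@
; bytes 41 to 5A map to 26 characters
41-5A=ABCDEFGHIJKLMNOPQRSTUVWXYZ
; lowercase letters as alternates for the same bytes: a=41, b=42, ... z=5A
a-z=41
; escape codes (byte values are assumptions)
{n}=0D
{q}=22
{apos}=27
{lbrace}=7B
{rbrace}=7D
```

Under the same assumptions, a second file that only aliases this hypothetical encoding would contain a single line:

```
ALIAS=EXAMPLE
```

A file that wraps one of the built-in UTF encodings (again with a made-up name) would contain only the `NAME` and `BUILTIN` directives:

```
NAME=EXAMPLE-UTF8
BUILTIN=UTF-8
```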