The big text encoding overhaul

2026-04-25 03:16:44 +00:00 · 2020-05-01 01:31:54 +02:00
parent a0aa9d418d
commit 7f9bd18bdd
132 changed files with 2453 additions and 697 deletions
@@ -26,6 +26,8 @@

 * [List of text encodings and escape sequences](lang/text.md)

+* [Defining custom encodings](lang/custom-encoding.md)
+
 * [Operators reference](lang/operators.md)

 * [Functions](lang/functions.md)
@@ -0,0 +1,52 @@
+[< back to index](../doc_index.md)
+
+### Defining custom encodings
+
+Every encoding is defined in a `.tbl` file with an appropriate name.
+The file is looked up in the directories on the include path, first directly, then in the `encoding` subdirectory.
+
+The file is a UTF-8 text file, with each line having a specific meaning.
+In the specifications below, the `<>` brackets are not meant literally:
+
+* lines starting with `#`, `;` or `//` are comments.
+
+* `ALIAS=<another encoding name>` defines this encoding to be an alias for another encoding.
+No other lines are allowed in the file.
+
+* `NAME=<name>` defines the name for this encoding. Required.
+
+* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
+`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
+If this directive is present, the only other allowed directive in the file is the `NAME` directive.
+
+* `EOT=<xx>`, where `<xx>` are two hex digits, defines the string terminator byte.
+Required, unless `BUILTIN` is present.
+There must be exactly two digits; `EOT=0` is invalid.
+
+* lines like `<xx>=<c>` where `<xx>` are two hex digits
+and `<c>` is either a **non-whitespace** character or a **BMP** Unicode codepoint written as `U+xxxx`,
+define the byte `<xx>` to correspond to character `<c>`.
+There must be exactly two digits; `0=@` is invalid.
+
+* lines like `<xx>-<xx>=<c><c><c><c>` where `<c>` is repeated an appropriate number of times
+define characters for multiple byte values.
+In lines of this kind, characters cannot be represented as Unicode codepoints.
+
+* lines like `<c>=<xx>`, `<c>=<xx><xx>` etc.
+define secondary or alternate characters that are going to be represented as one or more bytes.
+Each byte must be written with exactly two digits; `@=0` is invalid.
+Problematic characters (space, `=`, `#`, `;`) can be written as Unicode codepoints `U+xxxx`.
+
+* a line like `a-z=<xx>` is equivalent to lines `a=<xx>`, `b=<xx+$01>` all the way to `z=<xx+$19>`.
+
+* a line like `KATAKANA=>DECOMPOSE` means that katakana characters with dakuten or handakuten
+should be split into the base character and the standalone dakuten/handakuten.
+
+* similarly with `HIRAGANA=>DECOMPOSE`.
+
+* lines like `{<escape code>}=<xx>`, `{<escape code>}=<xx><xx>` etc.
+define escape codes. It's good practice to define these when possible:
+`{q}`, `{apos}`, `{n}`, `{lbrace}`, `{rbrace}`, 
+`{yen}`, `{pound}`, `{cent}`, `{euro}`, `{copy}`, `{pi}`,
+`{nbsp}`, `{shy}`.
+
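The directives listed above can be sketched as a small parser. The Python below is an illustration only, not Millfork's actual loader: the names `parse_tbl` and `encode_text`, the layout of the returned dictionary, and the `demo` sample encoding are all hypothetical. It assumes that `<xx>=<c>` lines define both directions of the mapping while `<c>=<xx>` lines only add extra ways to encode a character, and it does not enforce the exclusivity rules for `ALIAS` and `BUILTIN`.

```python
import re

H = "[0-9A-Fa-f]{2}"                 # exactly two hex digits
C = r"(U\+[0-9A-Fa-f]{4}|\S)"        # U+xxxx or one non-whitespace character

RE_ESCAPE = re.compile(r"\{(\w+)\}=((?:" + H + ")+)")
RE_BYTE_RANGE = re.compile("(" + H + ")-(" + H + r")=(\S+)")
RE_CHAR_RANGE = re.compile(r"(\S)-(\S)=(" + H + ")")
RE_BYTE = re.compile("(" + H + ")=" + C)
RE_ALT = re.compile(C + "=((?:" + H + ")+)")


def parse_char(token):
    """Return the character for a literal or a U+xxxx token."""
    return chr(int(token[2:], 16)) if len(token) == 6 else token


def parse_tbl(text):
    """Parse .tbl text into a dictionary (hypothetical structure)."""
    enc = {"name": None, "alias": None, "builtin": None, "eot": None,
           "decode": {}, "encode": {}, "escapes": {}, "decompose": set()}

    def primary(byte, ch):
        # Assumption: a <xx>=<c> line maps in both directions.
        enc["decode"][byte] = ch
        enc["encode"].setdefault(ch, bytes([byte]))

    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith(("#", ";", "//")):
            continue  # blank line or comment
        key, _, value = line.partition("=")
        if key in ("NAME", "ALIAS", "BUILTIN"):
            enc[key.lower()] = value
        elif key == "EOT":
            if not re.fullmatch(H, value):
                raise ValueError(f"line {number}: EOT needs two hex digits")
            enc["eot"] = int(value, 16)
        elif key in ("KATAKANA", "HIRAGANA") and value == ">DECOMPOSE":
            enc["decompose"].add(key)  # only recorded, not applied here
        elif m := RE_ESCAPE.fullmatch(line):
            enc["escapes"][m[1]] = bytes.fromhex(m[2])
        elif m := RE_BYTE_RANGE.fullmatch(line):
            lo, hi = int(m[1], 16), int(m[2], 16)
            if len(m[3]) != hi - lo + 1:
                raise ValueError(f"line {number}: wrong number of characters")
            for offset, ch in enumerate(m[3]):
                primary(lo + offset, ch)
        elif m := RE_CHAR_RANGE.fullmatch(line):
            base = int(m[3], 16)
            for offset in range(ord(m[2]) - ord(m[1]) + 1):
                enc["encode"][chr(ord(m[1]) + offset)] = bytes([base + offset])
        elif m := RE_BYTE.fullmatch(line):
            primary(int(m[1], 16), parse_char(m[2]))
        elif m := RE_ALT.fullmatch(line):
            enc["encode"][parse_char(m[1])] = bytes.fromhex(m[2])
        else:
            raise ValueError(f"line {number}: cannot parse {line!r}")
    if enc["alias"] is None and enc["builtin"] is None and enc["eot"] is None:
        raise ValueError("EOT is required unless BUILTIN is present")
    return enc


def encode_text(enc, text):
    """Encode text with the parsed table; raises KeyError on unmapped characters."""
    return b"".join(enc["encode"][ch] for ch in text)


# A hypothetical uppercase-only encoding exercising most directive kinds.
SAMPLE = """\
# hypothetical demo encoding
NAME=demo
EOT=00
41-43=ABC
20=U+0020
a-c=41
é=45
{n}=0D0A
KATAKANA=>DECOMPOSE
"""
```

With this sample, `encode_text(parse_tbl(SAMPLE), "Cab é")` yields the bytes `43 41 42 20 45`: the `a-c=41` range folds lowercase letters onto the uppercase byte values, and `é` falls back to the byte given on its own line.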
@@ -1,6 +1,13 @@
 [< back to index](../doc_index.md)

-# Text encodings ans escape sequences
+# Text encodings and escape sequences
+
+### Defining custom encodings
+
+Every encoding is defined in a `.tbl` file with an appropriate name.
+The file is looked up in the directories on the include path, first directly, then in the `encoding` subdirectory.
+
+TODO: document the file format.

 ### Text encoding list

@@ -11,19 +18,25 @@

 * `ascii` – standard ASCII

-* `pet` or `petscii` – PETSCII (ASCII-like character set used by Commodore machines from VIC-20 onward)
+* `petscii` or `pet` – PETSCII (ASCII-like character set used by Commodore machines from VIC-20 onward)

-* `petjp` or `petsciijp` – PETSCII as used on Japanese versions of Commodore 64
+* `petsciijp` or `petjp` – PETSCII as used on Japanese versions of Commodore 64

-* `origpet` or `origpetscii` – old PETSCII (Commodore PET with original ROMs)
+* `origpetscii` or `origpet` – old PETSCII (Commodore PET with original ROMs)

-* `oldpet` or `oldpetscii` – old PETSCII (Commodore PET with newer ROMs)
+* `oldpetscii` or `oldpet` – old PETSCII (Commodore PET with newer ROMs)

 * `cbmscr` or `petscr` – Commodore screencodes

 * `cbmscrjp` or `petscrjp` – Commodore screencodes as used on Japanese versions of Commodore 64

-* `apple2` – Apple II charset ($A0–$DF)
+* `apple2` – original Apple II charset ($A0–$DF)
+
+* `apple2e` – Apple IIe charset
+
+* `apple2c` – alternative Apple IIc charset
+
+* `apple2gs` – Apple IIgs charset

 * `bbc` – BBC Micro character set

@@ -37,15 +50,51 @@

 * `iso_de`, `iso_no`, `iso_se`, `iso_yu` – various variants of ISO/IEC-646
 
-* `iso_dk`, `iso_fi` – aliases for `iso_no` and `iso_se` respectively
+    * `iso_dk`, `iso_fi` – aliases for `iso_no` and `iso_se` respectively

-* `iso15` – ISO 8859-15
+* `iso8859_1`, `iso8859_2`, `iso8859_3`,
+`iso8859_4`, `iso8859_5`, `iso8859_7`,
+`iso8859_9`, `iso8859_10`, `iso8859_13`,
+`iso8859_14`, `iso8859_15`, `iso8859_16` – 
+ISO 8859-1, ISO 8859-2, ISO 8859-3,
+ISO 8859-4, ISO 8859-5, ISO 8859-7,
+ISO 8859-9, ISO 8859-10, ISO 8859-13,
+ISO 8859-14, ISO 8859-15, ISO 8859-16 respectively

-* `latin0`, `latin9`, `iso8859_15` – aliases for `iso15`
+    * `iso1`, `latin1` – aliases for `iso8859_1`
+    * `iso2`, `latin2` – aliases for `iso8859_2`
+    * `iso3`, `latin3` – aliases for `iso8859_3`
+    * `iso4`, `latin4` – aliases for `iso8859_4`
+    * `iso5` – alias for `iso8859_5`
+    * `iso7` – alias for `iso8859_7`
+    * `iso9`, `latin5` – aliases for `iso8859_9`
+    * `iso10`, `latin6` – aliases for `iso8859_10`
+    * `iso13`, `latin7` – aliases for `iso8859_13`
+    * `iso14`, `latin8` – aliases for `iso8859_14`
+    * `iso15`, `latin9`, `latin0` – aliases for `iso8859_15`
+    * `iso16`, `latin10` – aliases for `iso8859_16`
+
+* `cp437`, `cp850`, `cp851`, `cp852`, `cp855`, `cp858`, `cp866` –
+DOS codepages 437, 850, 851, 852, 855, 858, 866
+
+* `mazovia` – Mazovia encoding
+
+* `kamenicky` – Kamenický encoding
+
+* `cp1250`, `cp1251`, `cp1252` – Windows codepages 1250, 1251, 1252

 * `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively

-* `msx_us`, `msx_uk`, `msx_fr`, `msx_de` – aliases for `msx_intl`
+    * `msx_us`, `msx_uk`, `msx_fr`, `msx_de` – aliases for `msx_intl`
+    
+* `cpc_en`, `cpc_fr`, `cpc_es`, `cpc_da` – Amstrad CPC character encoding, English, French, Spanish and Danish respectively
+
+* `pcw` or `amstrad_cpm` – Amstrad CP/M encoding, the US variant (language 0), as used on PCW machines
+
+* `pokemon1en`, `pokemon1jp`, `pokemon1es`, `pokemon1fr` – text encodings used in 1st generation Pokémon games,
+English, Japanese, Spanish/Italian and French/German respectively
+
+    * `pokemon1it`, `pokemon1de` – aliases for `pokemon1es` and `pokemon1fr` respectively
 
 * `atascii` or `atari` – ATASCII as seen on Atari 8-bit computers
 
@@ -55,13 +104,21 @@

 * `vectrex` – built-in Vectrex font

+* `galaksija` – text encoding used on Galaksija computers
+
+* `ebcdic` – EBCDIC codepage 037 (partial coverage)
+
 * `utf8` – UTF-8

 * `utf16be`, `utf16le` – UTF-16BE and UTF-16LE

 When programming for Commodore,
-use `pet` for strings you're printing using standard I/O routines
-and `petscr` for strings you're copying to screen memory directly.
+use `petscii` for strings you're printing using standard I/O routines
+and `petscr` for strings you're copying to screen memory directly.
+
+When programming for Atari,
+use `atascii` for strings you're printing using standard I/O routines
+and `atasciiscr` for strings you're copying to screen memory directly.

 ### Escape sequences

@@ -71,8 +128,6 @@ Some escape sequences may expand to multiple characters. For example, in several

 ##### Available everywhere

-* `{q}` – double quote symbol
-
 * `{x00}`–`{xff}` – a character of the given hexadecimal value

 * `{copyright_year}` – this expands to the current year in digits
@@ -89,12 +144,15 @@ The exact value of `{nullchar}` is encoding-dependent:
    * in the `zx81` encoding it's `{x0b}`,
    * in the `petscr` and `petscrjp` encodings it's `{xe0}`,
    * in the `atasciiscr` encoding it's `{xdb}`,
+    * in the `pokemon1*` encodings it's `{x50}`,
    * in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
    * in other encodings it's `{x00}` (this may be subject to change in future versions).

 ##### Available only in some encodings

-* `{apos}` – apostrophe/single quote (available everywhere except for `zx80` and `zx81`)
+* `{apos}` – apostrophe/single quote (available everywhere except for `zx80`, `zx81` and `galaksija`)
+
+* `{q}` – double quote symbol (available everywhere except for `pokemon1*` encodings)

 * `{n}` – new line

@@ -105,19 +163,25 @@ The exact value of `{nullchar}` is encoding-dependent:
 * `{up}`, `{down}`, `{left}`, `{right}` – control codes for moving the cursor

 * `{white}`, `{black}`, `{red}`, `{green}`, `{blue}`, `{cyan}`, `{yellow}`, `{purple}` – 
-control codes for changing the text color
+control codes for changing the text color (`petscii`, `petsciijp`, `sinclair` only)

 * `{bgwhite}`, `{bgblack}`, `{bgred}`, `{bggreen}`, `{bgblue}`, `{bgcyan}`, `{bgyellow}`, `{bgpurple}` – 
-control codes for changing the text background color
+control codes for changing the text background color (`sinclair` only)

 * `{reverse}`, `{reverseoff}` – inverted mode on/off

 * `{yen}`, `{pound}`, `{cent}`, `{euro}`, `{copy}` – yen symbol, pound symbol, cent symbol, euro symbol, copyright symbol

+* `{nbsp}`, `{shy}` – non-breaking space, soft hyphen
+
+* `{pi}` – letter π
+
 * `{u0000}`–`{u1fffff}` – Unicode codepoint (available in UTF encodings only)

 ##### Character availability

+For ISO/DOS/Windows/UTF encodings, consult external sources.
+
 Encoding | lowercase letters | backslash | currencies | intl | card suits  
 ---------|-------------------|-----------|------------|------|-----------  
 `pet`,              | yes¹ | no  | £    | none      | yes¹  
@@ -132,14 +196,20 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
 `atascii`           | yes  | yes |      | none      | yes  
 `atasciiscr`        | yes  | yes |      | none      | yes  
 `jis`               | yes  | no  | ¥    | both kana | no  
-`iso15`             | yes  | yes | €¢£¥ | Western   | no   
 `msx_intl`,`msx_br` | yes  | yes | ¢£¥  | Western   | yes   
 `msx_jp`            | yes  | no  | ¥    | katakana  | yes   
 `msx_ru`            | yes  | yes |      | Russian⁴  | yes   
 `koi7n2`            | no   | yes |      | Russian⁵  | no   
+`cpc_en`            | yes  | yes | £    | none      | yes 
+`cpc_es`            | yes  | yes |      | Spanish⁶  | yes 
+`cpc_fr`            | yes  | no  | £    | French⁷   | yes 
+`cpc_da`            | yes  | no  | £    | Nor/Dan.  | yes 
 `vectrex`           | no   | yes |      | none      | no   
-`utf*`              | yes  | yes | all  | all       | yes  
-all the rest        | yes  | yes |      | none      | no  
+`pokemon1jp`        | no   | no  |      | both kana | no
+`pokemon1en`        | yes  | no  |      | none      | no
+`pokemon1fr`        | yes  | no  |      | Ger/Fre.  | no
+`pokemon1es`        | yes  | no  |      | Spa/Ita.  | no
+`galaksija`         | no   | no  |      | Yugoslav⁸ | no
  
 1. `pet`, `origpet` and `petscr` cannot display card suit symbols and lowercase letters at the same time.
 Card suit symbols are only available in graphics mode,
@@ -155,6 +225,12 @@ Card suit symbols are only available in graphics mode, in which katakana is disp

 5. Only uppercase. Letters **Ё** and **Ъ** are not available.

+6. No accented vowels.
+
+7. Some accented vowels are not available.
+
+8. Letter **Đ** is not available.
+
 If the encoding does not support lowercase letters (e.g. `apple2`, `petjp`, `petscrjp`, `koi7n2`, `vectrex`),
 then text and character literals containing lowercase letters are automatically converted to uppercase. 
 Only unaccented Latin and Cyrillic letters will be converted as such.
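The conversion rule in the paragraph above can be sketched as follows. This is a hedged illustration in Python, not the compiler's code: the function name `uppercase_fallback` is hypothetical, and treating only `a`–`z` and `а`–`я` as "unaccented" (so that `ё` is left alone) is an assumption.

```python
def uppercase_fallback(text):
    """Uppercase only unaccented Latin and Cyrillic letters; leave the rest."""
    out = []
    for ch in text:
        # Assumption: "unaccented" means a-z and Cyrillic а-я (U+0430..U+044F).
        if "a" <= ch <= "z" or "а" <= ch <= "я":
            out.append(ch.upper())
        else:
            out.append(ch)  # accented letters, digits, symbols stay unchanged
    return "".join(out)
```

Under these assumptions, accented letters such as `é` pass through unchanged and would then have to exist in the target encoding as they are.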
@@ -163,6 +239,8 @@ To detect if your default encoding does not support lowercase letters, test `'A'

 ##### Escape sequence availability

+The table below may be incomplete.
+
 Encoding | new line | braces | backspace | cursor movement | text colour | reverse | background colour  
 ---------|----------|--------|-----------|-----------------|-------------|---------|------------------  
 `pet`,`petjp`       | yes | no  | no  | yes | yes | yes | no  
@@ -172,8 +250,11 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
 `sinclair`          | yes | yes | no  | yes | yes | yes | yes  
 `zx80`,`zx81`       | yes | no  | yes | yes | no  | no  | no  
 `ascii`, `iso_*`    | yes | yes | yes | no  | no  | no  | no  
-`iso15`             | yes | yes | yes | no  | no  | no  | no  
+`iso8859_*`, `cp*`  | yes | yes | yes | no  | no  | no  | no  
 `apple2`            | no  | yes | no  | no  | no  | no  | no  
+`apple2`            | no  | no  | no  | no  | no  | no  | no  
+`apple2e`           | no  | yes | no  | no  | no  | no  | no  
+`apple2gs`          | no  | yes | no  | no  | no  | no  | no  
 `atascii`           | yes | no  | yes | yes | no  | no  | no  
 `atasciiscr`        | no  | no  | no  | no  | no  | no  | no  
 `msx_*`             | yes | yes | yes | yes | no  | no  | no