Add several more encodings

2026-04-26 10:20:51 +00:00 · 2021-03-13 21:39:48 +01:00
parent 0bbdc348e7
commit 66fc1d3984
12 changed files with 234 additions and 15 deletions
@@ -16,7 +16,7 @@ No other lines are allowed in the file.
 * `NAME=<name>` defines the name for this encoding. Required.

 * `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
-`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
+`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
 If this directive is present, the only other allowed directive in the file is the `NAME` directive.

 * `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
@@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
 You can override the values for `nullchar` and `nullchar_scr`
 by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively. 

-Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
+Warning: If you define UTF-16 or UTF-32 to be your default or screen encoding, you will encounter several problems:

 * `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
 * the `string` module in the Millfork standard library will not work correctly
@@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
 You can also prepend `p` to the name of the encoding to make the string length-prefixed.

 The length is measured in bytes and doesn't include the zero terminator, if present.
-In all encodings except for UTF-16 the prefix takes one byte,
+In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
 which means that length-prefixed strings cannot be longer than 255 bytes.
 
 In case of UTF-16, the length prefix contains the number of code units,
 so the number of bytes divided by two,
 which allows for strings of practically unlimited length.
-The length is stores as two bytes and is always little endian,
+The length is stored as two bytes and is always little endian,
 even in case of the `utf16be` encoding or a big-endian processor.
+ 
+In case of UTF-32, the length prefix contains the number of Unicode codepoints,
+so the number of bytes divided by four.
+The length is stored as four bytes and is always little endian,
+even in case of the `utf32be` encoding or a big-endian processor.

        "this is a Pascal string" pascii
        "this is also a Pascal string"p
        "this is a zero-terminated Pascal string"pz

 Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
-To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
+To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):

    pointer p
    p = "test"pz
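
Review note on the hunk above: the length-prefix rules can be checked with a small byte-layout sketch. This is Python rather than Millfork, and `length_prefixed` is a hypothetical helper written only for illustration, not part of the compiler. It assumes Python's BOM-free codec names (`utf-16-le`, `utf-16-be`, `utf-32-le`, `utf-32-be`) stand in for `utf16le`, `utf16be`, `utf32le`, `utf32be`, and that the terminator is all zero bytes, which the documentation states only for some encodings.

```python
import struct

def length_prefixed(text: str, encoding: str, zero_terminated: bool = False) -> bytes:
    """Hypothetical helper: lay out a 'p' or 'pz' string as the docs describe it."""
    # The prefix width equals the code unit width: 1 byte for byte-oriented
    # encodings, 2 for UTF-16, 4 for UTF-32. The prefix is always
    # little-endian, even for the big-endian variants.
    if encoding.startswith("utf-16"):
        unit, fmt = 2, "<H"
    elif encoding.startswith("utf-32"):
        unit, fmt = 4, "<I"
    else:
        unit, fmt = 1, "<B"
    payload = text.encode(encoding)
    # The stored length counts code units and excludes the terminator.
    count = len(payload) // unit
    prefix = struct.pack(fmt, count)  # raises struct.error if count does not fit
    # Assumption: the terminator is one all-zero code unit.
    terminator = b"\x00" * unit if zero_terminated else b""
    return prefix + payload + terminator

# The offset to add before passing a 'pz' string to a function expecting a
# zero-terminated string is the prefix width: 1, 2, or 4 bytes.
ascii_pz = length_prefixed("test", "ascii", zero_terminated=True)
utf32_pz = length_prefixed("hi", "utf-32-le", zero_terminated=True)
print(ascii_pz[1:])  # plain zero-terminated bytes, prefix skipped
print(utf32_pz[4:])
```

With a one-byte prefix, `struct.pack("<B", ...)` fails for payloads longer than 255 bytes, which mirrors the limit stated in the documentation.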
@@ -26,6 +26,8 @@ TODO: document the file format.

 * `oldpetscii` or `oldpet` – old PETSCII (Commodore PET with newer ROMs)

+* `geos_de` – text encoding used by the German version of GEOS for C64
+
 * `cbmscr` or `petscr` – Commodore screencodes

 * `cbmscrjp` or `petscrjp` – Commodore screencodes as used on Japanese versions of Commodore 64
@@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866

 * `kamenicky` – Kamenický encoding

-* `cp1250`, `cp1251`, `cp1252` – Windows codepages 1250, 1251, 1252
+* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` – Windows codepages 1250, 1251, 1252, 1253, 1254, 1257

 * `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively

@@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively

 * `utf16be`, `utf16le` – UTF-16BE and UTF-16LE

+* `utf32be`, `utf32le` – UTF-32BE and UTF-32LE
+
 When programming for Commodore,
 use `petscii` for strings you're printing using standard I/O routines
 and `petsciiscr` for strings you're copying to screen memory directly.
@@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
    * in the `zx80` encoding it's `{x01}`,
    * in the `zx81` encoding it's `{x0b}`,
    * in the `petscr` and `petscrjp` encodings it's `{xe0}`,
+    * in the `apple2e` encoding it's `{x7f}`,
    * in the `atasciiscr` encoding it's `{xdb}`,
    * in the `pokemon1*` encodings it's `{x50}`,
    * in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
    * in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
+    * in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
    * in other encodings it's `{x00}` (this may be subject to change in future versions).

 ##### Available only in some encodings
@@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
 `petscr`            | yes¹ | no  | £    | none      | yes¹  
 `petjp`             | no   | no  | ¥    | katakana³ | yes³  
 `petscrjp`          | no   | no  | ¥    | katakana³ | yes³  
+`geos_de`           | yes  | no  |      |           | no  
 `sinclair`, `bbc`   | yes  | yes | £    | none      | no  
 `zx80`, `zx81`      | no   | no  | £    | none      | no  
 `apple2`            | no   | yes |      | none      | no  
@@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
 `origpet`           | yes | no  | no  | yes | no  | yes | no  
 `oldpet`            | yes | no  | no  | yes | no  | yes | no  
 `petscr`, `petscrjp`| no  | no  | no  | no  | no  | no  | no  
+`geos_de`           | no  | no  | no  | no  | no  | yes | no
 `sinclair`          | yes | yes | no  | yes | yes | yes | yes  
 `zx80`,`zx81`       | yes | no  | yes | yes | no  | no  | no  
 `ascii`, `iso_*`    | yes | yes | yes | no  | no  | no  | no