mirror of https://github.com/KarolS/millfork.git synced 2026-04-26 10:20:51 +00:00

Add several more encodings

Karol Stasiak
2021-03-13 21:39:48 +01:00
parent 0bbdc348e7
commit 66fc1d3984
12 changed files with 234 additions and 15 deletions
+1 -1
@@ -16,7 +16,7 @@ No other lines are allowed in the file.
* `NAME=<name>` defines the name for this encoding. Required.
* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
-`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
+`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
If this directive is present, the only other allowed directive in the file is the `NAME` directive.
* `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
+9 -4
@@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
You can override the values for `nullchar` and `nullchar_scr`
by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively.
-Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
+Warning: If you define UTF-16 or UTF-32 to be you default or screen encoding, you will encounter several problems:
* `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
* the `string` module in the Millfork standard library will not work correctly
@@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
You can also prepend `p` to the name of the encoding to make the string length-prefixed.
The length is measured in bytes and doesn't include the zero terminator, if present.
-In all encodings except for UTF-16 the prefix takes one byte,
+In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
which means that length-prefixed strings cannot be longer than 255 bytes.
In case of UTF-16, the length prefix contains the number of code units,
so the number of bytes divided by two,
which allows for strings of practically unlimited length.
-The length is stores as two bytes and is always little endian,
+The length is stored as two bytes and is always little endian,
even in case of the `utf16be` encoding or a big-endian processor.
+In case of UTF-32, the length prefix contains the number of Unicode codepoints,
+so the number of bytes divided by four.
+The length is stored as four bytes and is always little endian,
+even in case of the `utf32be` encoding or a big-endian processor.
"this is a Pascal string" pascii
"this is also a Pascal string"p
"this is a zero-terminated Pascal string"pz
Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
-To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
+To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):
pointer p
p = "test"pz
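The length-prefix rules in this hunk can be illustrated outside Millfork. The following Python sketch is not part of the commit and does not reflect the compiler's implementation; the function names `length_prefixed` and `skip_prefix` are hypothetical, and Python's `utf-16-le`, `utf-16-be`, `utf-32-le` and `utf-32-be` codecs stand in for the `utf16*` and `utf32*` encodings. It assumes the zero terminator is as wide as one code unit, as the `{nullchar}` list in the third file suggests.

```python
# Prefix width in bytes: 2 for UTF-16, 4 for UTF-32, 1 for everything else.
PREFIX_WIDTH = {"utf-16-le": 2, "utf-16-be": 2, "utf-32-le": 4, "utf-32-be": 4}


def length_prefixed(text: str, codec: str = "ascii", zero_terminated: bool = False) -> bytes:
    """Lay out a length-prefixed string (like the `p` / `pz` suffixes)."""
    width = PREFIX_WIDTH.get(codec, 1)
    body = text.encode(codec)
    # UTF-16 counts code units, UTF-32 counts codepoints, others count bytes.
    count = len(body) // width
    if width == 1 and count > 255:
        raise ValueError("one-byte prefix cannot describe more than 255 bytes")
    # The prefix is always little endian, even for the big-endian codecs.
    prefix = count.to_bytes(width, "little")
    # The terminator is not included in the stored length.
    terminator = b"\x00" * width if zero_terminated else b""
    return prefix + body + terminator


def skip_prefix(buffer: bytes, codec: str) -> bytes:
    """Mimic 'add 1 (or 2, or 4)' to reach the zero-terminated part."""
    return buffer[PREFIX_WIDTH.get(codec, 1):]
```

For example, `length_prefixed("hi", "utf-16-be")` yields the prefix bytes `02 00` followed by the big-endian code units, which shows the prefix byte order staying little endian regardless of the encoding.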
+9 -1
@@ -26,6 +26,8 @@ TODO: document the file format.
* `oldpetscii` or `oldpet` old PETSCII (Commodore PET with newer ROMs)
* `geos_de` text encoding used by the German version of GEOS for C64
* `cbmscr` or `petscr` Commodore screencodes
+* `cbmscrjp` or `petscrjp` Commodore screencodes as used on Japanese versions of Commodore 64
@@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866
* `kamenicky` Kamenický encoding
-* `cp1250`, `cp1251`, `cp1252` Windows codepages 1250, 1251, 1252
+* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` Windows codepages 1250, 1251, 1252, 1253, 1254, 1257
* `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` MSX character encoding, International, Japanese, Russian and Brazilian respectively
@@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively
* `utf16be`, `utf16le` UTF-16BE and UTF-16LE
+* `utf32be`, `utf32le` UTF-32BE and UTF-32LE
When programming for Commodore,
use `petscii` for strings you're printing using standard I/O routines
and `petsciiscr` for strings you're copying to screen memory directly.
@@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
* in the `zx80` encoding it's `{x01}`,
* in the `zx81` encoding it's `{x0b}`,
+* in the `petscr` and `petscrjp` encodings it's `{xe0}`,
* in the `apple2e` encoding it's `{x7f}`,
* in the `atasciiscr` encoding it's `{xdb}`,
* in the `pokemon1*` encodings it's `{x50}`,
* in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
* in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
+* in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
* in other encodings it's `{x00}` (this may be a subject to change in future versions).
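As an illustration only, the terminator values visible in this hunk can be written as a lookup table. The Python sketch below is not taken from the Millfork sources; the names `NULLCHAR` and `nullchar_bytes` are hypothetical. It covers only the entries shown in this excerpt, so encodings listed earlier in the documented file may have values that are missing here. The `cocoscr` entry is left out because the excerpt describes it as two bytes but shows only one.

```python
# Terminator bytes per encoding, limited to the entries visible in this hunk.
NULLCHAR = {
    "zx80": b"\x01",
    "zx81": b"\x0b",
    "petscr": b"\xe0",
    "petscrjp": b"\xe0",
    "apple2e": b"\x7f",
    "atasciiscr": b"\xdb",
    "utf16be": b"\x00\x00",
    "utf16le": b"\x00\x00",
    "utf32be": b"\x00\x00\x00\x00",
    "utf32le": b"\x00\x00\x00\x00",
}


def nullchar_bytes(encoding: str) -> bytes:
    """Return the string terminator for an encoding name."""
    if encoding == "cocoscr":
        # Ambiguous in the excerpt, so this sketch refuses to guess.
        raise ValueError("cocoscr terminator is not clear from the excerpt")
    if encoding.startswith("pokemon1"):
        return b"\x50"
    # "in other encodings it's {x00}"
    return NULLCHAR.get(encoding, b"\x00")
```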
##### Available only in some encodings
@@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
`petscr` | yes¹ | no | £ | none | yes¹
`petjp` | no | no | ¥ | katakana³ | yes³
+`petscrjp` | no | no | ¥ | katakana³ | yes³
`geos_de` | yes | no | | | no
`sinclair`, `bbc` | yes | yes | £ | none | no
`zx80`, `zx81` | no | no | £ | none | no
`apple2` | no | yes | | none | no
@@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
`origpet` | yes | no | no | yes | no | yes | no
`oldpet` | yes | no | no | yes | no | yes | no
+`petscr`, `petscrjp`| no | no | no | no | no | no | no
`geos_de` | no | no | no | no | no | yes | no
`sinclair` | yes | yes | no | yes | yes | yes | yes
`zx80`,`zx81` | yes | no | yes | yes | no | no | no
`ascii`, `iso_*` | yes | yes | yes | no | no | no | no