Mirror of https://github.com/KarolS/millfork.git, synced 2026-04-26 10:20:51 +00:00
Add several more encodings
@@ -16,7 +16,7 @@ No other lines are allowed in the file.
 * `NAME=<name>` defines the name for this encoding. Required.

 * `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
-`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
+`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
 If this directive is present, the only other allowed directive in the file is the `NAME` directive.

 * `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
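The `NAME`/`BUILTIN` rules in the hunk above can be sketched as a small checker. This is only an illustration in Python, not Millfork's own parser: the function name `check_definition` is hypothetical, the `KEY=value` line shape is inferred from the directives shown, and skipping blank lines is an assumption the documentation does not confirm.

```python
# Hypothetical sketch: validates only the NAME/BUILTIN rules quoted in the hunk.
ALLOWED_BUILTINS = {"UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE"}


def check_definition(text: str) -> dict:
    """Parse KEY=value directives and enforce the NAME/BUILTIN rules."""
    directives = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=value directive: {line!r}")
        directives[key] = value

    # NAME is documented as required.
    if "NAME" not in directives:
        raise ValueError("NAME is required")

    if "BUILTIN" in directives:
        # After this commit, UTF-32LE and UTF-32BE are accepted as well.
        if directives["BUILTIN"] not in ALLOWED_BUILTINS:
            raise ValueError(f"unknown builtin: {directives['BUILTIN']!r}")
        # With BUILTIN present, NAME is the only other directive allowed.
        extra = set(directives) - {"NAME", "BUILTIN"}
        if extra:
            raise ValueError(f"not allowed together with BUILTIN: {sorted(extra)}")
    return directives
```

Under these assumptions, a two-line definition such as `NAME=myutf32` followed by `BUILTIN=UTF-32LE` passes, while adding an `EOT` line to it is rejected.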
@@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
 You can override the values for `nullchar` and `nullchar_scr`
 by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively.

-Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
+Warning: If you define UTF-16 or UTF-32 to be you default or screen encoding, you will encounter several problems:

 * `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
 * the `string` module in the Millfork standard library will not work correctly
@@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
 You can also prepend `p` to the name of the encoding to make the string length-prefixed.

 The length is measured in bytes and doesn't include the zero terminator, if present.
-In all encodings except for UTF-16 the prefix takes one byte,
+In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
 which means that length-prefixed strings cannot be longer than 255 bytes.

 In case of UTF-16, the length prefix contains the number of code units,
 so the number of bytes divided by two,
 which allows for strings of practically unlimited length.
-The length is stores as two bytes and is always little endian,
+The length is stored as two bytes and is always little endian,
 even in case of the `utf16be` encoding or a big-endian processor.

+In case of UTF-32, the length prefix contains the number of Unicode codepoints,
+so the number of bytes divided by four.
+The length is stored as four bytes and is always little endian,
+even in case of the `utf32be` encoding or a big-endian processor.
+
 "this is a Pascal string" pascii
 "this is also a Pascal string"p
 "this is a zero-terminated Pascal string"pz

 Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
-To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
+To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):

 pointer p
 p = "test"pz
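The length-prefix rules in the hunk above can be illustrated with a short Python sketch of the resulting byte layout. It is not Millfork code and not the compiler's implementation: `prefix_size` and `length_prefixed` are hypothetical helper names, Python codec names such as `utf-16-be` stand in for Millfork's `utf16be`, and the terminator is assumed to be all zero bytes, which does not hold for every encoding.

```python
import struct


def prefix_size(encoding: str) -> int:
    """Bytes taken by the length prefix: 1, or 2 for UTF-16, or 4 for UTF-32."""
    codec = encoding.lower().replace("_", "-")
    if codec.startswith("utf-16"):
        return 2
    if codec.startswith("utf-32"):
        return 4
    return 1


def length_prefixed(text: str, encoding: str, zero_terminated: bool = False) -> bytes:
    """Lay out a length-prefixed string as the documentation describes it.

    Pass an explicit-endian codec ("utf-16-le", "utf-32-be", ...) so that
    Python does not emit a byte order mark.
    """
    payload = text.encode(encoding)
    size = prefix_size(encoding)
    if size == 1:
        if len(payload) > 255:
            raise ValueError("a one-byte prefix cannot describe more than 255 bytes")
        prefix = bytes([len(payload)])
    elif size == 2:
        # Count of 16-bit code units, always little endian.
        prefix = struct.pack("<H", len(payload) // 2)
    else:
        # Count of codepoints, always little endian.
        prefix = struct.pack("<I", len(payload) // 4)
    # Assumption: the terminator is zero, one code unit wide; it is not counted.
    terminator = b"\x00" * size if zero_terminated else b""
    return prefix + payload + terminator
```

Slicing off `prefix_size(encoding)` bytes from the front of a zero-terminated result corresponds to the "add 1, 2 or 4" adjustment described in the note: what remains is an ordinary zero-terminated string.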
+9
-1
@@ -26,6 +26,8 @@ TODO: document the file format.

 * `oldpetscii` or `oldpet` – old PETSCII (Commodore PET with newer ROMs)

+* `geos_de` – text encoding used by the German version of GEOS for C64
+
 * `cbmscr` or `petscr` – Commodore screencodes

 * `cbmscrjp` or `petscrjp` – Commodore screencodes as used on Japanese versions of Commodore 64
@@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866

 * `kamenicky` – Kamenický encoding

-* `cp1250`, `cp1251`, `cp1252` – Windows codepages 1250, 1251, 1252
+* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` – Windows codepages 1250, 1251, 1252, 1253, 1254, 1257

 * `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively

@@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively

 * `utf16be`, `utf16le` – UTF-16BE and UTF-16LE

+* `utf32be`, `utf32le` – UTF-32BE and UTF-32LE
+
 When programming for Commodore,
 use `petscii` for strings you're printing using standard I/O routines
 and `petsciiscr` for strings you're copying to screen memory directly.
@@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
 * in the `zx80` encoding it's `{x01}`,
 * in the `zx81` encoding it's `{x0b}`,
 * in the `petscr` and `petscrjp` encodings it's `{xe0}`,
+* in the `apple2e` encoding it's `{x7f}`,
 * in the `atasciiscr` encoding it's `{xdb}`,
 * in the `pokemon1*` encodings it's `{x50}`,
 * in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
 * in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
+* in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
 * in other encodings it's `{x00}` (this may be a subject to change in future versions).

 ##### Available only in some encodings
@@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
 `petscr` | yes¹ | no | £ | none | yes¹
 `petjp` | no | no | ¥ | katakana³ | yes³
 `petscrjp` | no | no | ¥ | katakana³ | yes³
+`geos_de` | yes | no | | | no
 `sinclair`, `bbc` | yes | yes | £ | none | no
 `zx80`, `zx81` | no | no | £ | none | no
 `apple2` | no | yes | | none | no
@@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
 `origpet` | yes | no | no | yes | no | yes | no
 `oldpet` | yes | no | no | yes | no | yes | no
 `petscr`, `petscrjp`| no | no | no | no | no | no | no
+`geos_de` | no | no | no | no | no | yes | no
 `sinclair` | yes | yes | no | yes | yes | yes | yes
 `zx80`,`zx81` | yes | no | yes | yes | no | no | no
 `ascii`, `iso_*` | yes | yes | yes | no | no | no | no