mirror of
https://github.com/KarolS/millfork.git
synced 2024-06-09 16:29:34 +00:00
Add several more encodings
parent
0bbdc348e7
commit
66fc1d3984
|
@ -16,7 +16,7 @@ No other lines are allowed in the file.
|
|||
* `NAME=<name>` defines the name for this encoding. Required.
|
||||
|
||||
* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
|
||||
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
|
||||
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
|
||||
If this directive is present, the only other allowed directive in the file is the `NAME` directive.
|
||||
|
||||
* `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
|
||||
|
|
|
@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
|
|||
You can override the values for `nullchar` and `nullchar_scr`
|
||||
by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively.
|
||||
|
||||
Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
|
||||
Warning: If you define UTF-16 or UTF-32 to be your default or screen encoding, you will encounter several problems:
|
||||
|
||||
* `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
|
||||
* the `string` module in the Millfork standard library will not work correctly
|
||||
|
@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
|
|||
You can also prepend `p` to the name of the encoding to make the string length-prefixed.
|
||||
|
||||
The length is measured in bytes and doesn't include the zero terminator, if present.
|
||||
In all encodings except for UTF-16 the prefix takes one byte,
|
||||
In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
|
||||
which means that length-prefixed strings cannot be longer than 255 bytes.
|
||||
|
||||
In case of UTF-16, the length prefix contains the number of code units,
|
||||
so the number of bytes divided by two,
|
||||
which allows for strings of practically unlimited length.
|
||||
The length is stores as two bytes and is always little endian,
|
||||
The length is stored as two bytes and is always little endian,
|
||||
even in case of the `utf16be` encoding or a big-endian processor.
|
||||
|
||||
In case of UTF-32, the length prefix contains the number of Unicode codepoints,
|
||||
so the number of bytes divided by four.
|
||||
The length is stored as four bytes and is always little endian,
|
||||
even in case of the `utf32be` encoding or a big-endian processor.
|
||||
|
||||
"this is a Pascal string" pascii
|
||||
"this is also a Pascal string"p
|
||||
"this is a zero-terminated Pascal string"pz
|
||||
|
||||
Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
|
||||
To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
|
||||
To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):
|
||||
|
||||
pointer p
|
||||
p = "test"pz
|
||||
|
|
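The length-prefix rules described in the hunk above can be illustrated with a small sketch. It is not Millfork's implementation; `LengthPrefixSketch` and its helpers are hypothetical names introduced here, and the sketch assumes only what the documentation states: UTF-16 prefixes count code units in two little-endian bytes, and UTF-32 prefixes count code points in four little-endian bytes.

```scala
object LengthPrefixSketch {
  // Hypothetical helper: the `width` lowest bytes of `value`, least significant first.
  def littleEndian(value: Int, width: Int): List[Int] =
    (0 until width).map(i => (value >>> (8 * i)) & 0xff).toList

  // UTF-16: the prefix holds the number of 16-bit code units, stored in two bytes.
  // A JVM String is already a sequence of UTF-16 code units, so `length` is that count.
  def utf16Prefix(s: String): List[Int] = littleEndian(s.length, 2)

  // UTF-32: the prefix holds the number of code points, stored in four bytes.
  def utf32Prefix(s: String): List[Int] =
    littleEndian(s.codePointCount(0, s.length), 4)
}
```

A character outside the Basic Multilingual Plane counts as two code units in the UTF-16 prefix but as one code point in the UTF-32 prefix, which is why the two helpers count differently.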
|
@ -26,6 +26,8 @@ TODO: document the file format.
|
|||
|
||||
* `oldpetscii` or `oldpet` – old PETSCII (Commodore PET with newer ROMs)
|
||||
|
||||
* `geos_de` – text encoding used by the German version of GEOS for C64
|
||||
|
||||
* `cbmscr` or `petscr` – Commodore screencodes
|
||||
|
||||
* `cbmscrjp` or `petscrjp` – Commodore screencodes as used on Japanese versions of Commodore 64
|
||||
|
@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866
|
|||
|
||||
* `kamenicky` – Kamenický encoding
|
||||
|
||||
* `cp1250`, `cp1251`, `cp1252` – Windows codepages 1250, 1251, 1252
|
||||
* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` – Windows codepages 1250, 1251, 1252, 1253, 1254, 1257
|
||||
|
||||
* `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively
|
||||
|
||||
|
@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively
|
|||
|
||||
* `utf16be`, `utf16le` – UTF-16BE and UTF-16LE
|
||||
|
||||
* `utf32be`, `utf32le` – UTF-32BE and UTF-32LE
|
||||
|
||||
When programming for Commodore,
|
||||
use `petscii` for strings you're printing using standard I/O routines
|
||||
and `petsciiscr` for strings you're copying to screen memory directly.
|
||||
|
@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
|
|||
* in the `zx80` encoding it's `{x01}`,
|
||||
* in the `zx81` encoding it's `{x0b}`,
|
||||
* in the `petscr` and `petscrjp` encodings it's `{xe0}`,
|
||||
* in the `apple2e` encoding it's `{x7f}`,
|
||||
* in the `atasciiscr` encoding it's `{xdb}`,
|
||||
* in the `pokemon1*` encodings it's `{x50}`,
|
||||
* in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
|
||||
* in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
|
||||
* in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
|
||||
* in other encodings it's `{x00}` (this may be subject to change in future versions).
|
||||
|
||||
##### Available only in some encodings
|
||||
|
@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
|
|||
`petscr` | yes¹ | no | £ | none | yes¹
|
||||
`petjp` | no | no | ¥ | katakana³ | yes³
|
||||
`petscrjp` | no | no | ¥ | katakana³ | yes³
|
||||
`geos_de` | yes | no | | | no
|
||||
`sinclair`, `bbc` | yes | yes | £ | none | no
|
||||
`zx80`, `zx81` | no | no | £ | none | no
|
||||
`apple2` | no | yes | | none | no
|
||||
|
@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
|
|||
`origpet` | yes | no | no | yes | no | yes | no
|
||||
`oldpet` | yes | no | no | yes | no | yes | no
|
||||
`petscr`, `petscrjp`| no | no | no | no | no | no | no
|
||||
`geos_de` | no | no | no | no | no | yes | no
|
||||
`sinclair` | yes | yes | no | yes | yes | yes | yes
|
||||
`zx80`,`zx81` | yes | no | yes | yes | no | no | no
|
||||
`ascii`, `iso_*` | yes | yes | yes | no | no | no | no
|
||||
|
|
39
include/encoding/cp1253.tbl
Normal file
|
@ -0,0 +1,39 @@
|
|||
NAME=CP1253
|
||||
EOT=00
|
||||
|
||||
20=U+0020
|
||||
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
|
||||
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
|
||||
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
|
||||
80=€
|
||||
82-87=‚ƒ„…†‡
|
||||
89=‰
|
||||
8b=‹
|
||||
91-97=‘’“”•–—
|
||||
99=™
|
||||
9b=›
|
||||
a1-ac=΅Ά£¤¥¦§¨©ͺ«¬
|
||||
ae-af=®―
|
||||
b0-bf=°±²³΄µ¶·ΈΉΊ»Ό½ΎΏ
|
||||
c0-cf=ΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟ
|
||||
d0-d1=ΠΡ
|
||||
d3-df=ΣΤΥΦΧΨΩΪΫάέήί
|
||||
e0-ef=ΰαβγδεζηθικλμνξο
|
||||
f0-fe=πρςστυφχψωϊϋόύώ
|
||||
|
||||
|
||||
|
||||
{b}=08
|
||||
{t}=09
|
||||
{n}=0d0a
|
||||
{q}=22
|
||||
{apos}=27
|
||||
{lbrace}=7b
|
||||
{rbrace}=7d
|
||||
{euro}=80
|
||||
{pound}=a3
|
||||
{yen}=a5
|
||||
{copy}=a9
|
||||
{pi}=f0
|
||||
{nbsp}=A0
|
||||
{shy}=AD
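Range lines such as `d0-d1=ΠΡ` in the table above assign consecutive byte values to consecutive characters. The sketch below shows one way such a line could be read. It is an illustration only, not the parser Millfork uses; `TblRangeSketch` and `parseRange` are hypothetical names, and rejecting a line whose character count does not match the range is an assumption of this sketch.

```scala
object TblRangeSketch {
  // Matches "<from>-<to>=<characters>", where <from> and <to> are two hex digits each.
  private val RangeLine = "([0-9a-fA-F]{2})-([0-9a-fA-F]{2})=(.+)".r

  // Returns a map from byte value to Unicode code point,
  // or None if the line is not a range line or the character count is wrong.
  def parseRange(line: String): Option[Map[Int, Int]] = line match {
    case RangeLine(from, to, chars) =>
      val first = Integer.parseInt(from, 16)
      val last = Integer.parseInt(to, 16)
      val codePoints = chars.codePoints().toArray.toList
      if (codePoints.length == last - first + 1)
        Some(codePoints.zipWithIndex.map { case (cp, i) => (first + i) -> cp }.toMap)
      else None
    case _ => None
  }
}
```

For example, `TblRangeSketch.parseRange("d0-d1=ΠΡ")` maps `0xd0` to Π and `0xd1` to Ρ, while directive lines such as `NAME=CP1253` do not match and yield `None`.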
|
34
include/encoding/cp1254.tbl
Normal file
|
@ -0,0 +1,34 @@
|
|||
NAME=CP1254
|
||||
EOT=00
|
||||
|
||||
20=U+0020
|
||||
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
|
||||
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
|
||||
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
|
||||
80=€
|
||||
82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
|
||||
91-9c=‘’“”•–—˜™š›œ
|
||||
9f=Ÿ
|
||||
a1-ac=¡¢£¤¥¦§¨©ª«¬
|
||||
ae-af=®¯
|
||||
b0-bf=°±²³´µ¶·¸¹º»¼½¾¿
|
||||
c0-cf=ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
|
||||
d0-df=ĞÑÒÓÔÕÖ×ØÙÚÛÜİŞß
|
||||
e0-ef=àáâãäåæçèéêëìíîï
|
||||
f0-ff=ğñòóôõö÷øùúûüışÿ
|
||||
|
||||
{b}=08
|
||||
{t}=09
|
||||
{n}=0d0a
|
||||
{q}=22
|
||||
{apos}=27
|
||||
{lbrace}=7b
|
||||
{rbrace}=7d
|
||||
{euro}=80
|
||||
{cent}=a2
|
||||
{pound}=a3
|
||||
{yen}=a5
|
||||
{copy}=a9
|
||||
{ss}=df
|
||||
{nbsp}=A0
|
||||
{shy}=AD
|
36
include/encoding/cp1257.tbl
Normal file
|
@ -0,0 +1,36 @@
|
|||
NAME=CP1257
|
||||
EOT=00
|
||||
|
||||
20=U+0020
|
||||
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
|
||||
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
|
||||
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
|
||||
80=€
|
||||
82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
|
||||
8d-8f=¨ˇ¸
|
||||
91-9c=‘’“”•–—˜™š›œ
|
||||
9e-9f=¯˛
|
||||
a1-ac=¡¢£¤¥¦§Ø©Ŗ«¬
|
||||
ae-af=®Æ
|
||||
b0-bf=°±²³´µ¶·ø¹ŗ»¼½¾æ
|
||||
c0-cf=ĄĮĀĆÄÅĘĒČÉŹĖĢĶĪĻ
|
||||
d0-df=ŠŃŅÓŌÕÖ×ŲŁŚŪÜŻŽß
|
||||
e0-ef=ąįāćäåęēčéźėģķīļ
|
||||
f0-ff=šńņóōõö÷ųłśūüżž˙
|
||||
|
||||
|
||||
{b}=08
|
||||
{t}=09
|
||||
{n}=0d0a
|
||||
{q}=22
|
||||
{apos}=27
|
||||
{lbrace}=7b
|
||||
{rbrace}=7d
|
||||
{euro}=80
|
||||
{cent}=a2
|
||||
{pound}=a3
|
||||
{yen}=a5
|
||||
{copy}=a9
|
||||
{ss}=df
|
||||
{nbsp}=A0
|
||||
{shy}=AD
|
20
include/encoding/geos_de.tbl
Normal file
|
@ -0,0 +1,20 @@
|
|||
NAME=GEOS-DE
|
||||
EOT=00
|
||||
|
||||
20=U+0020
|
||||
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
|
||||
40-5f=§ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ^_
|
||||
60-7e=`abcdefghijklmnopqrstuvwxyzäöüß
|
||||
|
||||
{b}=08
|
||||
{t}=09
|
||||
{n}=0d0a
|
||||
{q}=22
|
||||
{apos}=27
|
||||
{AE}=5b
|
||||
{OE}=5c
|
||||
{UE}=5d
|
||||
{ae}=7b
|
||||
{oe}=7c
|
||||
{ue}=7d
|
||||
{ss}=7e
|
2
include/encoding/utf32be.tbl
Normal file
|
@ -0,0 +1,2 @@
|
|||
NAME=UTF-32BE
|
||||
BUILTIN=UTF-32BE
|
2
include/encoding/utf32le.tbl
Normal file
|
@ -0,0 +1,2 @@
|
|||
NAME=UTF-32LE
|
||||
BUILTIN=UTF-32LE
|
|
@ -40,7 +40,27 @@ sealed trait TextCodec {
|
|||
}
|
||||
}
|
||||
|
||||
class UnicodeTextCodec(override val name: String, val charset: Charset, override val stringTerminator: List[Int]) extends TextCodec {
|
||||
abstract class MappedTextCodec(override val name: String, inner: TextCodec) extends TextCodec {
|
||||
override val supportsLowercase: Boolean = inner.supportsLowercase
|
||||
|
||||
override val stringTerminator: List[Int] = inner.stringTerminator.flatMap(this.mapWithEscaping)
|
||||
|
||||
override def encode(log: Logger, position: Option[Position], s: List[Int], options: CompilationOptions, lenient: Boolean): List[Int] =
|
||||
inner.encode(log, position, s, options, lenient).flatMap(this.mapWithEscaping)
|
||||
|
||||
override def decode(by: Int): Char = TextCodec.NotAChar
|
||||
|
||||
override def encodeDigit(digit: Int): List[Int] = inner.encodeDigit(digit).flatMap(this.mapWithEscaping)
|
||||
|
||||
private def mapWithEscaping(byte: Int): List[Int] = {
|
||||
if (byte < 0) List(-1 - byte)
|
||||
else map(byte)
|
||||
}
|
||||
|
||||
def map(byte: Int): List[Int]
|
||||
}
|
||||
|
||||
class UnicodeTextCodec(override val name: String, val optionalCharset: Option[Charset], override val stringTerminator: List[Int], val escapeRawBytes: Boolean = false) extends TextCodec {
|
||||
private val escapeSequences: Map[String, Char] = Map(
|
||||
"n" -> '\n',
|
||||
"r" -> '\r',
|
||||
|
@ -66,7 +86,11 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
|
|||
private def encodeEscapeSequence(log: Logger, escSeq: String, position: Option[Position], options: CompilationOptions, lenient: Boolean): List[Int] = {
|
||||
if (escSeq.length == 3 && (escSeq(0) == 'X' || escSeq(0) == 'x' || escSeq(0) == '$')){
|
||||
try {
|
||||
return List(Integer.parseInt(escSeq.tail, 16))
|
||||
var rawByte = Integer.parseInt(escSeq.tail, 16)
|
||||
if (escapeRawBytes) {
|
||||
rawByte = -1 - rawByte
|
||||
}
|
||||
return List(rawByte)
|
||||
} catch {
|
||||
case _: NumberFormatException =>
|
||||
}
|
||||
|
@ -112,18 +136,28 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
|
|||
val (escSeq, closingBrace) = tail.span(_ != '}')
|
||||
closingBrace match {
|
||||
case '}' :: xs =>
|
||||
encodeEscapeSequence(log, escSeq.mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
|
||||
encodeEscapeSequence(log, escSeq.map(_.toChar).mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
|
||||
case _ =>
|
||||
log.error(f"Unclosed escape sequence", position)
|
||||
Nil
|
||||
}
|
||||
case head :: tail =>
|
||||
Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
|
||||
optionalCharset match {
|
||||
case Some(charset) =>
|
||||
Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
|
||||
case None =>
|
||||
head :: encode(log, position, tail, options, lenient)
|
||||
}
|
||||
case Nil => Nil
|
||||
}
|
||||
}
|
||||
|
||||
def encodeDigit(digit: Int): List[Int] = digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
|
||||
def encodeDigit(digit: Int): List[Int] =
|
||||
optionalCharset match {
|
||||
case Some(charset) =>
|
||||
digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
|
||||
case None => List('0'.toInt + digit)
|
||||
}
|
||||
|
||||
override def decode(by: Int): Char = {
|
||||
if (by >= 0x20 && by <= 0x7E) by.toChar
|
||||
|
|
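The `escapeRawBytes` flag and `mapWithEscaping` in this file appear to cooperate: a raw byte written as `{xNN}` is stored as the negative value `-1 - byte`, so that a later mapping stage can recognise it and emit it as a single byte instead of widening it. A minimal standalone sketch of that round trip follows. `RawByteEscapeSketch` is a hypothetical name, and the sketch mirrors only the arithmetic visible in the diff.

```scala
object RawByteEscapeSketch {
  // Mark a raw byte (0..255) as "already final" by moving it into the negative range.
  def escape(rawByte: Int): Int = -1 - rawByte

  // Negative values are unwrapped to the original single byte;
  // non-negative values are treated as code points and passed to `map`.
  def mapWithEscaping(value: Int)(map: Int => List[Int]): List[Int] =
    if (value < 0) List(-1 - value) else map(value)
}
```

Because `-1 - (-1 - b) == b`, escaping and unwrapping are inverse operations, and the value 0 stays distinguishable from an escaped `{x00}`, which becomes -1.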
|
@ -160,6 +160,8 @@ class TextCodecRepository(val includePath: List[String]) {
|
|||
case "UTF-8" => Some(TextCodecRepository.Utf8)
|
||||
case "UTF-16LE" => Some(TextCodecRepository.Utf16Le)
|
||||
case "UTF-16BE" => Some(TextCodecRepository.Utf16Be)
|
||||
case "UTF-32LE" => Some(TextCodecRepository.Utf32Le)
|
||||
case "UTF-32BE" => Some(TextCodecRepository.Utf32Be)
|
||||
case _ =>
|
||||
log.error(s"Unknown built-in encoding $builtin for encoding $shortname")
|
||||
None
|
||||
|
@ -226,9 +228,21 @@ object TextCodecRepository {
|
|||
val ESCAPE: Regex = "\\A\\{([\\w.'\\p{L}]+)}\\z".r
|
||||
val CHAR: Regex = "\\A(\\S)\\z".r
|
||||
|
||||
val Utf8 = new UnicodeTextCodec("UTF-8", StandardCharsets.UTF_8, List(0))
|
||||
val Utf8 = new UnicodeTextCodec("UTF-8", Some(StandardCharsets.UTF_8), List(0))
|
||||
|
||||
val Utf16Be = new UnicodeTextCodec("UTF-16BE", StandardCharsets.UTF_16BE, List(0, 0))
|
||||
val Utf16Be = new UnicodeTextCodec("UTF-16BE", Some(StandardCharsets.UTF_16BE), List(0, 0))
|
||||
|
||||
val Utf16Le = new UnicodeTextCodec("UTF-16LE", StandardCharsets.UTF_16LE, List(0, 0))
|
||||
}
|
||||
val Utf16Le = new UnicodeTextCodec("UTF-16LE", Some(StandardCharsets.UTF_16LE), List(0, 0))
|
||||
|
||||
val RawUtf32 = new UnicodeTextCodec("UTF-32RAW", None, List(0))
|
||||
|
||||
private val RawEscapingUtf32 = new UnicodeTextCodec("UTF-32RAWEscaping", None, List(0), escapeRawBytes = true)
|
||||
|
||||
val Utf32Be: TextCodec = new MappedTextCodec("UTF-32BE", RawEscapingUtf32) {
|
||||
override def map(byte: Int): List[Int] = List((byte >>> 24) & 0xff, (byte >>> 16) & 0xff, (byte >>> 8) & 0xff, (byte >>> 0) & 0xff)
|
||||
}
|
||||
|
||||
val Utf32Le: TextCodec = new MappedTextCodec("UTF-32LE", RawEscapingUtf32) {
|
||||
override def map(byte: Int): List[Int] = List((byte >>> 0) & 0xff, (byte >>> 8) & 0xff, (byte >>> 16) & 0xff, (byte >>> 24) & 0xff)
|
||||
}
|
||||
}
|
||||
|
|
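The two `map` overrides above split each code point into four bytes in opposite orders. The sketch below reproduces that splitting outside Millfork, under the assumption that a zero-terminated string simply gets one extra code point 0 appended before mapping. `Utf32BytesSketch` and `encode` are hypothetical names used for illustration.

```scala
object Utf32BytesSketch {
  // Most significant byte first, as in the Utf32Be mapping.
  def bigEndian(value: Int): List[Int] =
    List(24, 16, 8, 0).map(shift => (value >>> shift) & 0xff)

  // Least significant byte first, as in the Utf32Le mapping.
  def littleEndian(value: Int): List[Int] = bigEndian(value).reverse

  // Encode a string code point by code point, optionally appending a zero terminator.
  def encode(s: String, toBytes: Int => List[Int], zeroTerminated: Boolean): List[Int] = {
    val codePoints = s.codePoints().toArray.toList
    val terminated = if (zeroTerminated) codePoints :+ 0 else codePoints
    terminated.flatMap(toBytes)
  }
}
```

With these definitions, `Utf32BytesSketch.encode("a", Utf32BytesSketch.bigEndian, zeroTerminated = true)` yields the eight bytes 0, 0, 0, 97, 0, 0, 0, 0, which is the layout for `"a"utf32bez` checked by the `UTF-32` test in this commit.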
|
@ -80,4 +80,29 @@ class TextCodecSuite extends FunSuite with Matchers {
|
|||
""".stripMargin)
|
||||
m.readByte(0xc000) should equal(13)
|
||||
}
|
||||
|
||||
test("UTF-32") {
|
||||
val m = EmuUnoptimizedRun(
|
||||
"""
|
||||
|pointer output @$c000
|
||||
|byte output2 @$c002
|
||||
|array test = "a"utf32bez
|
||||
| void main() {
|
||||
| output = test.addr
|
||||
| output2 = test.length
|
||||
| }
|
||||
""".stripMargin)
|
||||
m.readWord(0xc002) should equal(8)
|
||||
val addr = m.readWord(0xc000)
|
||||
m.readByte(addr + 0) should equal(0)
|
||||
m.readByte(addr + 1) should equal(0)
|
||||
m.readByte(addr + 2) should equal(0)
|
||||
m.readByte(addr + 3) should equal(97)
|
||||
m.readByte(addr + 4) should equal(0)
|
||||
m.readByte(addr + 5) should equal(0)
|
||||
m.readByte(addr + 6) should equal(0)
|
||||
m.readByte(addr + 7) should equal(0)
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
|