mirror of https://github.com/KarolS/millfork.git synced 2024-06-09 16:29:34 +00:00

Add several more encodings

This commit is contained in:
Karol Stasiak 2021-03-13 21:39:48 +01:00
parent 0bbdc348e7
commit 66fc1d3984
12 changed files with 234 additions and 15 deletions

View File

@ -16,7 +16,7 @@ No other lines are allowed in the file.
* `NAME=<name>` defines the name for this encoding. Required.
* `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
If this directive is present, the only other allowed directive in the file is the `NAME` directive.
* `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.

View File

@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
You can override the values for `nullchar` and `nullchar_scr`
by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively.
Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
Warning: If you define UTF-16 or UTF-32 to be your default or screen encoding, you will encounter several problems:
* `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
* the `string` module in the Millfork standard library will not work correctly
@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
You can also prepend `p` to the name of the encoding to make the string length-prefixed.
The length is measured in bytes and doesn't include the zero terminator, if present.
In all encodings except for UTF-16 the prefix takes one byte,
In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
which means that length-prefixed strings cannot be longer than 255 bytes.
In case of UTF-16, the length prefix contains the number of code units,
so the number of bytes divided by two,
which allows for strings of practically unlimited length.
The length is stores as two bytes and is always little endian,
The length is stored as two bytes and is always little endian,
even in case of the `utf16be` encoding or a big-endian processor.
In case of UTF-32, the length prefix contains the number of Unicode codepoints,
so the number of bytes divided by four.
The length is stored as four bytes and is always little endian,
even in case of the `utf32be` encoding or a big-endian processor.
"this is a Pascal string" pascii
"this is also a Pascal string"p
"this is a zero-terminated Pascal string"pz
Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
To pass it to a function that expects a zero-terminated string, add 1 (or 2 in case of UTF-16, or 4 in case of UTF-32):
pointer p
p = "test"pz
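
The length-prefix rules described in this hunk can be sketched in a few lines of Scala. This is only an illustration of the documented layout, not code from the Millfork compiler: the object `LengthPrefixSketch`, the function `lengthPrefix` and the parameter `bytesPerUnit` are hypothetical names, and the sketch assumes the payload is already encoded and excludes any zero terminator.

```scala
// Hypothetical helper, not part of Millfork: builds the length prefix
// for an already-encoded string, following the rules documented above.
object LengthPrefixSketch {
  // payload:      encoded string bytes, without the zero terminator
  // bytesPerUnit: 1 for byte-oriented encodings, 2 for UTF-16, 4 for UTF-32
  def lengthPrefix(payload: Seq[Int], bytesPerUnit: Int): List[Int] = {
    require(Set(1, 2, 4).contains(bytesPerUnit), "unsupported unit size")
    require(payload.length % bytesPerUnit == 0, "payload is not a whole number of units")
    // The prefix counts units: bytes, UTF-16 code units or UTF-32 codepoints.
    val units = payload.length / bytesPerUnit
    require(units.toLong < (1L << (8 * bytesPerUnit)), "string too long for its prefix")
    // The prefix is as wide as one unit and always little-endian,
    // regardless of the byte order of the encoding itself.
    List.tabulate(bytesPerUnit)(i => (units >>> (8 * i)) & 0xff)
  }
}
```

Under these assumptions the width of the returned prefix is also the offset to add to the pointer when passing a length-prefixed, zero-terminated string to a function that expects a plain zero-terminated one: 1, 2 or 4 bytes.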

View File

@ -26,6 +26,8 @@ TODO: document the file format.
* `oldpetscii` or `oldpet` old PETSCII (Commodore PET with newer ROMs)
* `geos_de` text encoding used by the German version of GEOS for C64
* `cbmscr` or `petscr` Commodore screencodes
* `cbmscrjp` or `petscrjp` Commodore screencodes as used on Japanese versions of Commodore 64
@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866
* `kamenicky` Kamenický encoding
* `cp1250`, `cp1251`, `cp1252` Windows codepages 1250, 1251, 1252
* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` Windows codepages 1250, 1251, 1252, 1253, 1254, 1257
* `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` MSX character encoding, International, Japanese, Russian and Brazilian respectively
@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively
* `utf16be`, `utf16le` UTF-16BE and UTF-16LE
* `utf32be`, `utf32le` UTF-32BE and UTF-32LE
When programming for Commodore,
use `petscii` for strings you're printing using standard I/O routines
and `petsciiscr` for strings you're copying to screen memory directly.
@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
* in the `zx80` encoding it's `{x01}`,
* in the `zx81` encoding it's `{x0b}`,
* in the `petscr` and `petscrjp` encodings it's `{xe0}`,
* in the `apple2e` encoding it's `{x7f}`,
* in the `atasciiscr` encoding it's `{xdb}`,
* in the `pokemon1*` encodings it's `{x50}`,
* in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
* in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
* in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
* in other encodings it's `{x00}` (this may be subject to change in future versions).
##### Available only in some encodings
@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
`petscr` | yes¹ | no | £ | none | yes¹
`petjp` | no | no | ¥ | katakana³ | yes³
`petscrjp` | no | no | ¥ | katakana³ | yes³
`geos_de` | yes | no | | | no
`sinclair`, `bbc` | yes | yes | £ | none | no
`zx80`, `zx81` | no | no | £ | none | no
`apple2` | no | yes | | none | no
@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
`origpet` | yes | no | no | yes | no | yes | no
`oldpet` | yes | no | no | yes | no | yes | no
`petscr`, `petscrjp`| no | no | no | no | no | no | no
`geos_de` | no | no | no | no | no | yes | no
`sinclair` | yes | yes | no | yes | yes | yes | yes
`zx80`,`zx81` | yes | no | yes | yes | no | no | no
`ascii`, `iso_*` | yes | yes | yes | no | no | no | no

View File

@ -0,0 +1,39 @@
NAME=CP1253
EOT=00
20=U+0020
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
80=€
82-87=‚ƒ„…†‡
89=‰
8b=‹
91-97=‘’“”•–—
99=™
9b=›
a1-ac=΅Ά£¤¥¦§¨©ͺ«¬
ae-af=®―
b0-bf=°±²³΄µ¶·ΈΉΊ»Ό½ΎΏ
c0-cf=ΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟ
d0-d1=ΠΡ
d3-df=ΣΤΥΦΧΨΩΪΫάέήί
e0-ef=ΰαβγδεζηθικλμνξο
f0-fe=πρςστυφχψωϊϋόύώ
{b}=08
{t}=09
{n}=0d0a
{q}=22
{apos}=27
{lbrace}=7b
{rbrace}=7d
{euro}=80
{pound}=a3
{yen}=a5
{copy}=a9
{pi}=f0
{nbsp}=A0
{shy}=AD

View File

@ -0,0 +1,34 @@
NAME=CP1254
EOT=00
20=U+0020
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
80=€
82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
91-9c=‘’“”•–—˜™š›œ
9f=Ÿ
a1-ac=¡¢£¤¥¦§¨©ª«¬
ae-af=®¯
b0-bf=°±²³´µ¶·¸¹º»¼½¾¿
c0-cf=ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
d0-df=ĞÑÒÓÔÕÖ×ØÙÚÛÜİŞß
e0-ef=àáâãäåæçèéêëìíîï
f0-ff=ğñòóôõö÷øùúûüışÿ
{b}=08
{t}=09
{n}=0d0a
{q}=22
{apos}=27
{lbrace}=7b
{rbrace}=7d
{euro}=80
{cent}=a2
{pound}=a3
{yen}=a5
{copy}=a9
{ss}=df
{nbsp}=A0
{shy}=AD

View File

@ -0,0 +1,36 @@
NAME=CP1257
EOT=00
20=U+0020
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
80=€
82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
8d-8f=¨ˇ¸
91-9c=‘’“”•–—˜™š›œ
9e-9f=¯˛
a1-ac=¡¢£¤¥¦§Ø©Ŗ«¬
ae-af=®Æ
b0-bf=°±²³´µ¶·ø¹ŗ»¼½¾æ
c0-cf=ĄĮĀĆÄÅĘĒČÉŹĖĢĶĪĻ
d0-df=ŠŃŅÓŌÕÖ×ŲŁŚŪÜŻŽß
e0-ef=ąįāćäåęēčéźėģķīļ
f0-ff=šńņóōõö÷ųłśūüżž˙
{b}=08
{t}=09
{n}=0d0a
{q}=22
{apos}=27
{lbrace}=7b
{rbrace}=7d
{euro}=80
{cent}=a2
{pound}=a3
{yen}=a5
{copy}=a9
{ss}=df
{nbsp}=A0
{shy}=AD

View File

@ -0,0 +1,20 @@
NAME=GEOS-DE
EOT=00
20=U+0020
21-3f=!"#$%&'()*+,-./0123456789:;<=>?
40-5f=§ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ^_
60-7e=`abcdefghijklmnopqrstuvwxyzäöüß
{b}=08
{t}=09
{n}=0d0a
{q}=22
{apos}=27
{AE}=5b
{OE}=5c
{UE}=5d
{ae}=7b
{oe}=7c
{ue}=7d
{ss}=7e

View File

@ -0,0 +1,2 @@
NAME=UTF-32BE
BUILTIN=UTF-32BE

View File

@ -0,0 +1,2 @@
NAME=UTF-32LE
BUILTIN=UTF-32LE

View File

@ -40,7 +40,27 @@ sealed trait TextCodec {
}
}
class UnicodeTextCodec(override val name: String, val charset: Charset, override val stringTerminator: List[Int]) extends TextCodec {
abstract class MappedTextCodec(override val name: String, inner: TextCodec) extends TextCodec {
override val supportsLowercase: Boolean = inner.supportsLowercase
override val stringTerminator: List[Int] = inner.stringTerminator.flatMap(this.mapWithEscaping)
override def encode(log: Logger, position: Option[Position], s: List[Int], options: CompilationOptions, lenient: Boolean): List[Int] =
inner.encode(log, position, s, options, lenient).flatMap(this.mapWithEscaping)
override def decode(by: Int): Char = TextCodec.NotAChar
override def encodeDigit(digit: Int): List[Int] = inner.encodeDigit(digit).flatMap(this.mapWithEscaping)
private def mapWithEscaping(byte: Int): List[Int] = {
if (byte < 0) List(-1 - byte)
else map(byte)
}
def map(byte: Int): List[Int]
}
class UnicodeTextCodec(override val name: String, val optionalCharset: Option[Charset], override val stringTerminator: List[Int], val escapeRawBytes: Boolean = false) extends TextCodec {
private val escapeSequences: Map[String, Char] = Map(
"n" -> '\n',
"r" -> '\r',
@ -66,7 +86,11 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
private def encodeEscapeSequence(log: Logger, escSeq: String, position: Option[Position], options: CompilationOptions, lenient: Boolean): List[Int] = {
if (escSeq.length == 3 && (escSeq(0) == 'X' || escSeq(0) == 'x' || escSeq(0) == '$')){
try {
return List(Integer.parseInt(escSeq.tail, 16))
var rawByte = Integer.parseInt(escSeq.tail, 16)
if (escapeRawBytes) {
rawByte = -1 - rawByte
}
return List(rawByte)
} catch {
case _: NumberFormatException =>
}
@ -112,18 +136,28 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
val (escSeq, closingBrace) = tail.span(_ != '}')
closingBrace match {
case '}' :: xs =>
encodeEscapeSequence(log, escSeq.mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
encodeEscapeSequence(log, escSeq.map(_.toChar).mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
case _ =>
log.error(f"Unclosed escape sequence", position)
Nil
}
case head :: tail =>
Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
optionalCharset match {
case Some(charset) =>
Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
case None =>
head :: encode(log, position, tail, options, lenient)
}
case Nil => Nil
}
}
def encodeDigit(digit: Int): List[Int] = digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
def encodeDigit(digit: Int): List[Int] =
optionalCharset match {
case Some(charset) =>
digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
case None => List('0'.toInt + digit)
}
override def decode(by: Int): Char = {
if (by >= 0x20 && by <= 0x7E) by.toChar

View File

@ -160,6 +160,8 @@ class TextCodecRepository(val includePath: List[String]) {
case "UTF-8" => Some(TextCodecRepository.Utf8)
case "UTF-16LE" => Some(TextCodecRepository.Utf16Le)
case "UTF-16BE" => Some(TextCodecRepository.Utf16Be)
case "UTF-32LE" => Some(TextCodecRepository.Utf32Le)
case "UTF-32BE" => Some(TextCodecRepository.Utf32Be)
case _ =>
log.error(s"Unknown built-in encoding $builtin for encoding $shortname")
None
@ -226,9 +228,21 @@ object TextCodecRepository {
val ESCAPE: Regex = "\\A\\{([\\w.'\\p{L}]+)}\\z".r
val CHAR: Regex = "\\A(\\S)\\z".r
val Utf8 = new UnicodeTextCodec("UTF-8", StandardCharsets.UTF_8, List(0))
val Utf8 = new UnicodeTextCodec("UTF-8", Some(StandardCharsets.UTF_8), List(0))
val Utf16Be = new UnicodeTextCodec("UTF-16BE", StandardCharsets.UTF_16BE, List(0, 0))
val Utf16Be = new UnicodeTextCodec("UTF-16BE", Some(StandardCharsets.UTF_16BE), List(0, 0))
val Utf16Le = new UnicodeTextCodec("UTF-16LE", StandardCharsets.UTF_16LE, List(0, 0))
}
val Utf16Le = new UnicodeTextCodec("UTF-16LE", Some(StandardCharsets.UTF_16LE), List(0, 0))
val RawUtf32 = new UnicodeTextCodec("UTF-32RAW", None, List(0))
private val RawEscapingUtf32 = new UnicodeTextCodec("UTF-32RAWEscaping", None, List(0), escapeRawBytes = true)
val Utf32Be: TextCodec = new MappedTextCodec("UTF-32BE", RawEscapingUtf32) {
override def map(byte: Int): List[Int] = List((byte >>> 24) & 0xff, (byte >>> 16) & 0xff, (byte >>> 8) & 0xff, (byte >>> 0) & 0xff)
}
val Utf32Le: TextCodec = new MappedTextCodec("UTF-32LE", RawEscapingUtf32) {
override def map(byte: Int): List[Int] = List((byte >>> 0) & 0xff, (byte >>> 8) & 0xff, (byte >>> 16) & 0xff, (byte >>> 24) & 0xff)
}
}
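
As this hunk reads, the UTF-32 codecs are built in two stages: a raw codec yields one integer per codepoint and marks raw `{xNN}` bytes by storing them as `-1 - byte`, and a mapping stage then expands every non-negative value into four bytes while unmarking the negative ones. The following standalone sketch mirrors that idea outside the compiler; `Utf32MappingSketch` and its members are hypothetical names chosen for illustration and do not appear in the commit.

```scala
// Standalone illustration of the escape-then-map approach; not Millfork code.
object Utf32MappingSketch {
  // Marks a raw byte so that the mapping stage leaves it as a single byte.
  def escapeRawByte(byte: Int): Int = -1 - byte

  // Expands one codepoint into four bytes, most significant byte first.
  def bigEndian(codepoint: Int): List[Int] =
    List(24, 16, 8, 0).map(shift => (codepoint >>> shift) & 0xff)

  // Expands one codepoint into four bytes, least significant byte first.
  def littleEndian(codepoint: Int): List[Int] =
    List(0, 8, 16, 24).map(shift => (codepoint >>> shift) & 0xff)

  // Negative values are escaped raw bytes and are emitted unchanged;
  // everything else is a codepoint and is expanded by `expand`.
  def mapWithEscaping(unit: Int, expand: Int => List[Int]): List[Int] =
    if (unit < 0) List(-1 - unit) else expand(unit)

  def encode(units: List[Int], expand: Int => List[Int]): List[Int] =
    units.flatMap(mapWithEscaping(_, expand))
}
```

For example, `Utf32MappingSketch.encode(List('a'.toInt), Utf32MappingSketch.bigEndian)` gives the same four bytes that the UTF-32 test in this commit checks for: three zero bytes followed by 97. An escaped raw byte passes through as one byte instead of four.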

View File

@ -80,4 +80,29 @@ class TextCodecSuite extends FunSuite with Matchers {
""".stripMargin)
m.readByte(0xc000) should equal(13)
}
test("UTF-32") {
val m = EmuUnoptimizedRun(
"""
|pointer output @$c000
|byte output2 @$c002
|array test = "a"utf32bez
| void main() {
| output = test.addr
| output2 = test.length
| }
""".stripMargin)
m.readWord(0xc002) should equal(8)
val addr = m.readWord(0xc000)
m.readByte(addr + 0) should equal(0)
m.readByte(addr + 1) should equal(0)
m.readByte(addr + 2) should equal(0)
m.readByte(addr + 3) should equal(97)
m.readByte(addr + 4) should equal(0)
m.readByte(addr + 5) should equal(0)
m.readByte(addr + 6) should equal(0)
m.readByte(addr + 7) should equal(0)
}
}