Add several more encodings

2024-06-25 19:29:49 +00:00 · 2021-03-13 21:39:48 +01:00
commit 66fc1d3984
parent 0bbdc348e7
12 changed files with 234 additions and 15 deletions
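
As a reading aid for the `literals.md` changes in this commit, here is a minimal Scala sketch of the byte layout the documentation describes for a length-prefixed `utf32be` string: a four-byte little-endian count of code points, then each code point as four big-endian bytes, then an optional four-byte zero terminator. The names `Utf32PascalSketch` and `layout` are hypothetical and not part of Millfork; this illustrates the documented layout under those assumptions and is not the compiler's implementation.

```scala
object Utf32PascalSketch {
  // Four bytes of a 32-bit value, least significant byte first.
  private def le32(value: Int): List[Int] =
    List(0, 8, 16, 24).map(shift => (value >>> shift) & 0xff)

  // Four bytes of a 32-bit value, most significant byte first.
  private def be32(value: Int): List[Int] = le32(value).reverse

  // Hypothetical helper: bytes of a "p"-prefixed utf32be string.
  // Per the docs, the prefix counts code points, excludes the terminator,
  // and is always little-endian, even though the payload is big-endian.
  def layout(text: String, zeroTerminated: Boolean): List[Int] = {
    val codePoints = text.codePoints().toArray.toList
    val prefix = le32(codePoints.length)
    val payload = codePoints.flatMap(be32)
    val terminator = if (zeroTerminated) List(0, 0, 0, 0) else Nil
    prefix ++ payload ++ terminator
  }
}
```

Under these assumptions, `Utf32PascalSketch.layout("ab", zeroTerminated = true)` yields the prefix `2, 0, 0, 0`, then `0, 0, 0, 97` and `0, 0, 0, 98`, then four zero bytes. The character data starts four bytes past the start of the string, which matches the documented advice to add 4 before passing such a string to a routine that expects a zero-terminated one.
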
--- a/docs/lang/custom-encoding.md
+++ b/docs/lang/custom-encoding.md
@@ -16,7 +16,7 @@ No other lines are allowed in the file.
 * `NAME=<name>` defines the name for this encoding. Required.

 * `BUILTIN=<internal name>` defines this encoding to be a UTF-based encoding.
-`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`.
+`<internal name>` may be one of `UTF-8`, `UTF-16LE`, `UTF-16BE`, `UTF-32LE`, `UTF-32BE`.
 If this directive is present, the only other allowed directive in the file is the `NAME` directive.

 * `EOT=<xx>` where `<xx>` are two hex digits, defines the string terminator byte.
--- a/docs/lang/literals.md
+++ b/docs/lang/literals.md
@@ -65,7 +65,7 @@ and the byte constant `nullchar_scr` is defined to be equal to the string termin
 You can override the values for `nullchar` and `nullchar_scr`
 by defining preprocessor features `NULLCHAR` and `NULLCHAR_SCR` respectively. 

-Warning: If you define UTF-16 to be you default or screen encoding, you will encounter several problems:
+Warning: If you define UTF-16 or UTF-32 to be your default or screen encoding, you will encounter several problems:

 * `nullchar` and `nullchar_scr` will still be bytes, equal to zero.
 * the `string` module in the Millfork standard library will not work correctly
@@ -75,21 +75,26 @@ Warning: If you define UTF-16 to be you default or screen encoding, you will enc
 You can also prepend `p` to the name of the encoding to make the string length-prefixed.

 The length is measured in bytes and doesn't include the zero terminator, if present.
-In all encodings except for UTF-16 the prefix takes one byte,
+In all encodings except for UTF-16 and UTF-32 the prefix takes one byte,
 which means that length-prefixed strings cannot be longer than 255 bytes.
 
 In case of UTF-16, the length prefix contains the number of code units,
 so the number of bytes divided by two,
 which allows for strings of practically unlimited length.
-The length is stores as two bytes and is always little endian,
+The length is stored as two bytes and is always little endian,
 even in case of the `utf16be` encoding or a big-endian processor.
+ 
+In case of UTF-32, the length prefix contains the number of Unicode codepoints,
+so the number of bytes divided by four.
+The length is stored as four bytes and is always little endian,
+even in case of the `utf32be` encoding or a big-endian processor.

        "this is a Pascal string" pascii
        "this is also a Pascal string"p
        "this is a zero-terminated Pascal string"pz

 Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string!
-To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2):
+To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):

    pointer p
    p = "test"pz
--- a/docs/lang/text.md
+++ b/docs/lang/text.md
@@ -26,6 +26,8 @@ TODO: document the file format.

 * `oldpetscii` or `oldpet` – old PETSCII (Commodore PET with newer ROMs)

+* `geos_de` – text encoding used by the German version of GEOS for C64
+
 * `cbmscr` or `petscr` – Commodore screencodes

 * `cbmscrjp` or `petscrjp` – Commodore screencodes as used on Japanese versions of Commodore 64
@@ -89,7 +91,7 @@ DOS codepages 437, 850, 851, 852, 855, 858, 866

 * `kamenicky` – Kamenický encoding

-* `cp1250`, `cp1251`, `cp1252` – Windows codepages 1250, 1251, 1252
+* `cp1250`, `cp1251`, `cp1252`, `cp1253`, `cp1254`, `cp1257` – Windows codepages 1250, 1251, 1252, 1253, 1254, 1257

 * `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively

@@ -132,6 +134,8 @@ English, Japanese, Spanish/Italian and French/German respectively

 * `utf16be`, `utf16le` – UTF-16BE and UTF-16LE

+* `utf32be`, `utf32le` – UTF-32BE and UTF-32LE
+
 When programming for Commodore,
 use `petscii` for strings you're printing using standard I/O routines
 and `petsciiscr` for strings you're copying to screen memory directly.
@@ -163,10 +167,12 @@ The exact value of `{nullchar}` is encoding-dependent:
    * in the `zx80` encoding it's `{x01}`,
    * in the `zx81` encoding it's `{x0b}`,
    * in the `petscr` and `petscrjp` encodings it's `{xe0}`,
+    * in the `apple2e` encoding it's `{x7f}`,
    * in the `atasciiscr` encoding it's `{xdb}`,
    * in the `pokemon1*` encodings it's `{x50}`,
    * in the `cocoscr` encoding it's exceptionally two bytes: `{xd0}`
    * in the `utf16be` and `utf16le` encodings it's exceptionally two bytes: `{x00}{x00}`
+    * in the `utf32be` and `utf32le` encodings it's exceptionally four bytes: `{x00}{x00}{x00}{x00}`
    * in other encodings it's `{x00}` (this may be a subject to change in future versions).

 ##### Available only in some encodings
@@ -211,6 +217,7 @@ Encoding | lowercase letters | backslash | currencies | intl | card suits
 `petscr`            | yes¹ | no  | £    | none      | yes¹  
 `petjp`             | no   | no  | ¥    | katakana³ | yes³  
 `petscrjp`          | no   | no  | ¥    | katakana³ | yes³  
+`geos_de`           | yes  | no  |      |           | no  
 `sinclair`, `bbc`   | yes  | yes | £    | none      | no  
 `zx80`, `zx81`      | no   | no  | £    | none      | no  
 `apple2`            | no   | yes |      | none      | no  
@@ -273,6 +280,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
 `origpet`           | yes | no  | no  | yes | no  | yes | no  
 `oldpet`            | yes | no  | no  | yes | no  | yes | no  
 `petscr`, `petscrjp`| no  | no  | no  | no  | no  | no  | no  
+`geos_de`           | no  | no  | no  | no  | no  | yes | no
 `sinclair`          | yes | yes | no  | yes | yes | yes | yes  
 `zx80`,`zx81`       | yes | no  | yes | yes | no  | no  | no  
 `ascii`, `iso_*`    | yes | yes | yes | no  | no  | no  | no  
--- a/include/encoding/cp1253.tbl
+++ b/include/encoding/cp1253.tbl
@@ -0,0 +1,39 @@
+NAME=CP1253
+EOT=00
+
+20=U+0020
+21-3f=!"#$%&'()*+,-./0123456789:;<=>?
+40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
+60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
+80=€
+82-87=‚ƒ„…†‡
+89=‰
+8b=‹
+91-97=‘’“”•–—
+99=™
+9b=›
+a1-ac=΅Ά£¤¥¦§¨©ͺ«¬
+ae-af=®―
+b0-bf=°±²³΄µ¶·ΈΉΊ»Ό½ΎΏ
+c0-cf=ΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟ
+d0-d1=ΠΡ
+d3-df=ΣΤΥΦΧΨΩΪΫάέήί
+e0-ef=ΰαβγδεζηθικλμνξο
+f0-fe=πρςστυφχψωϊϋόύώ
+
+
+
+{b}=08
+{t}=09
+{n}=0d0a
+{q}=22
+{apos}=27
+{lbrace}=7b
+{rbrace}=7d
+{euro}=80
+{pound}=a3
+{yen}=a5
+{copy}=a9
+{pi}=f0
+{nbsp}=A0
+{shy}=AD
--- a/include/encoding/cp1254.tbl
+++ b/include/encoding/cp1254.tbl
@@ -0,0 +1,34 @@
+NAME=CP1254
+EOT=00
+
+20=U+0020
+21-3f=!"#$%&'()*+,-./0123456789:;<=>?
+40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
+60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
+80=€
+82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
+91-9c=‘’“”•–—˜™š›œ
+9f=Ÿ
+a1-ac=¡¢£¤¥¦§¨©ª«¬
+ae-af=®¯
+b0-bf=°±²³´µ¶·¸¹º»¼½¾¿
+c0-cf=ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
+d0-df=ĞÑÒÓÔÕÖ×ØÙÚÛÜİŞß
+e0-ef=àáâãäåæçèéêëìíîï
+f0-ff=ğñòóôõö÷øùúûüışÿ
+
+{b}=08
+{t}=09
+{n}=0d0a
+{q}=22
+{apos}=27
+{lbrace}=7b
+{rbrace}=7d
+{euro}=80
+{cent}=a2
+{pound}=a3
+{yen}=a5
+{copy}=a9
+{ss}=df
+{nbsp}=A0
+{shy}=AD
--- a/include/encoding/cp1257.tbl
+++ b/include/encoding/cp1257.tbl
@@ -0,0 +1,36 @@
+NAME=CP1257
+EOT=00
+
+20=U+0020
+21-3f=!"#$%&'()*+,-./0123456789:;<=>?
+40-5f=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
+60-7e=`abcdefghijklmnopqrstuvwxyz{|}~
+80=€
+82-8c=‚ƒ„…†‡ˆ‰Š‹Œ
+8d-8f=¨ˇ¸
+91-9c=‘’“”•–—˜™š›œ
+9e-9f=¯˛
+a1-ac=¡¢£¤¥¦§Ø©Ŗ«¬
+ae-af=®Æ
+b0-bf=°±²³´µ¶·ø¹ŗ»¼½¾æ
+c0-cf=ĄĮĀĆÄÅĘĒČÉŹĖĢĶĪĻ
+d0-df=ŠŃŅÓŌÕÖ×ŲŁŚŪÜŻŽß
+e0-ef=ąįāćäåęēčéźėģķīļ
+f0-ff=šńņóōõö÷ųłśūüżž˙
+
+
+{b}=08
+{t}=09
+{n}=0d0a
+{q}=22
+{apos}=27
+{lbrace}=7b
+{rbrace}=7d
+{euro}=80
+{cent}=a2
+{pound}=a3
+{yen}=a5
+{copy}=a9
+{ss}=df
+{nbsp}=A0
+{shy}=AD
--- a/include/encoding/geos_de.tbl
+++ b/include/encoding/geos_de.tbl
@@ -0,0 +1,20 @@
+NAME=GEOS-DE
+EOT=00
+
+20=U+0020
+21-3f=!"#$%&'()*+,-./0123456789:;<=>?
+40-5f=§ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ^_
+60-7e=`abcdefghijklmnopqrstuvwxyzäöüß
+
+{b}=08
+{t}=09
+{n}=0d0a
+{q}=22
+{apos}=27
+{AE}=5b
+{OE}=5c
+{UE}=5d
+{ae}=7b
+{oe}=7c
+{ue}=7d
+{ss}=7e
--- a/include/encoding/utf32be.tbl
+++ b/include/encoding/utf32be.tbl
@@ -0,0 +1,2 @@
+NAME=UTF-32BE
+BUILTIN=UTF-32BE
--- a/include/encoding/utf32le.tbl
+++ b/include/encoding/utf32le.tbl
@@ -0,0 +1,2 @@
+NAME=UTF-32LE
+BUILTIN=UTF-32LE
--- a/src/main/scala/millfork/parser/TextCodec.scala
+++ b/src/main/scala/millfork/parser/TextCodec.scala
@@ -40,7 +40,27 @@ sealed trait TextCodec {
  }
 }

-class UnicodeTextCodec(override val name: String, val charset: Charset, override val stringTerminator: List[Int]) extends TextCodec {
+abstract class MappedTextCodec(override val name: String, inner: TextCodec) extends TextCodec {
+  override val supportsLowercase: Boolean = inner.supportsLowercase
+
+  override val stringTerminator: List[Int] = inner.stringTerminator.flatMap(this.mapWithEscaping)
+
+  override def encode(log: Logger, position: Option[Position], s: List[Int], options: CompilationOptions, lenient: Boolean): List[Int] =
+    inner.encode(log, position, s, options, lenient).flatMap(this.mapWithEscaping)
+
+  override def decode(by: Int): Char = TextCodec.NotAChar
+
+  override def encodeDigit(digit: Int): List[Int] = inner.encodeDigit(digit).flatMap(this.mapWithEscaping)
+
+  private def mapWithEscaping(byte: Int): List[Int] = {
+    if (byte < 0) List(-1 - byte)
+    else map(byte)
+  }
+
+  def map(byte: Int): List[Int]
+}
+
+class UnicodeTextCodec(override val name: String, val optionalCharset: Option[Charset], override val stringTerminator: List[Int], val escapeRawBytes: Boolean = false) extends TextCodec {
  private val escapeSequences: Map[String, Char] = Map(
    "n" -> '\n',
    "r" -> '\r',
@@ -66,7 +86,11 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
  private def encodeEscapeSequence(log: Logger, escSeq: String, position: Option[Position], options: CompilationOptions, lenient: Boolean): List[Int] = {
    if (escSeq.length == 3 && (escSeq(0) == 'X' || escSeq(0) == 'x' || escSeq(0) == '$')){
      try {
-        return List(Integer.parseInt(escSeq.tail, 16))
+        var rawByte = Integer.parseInt(escSeq.tail, 16)
+        if (escapeRawBytes) {
+          rawByte = -1 - rawByte
+        }
+        return List(rawByte)
      } catch {
        case _: NumberFormatException =>
      }
@@ -112,18 +136,28 @@ class UnicodeTextCodec(override val name: String, val charset: Charset, override
        val (escSeq, closingBrace) = tail.span(_ != '}')
        closingBrace match {
          case '}' :: xs =>
-            encodeEscapeSequence(log, escSeq.mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
+            encodeEscapeSequence(log, escSeq.map(_.toChar).mkString(""), position, options, lenient) ++ encode(log, position, xs, options, lenient)
          case _ =>
            log.error(f"Unclosed escape sequence", position)
            Nil
        }
      case head :: tail =>
-        Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
+        optionalCharset match {
+          case Some(charset) =>
+            Character.toChars(head).mkString("").getBytes(charset).map(_.&(0xff)).toList ++ encode(log, position, tail, options, lenient)
+          case None =>
+            head :: encode(log, position, tail, options, lenient)
+        }
      case Nil => Nil
    }
  }

-  def encodeDigit(digit: Int): List[Int] =  digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
+  def encodeDigit(digit: Int): List[Int] =
+    optionalCharset match {
+      case Some(charset) =>
+        digit.toString.getBytes(charset).map(_.toInt.&(0xff)).toList
+      case None => List('0'.toInt + digit)
+    }

  override def decode(by: Int): Char = {
    if (by >= 0x20 && by <= 0x7E) by.toChar
--- a/src/main/scala/millfork/parser/TextCodecRepository.scala
+++ b/src/main/scala/millfork/parser/TextCodecRepository.scala
@@ -160,6 +160,8 @@ class TextCodecRepository(val includePath: List[String]) {
      case "UTF-8" => Some(TextCodecRepository.Utf8)
      case "UTF-16LE" => Some(TextCodecRepository.Utf16Le)
      case "UTF-16BE" => Some(TextCodecRepository.Utf16Be)
+      case "UTF-32LE" => Some(TextCodecRepository.Utf32Le)
+      case "UTF-32BE" => Some(TextCodecRepository.Utf32Be)
      case _ =>
        log.error(s"Unknown built-in encoding $builtin for encoding $shortname")
        None
@@ -226,9 +228,21 @@ object TextCodecRepository {
  val ESCAPE: Regex = "\\A\\{([\\w.'\\p{L}]+)}\\z".r
  val CHAR: Regex = "\\A(\\S)\\z".r

-  val Utf8 = new UnicodeTextCodec("UTF-8", StandardCharsets.UTF_8, List(0))
+  val Utf8 = new UnicodeTextCodec("UTF-8", Some(StandardCharsets.UTF_8), List(0))

-  val Utf16Be = new UnicodeTextCodec("UTF-16BE", StandardCharsets.UTF_16BE, List(0, 0))
+  val Utf16Be = new UnicodeTextCodec("UTF-16BE", Some(StandardCharsets.UTF_16BE), List(0, 0))

-  val Utf16Le = new UnicodeTextCodec("UTF-16LE", StandardCharsets.UTF_16LE, List(0, 0))
-}
+  val Utf16Le = new UnicodeTextCodec("UTF-16LE", Some(StandardCharsets.UTF_16LE), List(0, 0))
+
+  val RawUtf32 = new UnicodeTextCodec("UTF-32RAW", None, List(0))
+
+  private val RawEscapingUtf32 = new UnicodeTextCodec("UTF-32RAWEscaping", None, List(0), escapeRawBytes = true)
+
+  val Utf32Be: TextCodec = new MappedTextCodec("UTF-32BE", RawEscapingUtf32) {
+    override def map(byte: Int): List[Int] = List((byte >>> 24) & 0xff, (byte >>> 16) & 0xff, (byte >>> 8) & 0xff, (byte >>> 0) & 0xff)
+  }
+
+  val Utf32Le: TextCodec = new MappedTextCodec("UTF-32LE", RawEscapingUtf32) {
+    override def map(byte: Int): List[Int] = List((byte >>> 0) & 0xff, (byte >>> 8) & 0xff, (byte >>> 16) & 0xff, (byte >>> 24) & 0xff)
+  }
+}
--- a/src/test/scala/millfork/test/TextCodecSuite.scala
+++ b/src/test/scala/millfork/test/TextCodecSuite.scala
@@ -80,4 +80,29 @@ class TextCodecSuite extends FunSuite with Matchers {
      """.stripMargin)
    m.readByte(0xc000) should equal(13)
  }
+
+  test("UTF-32") {
+    val m = EmuUnoptimizedRun(
+      """
+        |pointer output @$c000
+        |byte output2 @$c002
+        |array test = "a"utf32bez
+        | void main() {
+        |   output = test.addr
+        |   output2 = test.length
+        | }
+      """.stripMargin)
+    m.readWord(0xc002) should equal(8)
+    val addr = m.readWord(0xc000)
+    m.readByte(addr + 0) should equal(0)
+    m.readByte(addr + 1) should equal(0)
+    m.readByte(addr + 2) should equal(0)
+    m.readByte(addr + 3) should equal(97)
+    m.readByte(addr + 4) should equal(0)
+    m.readByte(addr + 5) should equal(0)
+    m.readByte(addr + 6) should equal(0)
+    m.readByte(addr + 7) should equal(0)
+  }
+
+
 }
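
The byte expansion performed by `MappedTextCodec` and the two UTF-32 codecs can be illustrated outside the compiler. The following is a simplified standalone sketch, not the code from this commit: `Utf32Sketch` and its methods are hypothetical names. It models only the two ideas visible in the diff: each non-negative unit is widened to four bytes in the chosen order, and a negative unit `-1 - b` stands for a raw byte `b` (from an `{xNN}` escape) that is emitted unchanged.

```scala
object Utf32Sketch {
  // Expand one code point into four bytes, most significant byte first.
  def bigEndian(codePoint: Int): List[Int] =
    List(24, 16, 8, 0).map(shift => (codePoint >>> shift) & 0xff)

  // Same four bytes, least significant byte first.
  def littleEndian(codePoint: Int): List[Int] = bigEndian(codePoint).reverse

  // Negative units are escaped raw bytes stored as -1 - byte;
  // they are decoded back to a single byte instead of being widened.
  def expand(units: List[Int], order: Int => List[Int]): List[Int] =
    units.flatMap { unit =>
      if (unit < 0) List(-1 - unit) else order(unit)
    }
}
```

For example, `Utf32Sketch.expand(List('a'.toInt, -1 - 0x41), Utf32Sketch.bigEndian)` produces `0, 0, 0, 97, 65`: the letter is widened to four bytes, while the escaped raw byte stays a single byte. The first four bytes agree with what the `UTF-32` test above expects for `"a"utf32bez`.
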