Add KOI-7 N2 and MSX-BR encodings. Some encoding enhancements.

2024-06-25 19:29:49 +00:00 · 2019-09-18 00:09:37 +02:00 · 2019-09-18 00:09:37 +02:00 · c45cf7d51d
commit c45cf7d51d
parent 08ef0beeb7
3 changed files with 151 additions and 37 deletions
--- a/docs/lang/literals.md
+++ b/docs/lang/literals.md
@@ -76,6 +76,8 @@ For example, if `-flenient-encoding` is enabled, then a literal `"£¥↑ž©ß"

 * `"?¥^z(C)ss"` if the default encoding is `jisx`

+* `"£¥^z(C)β"` if the default encoding is `msx_intl`
+
 Note that the final length of the string may vary.

 ## Character literals
--- a/docs/lang/text.md
+++ b/docs/lang/text.md
@@ -35,7 +35,7 @@
 
 * `iso_dk`, `iso_fi` – aliases for `iso_no` and `iso_se` respectively

-* `msx_intl`, `msx_jp`, `msx_ru` – MSX character encoding, International, Japanese and Russian respectively
+* `msx_intl`, `msx_jp`, `msx_ru`, `msx_br` – MSX character encoding, International, Japanese, Russian and Brazilian respectively

 * `msx_us`, `msx_uk`, `msx_fr`, `msx_de` – aliases for `msx_intl`
 
@@ -43,6 +43,8 @@
 
 * `atasciiscr` or `atariscr` – screencodes used by Atari 8-bit computers

+* `koi7n2` or `short_koi` – KOI-7 N2
+
 * `vectrex` – built-in Vectrex font

 When programming for Commodore,
@@ -93,23 +95,24 @@ control codes for changing the text background color

 ##### Character availability

-Encoding | lowercase letters | backslash | pound | yen | katakana | card suits  
--|--|--|--|--|--|--  
-`pet`, `origpet`   | yes¹ | no  | no  | no   | no   | yes¹  
-`oldpet`           | yes² | no  | no  | no   | no   | yes²  
-`petscr`           | yes¹ | no  | yes | no   | no   | yes¹  
-`petjp`            | no   | no  | no  | yes  | yes³ | yes³  
-`petscrjp`         | no   | no  | no  | yes  | yes³ | yes³  
-`sinclair`, `bbc`  | yes  | yes | yes | no   | no   | no  
-`apple2`           | no   | yes | no  | no   | no   | no  
-`atascii`          | yes  | yes | no  | no   | no   | yes  
-`atasciiscr`       | yes  | yes | no  | no   | no   | yes  
-`jis`              | yes  | no  | no  | yes  | yes  | no  
-`msx_intl`         | yes  | yes | yes | yes  | no   | yes   
-`msx_jp`           | yes  | no  | no  | yes  | yes  | yes   
-`msx_ru`           | yes  | yes | no  | no   | no   | yes   
-`vectrex`          | no   | yes | no  | no   | no   | no   
-all the rest       | yes  | yes | no  | no   | no   | no  
+Encoding | lowercase letters | backslash | pound | yen | intl | card suits  
+---------|-------------------|-----------|-------|-----|------|-----------  
+`pet`, `origpet`    | yes¹ | no  | no  | no   | none      | yes¹  
+`oldpet`            | yes² | no  | no  | no   | none      | yes²  
+`petscr`            | yes¹ | no  | yes | no   | none      | yes¹  
+`petjp`             | no   | no  | no  | yes  | katakana³ | yes³  
+`petscrjp`          | no   | no  | no  | yes  | katakana³ | yes³  
+`sinclair`, `bbc`   | yes  | yes | yes | no   | none      | no  
+`apple2`            | no   | yes | no  | no   | none      | no  
+`atascii`           | yes  | yes | no  | no   | none      | yes  
+`atasciiscr`        | yes  | yes | no  | no   | none      | yes  
+`jis`               | yes  | no  | no  | yes  | both kana | no  
+`msx_intl`,`msx_br` | yes  | yes | yes | yes  | Western   | yes   
+`msx_jp`            | yes  | no  | no  | yes  | katakana  | yes   
+`msx_ru`            | yes  | yes | no  | no   | Russian⁴   | yes   
+`koi7n2`            | no   | yes | no  | no   | Russian⁵  | no   
+`vectrex`           | no   | yes | no  | no   | none      | no   
+all the rest        | yes  | yes | no  | no   | none      | no  
  
 1. `pet`, `origpet` and `petscr` cannot display card suit symbols and lowercase letters at the same time.
 Card suit symbols are only available in graphics mode,
@@ -118,16 +121,23 @@ in which lowercase letters are displayed as uppercase and uppercase letters are
 2.  `oldpet` cannot display card suit symbols and lowercase letters at the same time.
 Card suit symbols are only available in graphics mode, in which lowercase letters are displayed as symbols. 

-3. `petjp` and `petscrjp` cannot display card suit symbols and katakana at the same time
+3. `petjp` and `petscrjp` cannot display card suit symbols and katakana at the same time.
 Card suit symbols are only available in graphics mode, in which katakana is displayed as symbols. 

-If the encoding does not support lowercase letters (e.g. `apple2`, `petjp`, `petscrjp`),
+4. Letter **Ё** and uppercase **Ъ** are not available.
+
+5. Only uppercase. Letters **Ё** and **Ъ** are not available.
+
+If the encoding does not support lowercase letters (e.g. `apple2`, `petjp`, `petscrjp`, `koi7n2`, `vectrex`),
 then text and character literals containing lowercase letters are automatically converted to uppercase. 
+Only unaccented Latin and Cyrillic letters will be converted as such.
+Accented Latin letters will not be converted and will fail to compile without `-flenient-encoding`.     
+To detect if your default encoding does not support lowercase letters, test `'A' == 'a'`.

 ##### Escape sequence availability

 Encoding | new line | braces | backspace | cursor movement | text colour | reverse | background colour  
--|--|--|--|--|--|--|--  
+---------|----------|--------|-----------|-----------------|-------------|---------|------------------  
 `pet`,`petjp`       | yes | no  | no  | yes | yes | yes | no  
 `origpet`           | yes | no  | no  | yes | no  | yes | no  
 `oldpet`            | yes | no  | no  | yes | no  | yes | no  
@@ -137,5 +147,7 @@ Encoding | new line | braces | backspace | cursor movement | text colour | rever
 `apple2`            | no  | yes | no  | no  | no  | no  | no  
 `atascii`           | yes | no  | yes | yes | no  | no  | no  
 `atasciiscr`        | no  | no  | no  | no  | no  | no  | no  
+`msx_*`             | yes | yes | yes | yes | no  | no  | no  
+`koi7n2`            | yes | no  | yes | no  | no  | no  | no  
 `vectrex`           | no  | no  | no  | no  | no  | no  | no  
 all the rest        | yes | yes | no  | no  | no  | no  | no
--- a/src/main/scala/millfork/parser/TextCodec.scala
+++ b/src/main/scala/millfork/parser/TextCodec.scala
@@ -181,7 +181,10 @@ object TextCodec {
      case (_, "msx_es") => TextCodec.MsxWest
      case (_, "msx_ru") => TextCodec.MsxRu
      case (_, "msx_jp") => TextCodec.MsxJp
+      case (_, "msx_br") => TextCodec.MsxBr
      case (_, "vectrex") => TextCodec.Vectrex
+      case (_, "koi7n2") => TextCodec.Koi7N2
+      case (_, "short_koi") => TextCodec.Koi7N2
      case (p, _) =>
        log.error(s"Unknown string encoding: `$name`", p)
        TextCodec.Ascii
@@ -441,6 +444,25 @@ object TextCodec {
    )
  )

+  val Koi7N2 = new TextCodec("KOI-7 N2", 0,
+    "\ufffd" * 32 +
+      " !\"#¤%&'()*+,-./" +
+      "0123456789:;<=>?" +
+      "@ABCDEFGHIJKLMNO" +
+      "PQRSTUVWXYZ[\\]^_" +
+      "ЮАБЦДЕФГХИЙКЛМНО" +
+      "ПЯРСТУЖВЬЫЗШЭЩЧ",
+    Map('↑' -> 0x5E, '$' -> 0x24) ++
+      ('a' to 'z').map(l => l -> l.toUpper.toInt).toMap ++
+      ('а' to 'я').filter(_ != 'ъ').map(l => l -> l.toUpper.toInt).toMap,
+    Map.empty, Map(
+      "n" -> List(13), // TODO: ?
+      "b" -> List(8), // TODO: ?
+      "q" -> List('\"'.toInt),
+      "apos" -> List('\''.toInt)
+    )
+  )
+
  val OldPetscii = new TextCodec("Old PETSCII", 0,
    "\ufffd" * 32 +
      0x20.to(0x3f).map(_.toChar).mkString +
@@ -600,15 +622,46 @@ object TextCodec {
      "Δ\ufffdω\ufffd\ufffd\ufffd\ufffd\ufffd" +
      "αβΓΠΣσµγΦθΩδ∞∅∈∩" +
      "≡±≥≤\ufffd\ufffd÷\ufffd\ufffd\ufffd\ufffd\ufffdⁿ²",
-    Map('ß' -> 0xE1, '¦' -> 0x7C),
+    Map('ß' -> 0xE1, '¦' -> 0x7C, 'Ő' -> 0xB4, 'ő' -> 0xB5, 'Ű' -> 0xB6, 'ű' -> 0xB7),
    Map('♥' -> "\u0001C", '♡' -> "\u0001C", '♢' -> "\u0001D", '♢' -> "\u0001D", '♣' -> "\u0001E", '♠' -> "\u0001F", '·' -> "\u0001G") ,
    MinimalEscapeSequencesWithBraces ++ Map(
+      "right" -> List(0x1c),
+      "left" -> List(0x1d),
+      "up" -> List(0x1e),
+      "down" -> List(0x1f),
+      "b" -> List(8),
      "n" -> List(13, 10),
      "pound" -> List(0x9c),
      "yen" -> List(0x9d),
    )
  )

+  val MsxBr = new TextCodec("MSX-BR", 0,
+    "\ufffd" * 32 +
+      (0x20 to 0x7e).map(_.toChar).mkString("") +
+      "\ufffd" +
+      "ÇüéâÁà\ufffdçêÍÓÚÂÊÔÀ" +
+      "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ" +
+      "áíóúñÑªº¿⌐¬½¼¡«»" +
+      "ÃãĨĩÕõŨũĲĳ¾\ufffd\ufffd‰¶§" +
+      "\ufffd" * 24 +
+      "Δ\ufffdω\ufffd\ufffd\ufffd\ufffd\ufffd" +
+      "αβΓΠΣσµγΦθΩδ∞∅∈∩" +
+      "≡±≥≤\ufffd\ufffd÷\ufffd\ufffd\ufffd\ufffd\ufffdⁿ²",
+    Map('ß' -> 0xE1, '¦' -> 0x7C, 'Ő' -> 0xB4, 'ő' -> 0xB5, 'Ű' -> 0xB6, 'ű' -> 0xB7),
+    Map('♥' -> "\u0001C", '♡' -> "\u0001C", '♢' -> "\u0001D", '♢' -> "\u0001D", '♣' -> "\u0001E", '♠' -> "\u0001F", '·' -> "\u0001G") ,
+    MinimalEscapeSequencesWithBraces ++ Map(
+      "right" -> List(0x1c),
+      "left" -> List(0x1d),
+      "up" -> List(0x1e),
+      "down" -> List(0x1f),
+      "n" -> List(13, 10),
+      "b" -> List(8),
+      "pound" -> List(0x9c),
+      "yen" -> List(0x9d),
+    )
+  )
+
  val MsxRu = new TextCodec("MSX-RU", 0,
    "\ufffd" * 32 +
      (0x20 to 0x7e).map(_.toChar).mkString("") +
@@ -617,12 +670,19 @@ object TextCodec {
      "\ufffd" * 8 +
      "Δ\ufffdω\ufffd\ufffd\ufffd\ufffd\ufffd" +
      "αβΓΠΣσµγΦθΩδ∞∅∈∩" +
-      "≡±≥≤\ufffd\ufffd÷\ufffd\ufffd\ufffd\ufffd\ufffdⁿ²\ufffd\ufffd" +
+      "≡±≥≤\ufffd\ufffd÷\ufffd\ufffd\ufffd\ufffd\ufffdⁿ²\ufffd¤" +
      "юабцдефгхийклмнопярстужвьызшэщчъ" +
      "ЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩ",
    Map('ß' -> 0xA1, '¦' -> 0x7C),
    Map('♥' -> "\u0001C", '♡' -> "\u0001C", '♢' -> "\u0001D", '♢' -> "\u0001D", '♣' -> "\u0001E", '♠' -> "\u0001F", '·' -> "\u0001G"),
-    MinimalEscapeSequencesWithBraces + ("n" -> List(13, 10))
+    MinimalEscapeSequencesWithBraces ++ Map(
+      "right" -> List(0x1c),
+      "left" -> List(0x1d),
+      "up" -> List(0x1e),
+      "down" -> List(0x1f),
+      "b" -> List(8),
+      "n" -> List(13, 10)
+    )
  )

  val MsxJp = new TextCodec("MSX-JP", 0,
@@ -660,6 +720,11 @@ object TextCodec {
    ) ++
      StandardHiraganaDecompositions ++ StandardKatakanaDecompositions,
    MinimalEscapeSequencesWithBraces ++ Map(
+      "right" -> List(0x1c),
+      "left" -> List(0x1d),
+      "up" -> List(0x1e),
+      "down" -> List(0x1f),
+      "b" -> List(8),
      "n" -> List(13, 10),
      "yen" -> List(0x5c)
    )
@@ -668,19 +733,33 @@ object TextCodec {
  val lossyAlternatives: Map[Char, List[String]] = {
    val allowLowercase: Map[Char, List[String]] = ('A' to 'Z').map(c => c -> List(c.toString.toLowerCase(Locale.ROOT))).toMap
    val allowUppercase: Map[Char, List[String]] = ('a' to 'z').map(c => c -> List(c.toString.toUpperCase(Locale.ROOT))).toMap
+    val allowLowercaseCyr: Map[Char, List[String]] = ('А' to 'Я').map(c => c -> List(c.toString.toLowerCase(Locale.ROOT))).toMap
+    val allowUppercaseCyr: Map[Char, List[String]] = ('а' to 'я').map(c => c -> List(c.toString.toUpperCase(Locale.ROOT))).toMap
     val ligaturesAndSymbols: Map[Char, List[String]] = Map(
+       // commonly used alternative forms:
       '¦' -> List("|"),
       '|' -> List("¦"),
+       // Eszett:
       'ß' -> List("ss", "SS"),
+       'β' -> List("ß"),
+       // various ligatures:
       'ﬀ' -> List("ff", "FF"),
       'ﬂ' -> List("fl", "FL"),
       'ﬁ' -> List("fi", "FI"),
       'ﬃ' -> List("ffi", "FFI"),
       'ﬄ' -> List("ffl", "FFL"),
+       'ĳ' -> List("ij", "IJ"),
+       'Ĳ' -> List("IJ", "ij"),
+       // fractions:
       '½' -> List("1/2"),
       '¼' -> List("1/4"),
       '¾' -> List("3/4"),
+       // currencies:
+       '₧' -> List("Pt", "PT"),
+       '¢' -> List("c", "C"),
+       '$' -> List("¤"),
       '¥' -> List("Y", "y"),
+       // kanji:
       '円' -> List("¥", "Y", "y"),
       '年' -> List("Y", "y"),
       '月' -> List("M", "m"),
@@ -688,11 +767,13 @@ object TextCodec {
       '時' -> List("h", "H"),
       '分' -> List("m", "M"),
       '秒' -> List("s", "S"),
+       // card suits:
       '♥' -> List("H", "h"),
       '♠' -> List("S", "s"),
       '♡' -> List("H", "h"),
       '♢' -> List("D", "d"),
       '♣' -> List("C", "c"),
+       // Eastern punctuation:
       '。' -> List("."),
       '、' -> List(","),
       '・' -> List("-"),
@@ -701,33 +782,52 @@ object TextCodec {
       '」' -> List("]", ")"),
       '。' -> List("."),
       '。' -> List("."),
+       // quote marks:
+       '«' -> List("\""),
+       '»' -> List("\""),
+       '‟' -> List("\""),
+       '”' -> List("\""),
+       '„' -> List("\""),
+       '’' -> List("\'"),
+       '‘' -> List("\'"),
+       // pi:
+       'π' -> List("Π"),
+       'Π' -> List("π"),
+       // alternative symbols:
       '^' -> List("↑"),
       '↑' -> List("^"),
       '‾' -> List("~"),
       '¯' -> List("~"),
-       '«' -> List("\""),
-       '»' -> List("\""),
       '§' -> List("#"),
       '[' -> List("("),
       ']' -> List(")"),
       '{' -> List("("),
       '}' -> List(")"),
-       '§' -> List("#"),
-       '§' -> List("#"),
-       '©' -> List("(C)"),
-       'İ' -> List("I", "i"),
+       '©' -> List("(C)", "(c)"),
+       '®' -> List("(R)", "(r)"),
+       '‰' -> List("%."),
+       '×' -> List("x"),
+       '÷' -> List("/"),
       'ª' -> List("a", "A"),
       'º' -> List("o", "O"),
-       '‰' -> List("%."),
-       '÷' -> List("/"),
-       'ĳ' -> List("ij", "IJ"),
-       'Ĳ' -> List("IJ", "ij"),
+       // Turkish I with dot:
+       'İ' -> List("I", "i"),
+       // partially supported Russian letters:
+       'ё' -> List("е", "Ё", "Е"),
+       'Ё' -> List("Е", "ё", "е"),
+       'Ъ' -> List("ъ"),
+       'ъ' -> List("Ъ"),
+       // Latin lookalikes for Cyrillic:
+       'і' -> List("i", "I"),
+       'І' -> List("I", "i"),
+       'ј' -> List("j", "J"),
+       'Ј' -> List("J", "j"),
     )
    val accentedLetters: Map[Char, List[String]] = List(
      "áàäãåąāǎă" -> "a",
      "çčċćĉ" -> "c",
-      "đď" -> "d",
-      "ð" -> "dh",
+      "ðď" -> "d",
+      "đ" -> "dj",
      "éèêëęēėě" -> "e",
      "ğǧĝģġ" -> "g",
      "ħĥ" -> "h",
@@ -760,7 +860,7 @@ object TextCodec {
      else fw -> List(hw.toString)
    }.toMap
    val halfWidth = (0xff61 to 0xff9f).map{ c => c.toChar -> List(jisHalfwidthKatakanaOrder(c - 0xff60).toString)}.toMap
-    allowLowercase ++ allowUppercase ++ ligaturesAndSymbols ++ accentedLetters ++ hiragana ++ fullWidth ++ halfWidth
+    allowLowercase ++ allowUppercase ++ allowLowercaseCyr ++ allowUppercaseCyr ++ ligaturesAndSymbols ++ accentedLetters ++ hiragana ++ fullWidth ++ halfWidth
  }

 }
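
The KOI-7 N2 table added above is uppercase-only, so lowercase Latin and Cyrillic input has to be folded to uppercase before it can be encoded. The following is a minimal standalone sketch of that idea, not the actual `TextCodec` logic: the object `Koi7N2Sketch` and its methods `encodeChar` and `encode` are hypothetical names introduced here for illustration. It copies the table string and the two explicit extra mappings from the diff, and resolves a lowercase letter by looking up the table position of its uppercase form, so that Cyrillic letters receive a 7-bit KOI-7 code rather than a Unicode code point.

```scala
// Hypothetical illustration; not part of millfork.parser.TextCodec.
object Koi7N2Sketch {
  // Decoding table copied from the Koi7N2 definition in the diff.
  // The string index is the character code; '\ufffd' marks unprintable codes.
  val table: String =
    "\ufffd" * 32 +
      " !\"#¤%&'()*+,-./" +
      "0123456789:;<=>?" +
      "@ABCDEFGHIJKLMNO" +
      "PQRSTUVWXYZ[\\]^_" +
      "ЮАБЦДЕФГХИЙКЛМНО" +
      "ПЯРСТУЖВЬЫЗШЭЩЧ"

  // Explicit extra mappings from the diff: '↑' shares the code of '^',
  // and '$' shares the code of '¤'.
  val extra: Map[Char, Int] = Map('↑' -> 0x5E, '$' -> 0x24)

  // Finds the code of a character that appears directly in the table.
  private def lookup(ch: Char): Option[Int] = {
    val i = table.indexOf(ch)
    if (ch != '\ufffd' && i >= 0) Some(i) else None
  }

  // Tries a direct hit, then the extra mappings, then the uppercase form.
  // Assumption of this sketch: case folding is done by table lookup.
  def encodeChar(c: Char): Option[Int] =
    lookup(c).orElse(extra.get(c)).orElse(lookup(c.toUpper))

  // Encodes a whole string; None if any character is unavailable.
  def encode(s: String): Option[List[Int]] = {
    val codes = s.toList.map(encodeChar)
    if (codes.forall(_.isDefined)) Some(codes.flatten) else None
  }
}
```

With this sketch, `Koi7N2Sketch.encode("Привет")` succeeds because every letter has an uppercase counterpart in the table, while any string containing **ъ** or **ё** yields `None`, which matches the availability note in `docs/lang/text.md`. The lossy replacements that `-flenient-encoding` enables are deliberately left out.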