Character sets
Joseph Carter edited this page 2018-06-23 19:23:37 -07:00

From @IvanExpert on October 25, 2015 22:24

UTF-16 is 2 or 4 bytes per character (not relevant here, but worth noting).
UTF-8 is one byte for 0-127, which is ASCII compatible, and 2-4 bytes for everything else.
The multi-byte sequences break Apple II term programs for non-ASCII chars (e.g. a typographic hyphen or smart quote).
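To illustrate why those characters cause trouble, here is a minimal Python sketch that prints how many bytes a few sample characters take in UTF-8. The sample characters and the helper name `utf8_length` are chosen for illustration only; an Apple II term program would see each of those bytes as a separate character.

```python
# Compare how many bytes a few characters take in UTF-8.
# The sample set is illustrative, not taken from the original notes.
SAMPLES = {
    "A": "ASCII letter",
    "-": "ASCII hyphen-minus",
    "\u2013": "en dash",
    "\u2019": "right single (smart) quote",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch, name in SAMPLES.items():
        encoded = ch.encode("utf-8")
        # A terminal that assumes one byte per character would show
        # each of these bytes as its own (wrong) character.
        print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```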

ISO-8859-* is one byte per character, 0-255, with 128-255 varying by "part" (1-16).
ISO-8859-1 is "Latin-1", ISO-8859-15 is its revision, and the other parts are language-specific.
Apple II text comm programs are going to display 0-127 anyway, since on the Apple II 128-255 are either redundant or MouseText.
"ANSI" in a comm program means pseudo VT-100, and may also mean the "DOS CodePage 437" (IBM PC character set), as is the case with Spectrum ANSI emulation.
So it doesn't matter which ISO-8859 part is used, since the comm programs aren't going to use any of them. The main thing is that it's one byte per character, unlike UTF-8.

TERM=vt100 on the Pi makes Linux programs display mostly in B&W, and makes ctrl-chars display on Spectrum ANSI.
TERM=pcansi on the Pi makes Linux programs do color for Spectrum ANSI (TERM=ansi just breaks everything).
TERM=vt100 doesn't work with "ANSI" emulation because it outputs ctrl-O around text styling, and ctrl-O is a displayed character in CP437.
LANG=en_US (as opposed to en_US.UTF-8) gets you ISO-8859-1, which is better for Spectrum ANSI, but the en_US ISO-8859-1 locale has to be available (from raspi-config).
See A2CLOUD setup for how to generate locales from the Linux prompt.

ProTERM VT-100 just repeats 128-255; ANSI BBS uses ASCII and MouseText to approximate DOS Code Page 437.
Spectrum VT-100 is sort of arbitrary in 128-255.
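As a rough sketch of how the TERM and LANG choices above could be applied when launching a program for a serial session, the Python below builds an environment dictionary per emulation. The table `EMULATION_ENV` and the function `serial_env` are hypothetical names, and using LANG=en_US for the VT-100 case is an assumption; the notes only recommend it for Spectrum ANSI.

```python
import os

# Hypothetical table based on the notes above; not part of A2CLOUD itself.
EMULATION_ENV = {
    # VT-100 emulation: mostly black and white output.
    # LANG=en_US here is an assumption (one byte per character).
    "vt100": {"TERM": "vt100", "LANG": "en_US"},
    # Spectrum "ANSI" emulation: pcansi gives color without ctrl-O.
    "ansi": {"TERM": "pcansi", "LANG": "en_US"},
}


def serial_env(emulation, base=None):
    """Return a copy of the base environment with TERM and LANG overridden."""
    env = dict(os.environ if base is None else base)
    env.update(EMULATION_ENV[emulation])
    return env


if __name__ == "__main__":
    env = serial_env("ansi")
    print(env["TERM"], env["LANG"])
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run`. The en_US ISO-8859-1 locale still has to be generated on the Pi for the LANG setting to take effect.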

Single-byte:
ASCII is a single byte, 0-127 (0-31 are "C0" control codes, and 127 is DEL).
ISO-8859-* (parts 1-16) is ASCII for 0-127; 128-159 are "C1" control codes, and 160-255 are regional characters.
ISO-8859-1 is the standard "Latin-1"; ISO-8859-15 is updated for the Euro sign and other chars.
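The byte ranges above can be summarized with a small Python sketch. The function name `classify_iso8859` and the category labels are made up for illustration.

```python
def classify_iso8859(byte: int) -> str:
    """Classify a byte value according to the ISO-8859-* layout."""
    if not 0 <= byte <= 255:
        raise ValueError("not a single-byte value")
    if byte <= 31:
        return "C0 control"
    if byte == 127:
        return "DEL"
    if byte <= 126:
        return "ASCII printable"
    if byte <= 159:
        return "C1 control"
    return "regional"


if __name__ == "__main__":
    for value in (0x0F, 0x41, 0x7F, 0x85, 0xE9):
        print(f"0x{value:02X}: {classify_iso8859(value)}")
```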

Microsoft has its own "codepage" numbers for character sets.
Codepage 437 (aka "ANSI BBS") is the DOS character set: ASCII for 32-126, plus printable chars at 1-31 and 127-255 (all of these chars can also be represented in UTF-8).
The "Linedraw" font for Windows provides characters 128+ for codepage 437: ftp://ftp.microsoft.com/Softlib/MSLFILES/GC0651.EXE (use 64.4.17.176 if the hostname doesn't resolve).
The "Terminal" font in XP also provides most of it; Courier New is a Unicode font with most of the same characters.
Windows-1252 (codepage 1252) is ISO-8859-1 with additional chars at 128-159 instead of C1, including all chars in ISO-8859-15.
The Mac has the "macintosh" or "MacRoman" encoding, which is ASCII for 0-127 and its own characters for 128-255.
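To make the differences concrete, the following Python sketch decodes the same byte under each of the encodings mentioned above, using Python's built-in codecs. The helper name `decode_all` and the sample bytes are chosen for illustration.

```python
# Python's standard codec names for the character sets discussed above.
CODECS = ["cp437", "latin-1", "cp1252", "mac_roman"]


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each codec; undefined bytes become U+FFFD."""
    return {name: raw.decode(name, errors="replace") for name in CODECS}


if __name__ == "__main__":
    # 0x41 is ASCII and identical everywhere; 0x80 and 0xC4 differ per set.
    for value in (0x41, 0x80, 0xC4):
        decoded = decode_all(bytes([value]))
        shown = {name: f"U+{ord(text):04X}" for name, text in decoded.items()}
        print(f"0x{value:02X}: {shown}")
```

Note that Python's `cp437` codec maps bytes 0-31 to the ordinary control codes rather than to the DOS glyphs, so it cannot be used to show how a CP437 terminal would draw ctrl-O.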

UTF-8 characters 0-127 are the same as ASCII.
UTF-8 characters 128+ take between two and four bytes, and can represent everything in Unicode.
UTF-16 characters are either two or four bytes, and are endian-sensitive.
UTF-32 characters are always four bytes, and are endian-sensitive.
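A short Python sketch of the size and byte-order points above: it reports the encoded length of a character in each Unicode encoding and shows how the byte order changes the UTF-16 output. The helper name `encoded_sizes` and the sample characters are illustrative.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of ch for each Unicode encoding."""
    # The explicit -le variants are used so that no byte order mark is added.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    # ASCII letter, Euro sign, and a character outside the 16-bit range.
    for ch in ("A", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {encoded_sizes(ch)}")

    # Endianness: the same character, two different byte orders.
    print("A".encode("utf-16-le").hex(" "))
    print("A".encode("utf-16-be").hex(" "))
```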

iKarith adds 2018-06-23:

We've discussed that this setup is less than ideal and in fact the current behavior doesn't match what is discussed here. Instead, LANG=C is happening by default due to an error. This is probably for the best for the serial login and ought to become the explicit default.

Moreover, IBM CP437 is supported by Spectrum at least (with a bug that may or may not be fixed at the time of writing—really should check on that), and we don't generate a locale with that character set at all. We ought to.