1
0
mirror of https://github.com/KarolS/millfork.git synced 2024-10-31 14:04:58 +00:00
millfork/docs/lang/literals.md

11 KiB
Raw Blame History

< back to index

Literals and initializers

Numeric literals

Decimal: 1, 10

Binary: %0101, 0b101001

Quaternary: 0q2131

Octal: 0o172

Hexadecimal: $D323, 0x2a2

Digits can be separated by underscores for readability. Underscores are also allowed between the radix prefix and the digits:

123_456
0x01_ff
0b_0101_1111__1100_0001
$___________FF

When using Intel syntax for inline assembly, another hexadecimal syntax is available: 0D323H, 2a2h, 3e_h. It is not allowed in any other places.

The type of a literal is the smallest type of undefined signedness that can fit either the unsigned or signed representation of the value: 200 is a byte, 4000 is a word, 75000 is an int24 etc.

However, padding the literal to the left with zeroes changes the type to the smallest type that can fit the smallest number with the same number of digits and without padding. For example, 0002 is of type word, as 1000 does not fit in one byte.

String literals

String literals can be used as either array initializers or expressions of type pointer.

String literals are equivalent to constant arrays. Writing to them via their pointer is undefined behaviour.

If a string literal is used as an expression, then the text data will be located in the default code segment, regardless of which code segment the current function is located it. This may be subject to change in future releases.

String literals are surrounded with double quotes and optionally followed by the name of the encoding:

"this is a string" ascii
"this is also a string"

If there is no encoding name specified, then the default encoding is used. Two encoding names are special and refer to platform-specific encodings: default and scr.

Zero-terminated strings

You can also append z to the name of the encoding to make the string zero-terminated. This means that the string will have a string terminator appended, usually a single byte. The exact value of that byte is encoding-dependent:

  • in the vectrex encoding it's 128,

  • in the zx80 encoding it's 1,

  • in the zx81 encoding it's 11,

  • in the petscr and petscrjp encodings it's 224,

  • in the atascii encoding it's 219,

  • in the utf16be and utf16le encodings it's exceptionally two bytes: 0, 0

  • in other encodings it's 0 (this may be a subject to change in future versions).

      "this is a zero-terminated string" asciiz
      "this is also a zero-terminated string"z
    

The byte constant nullchar is defined to be equal to the string terminator in the default encoding (or, in other words, to '{nullchar}') and the byte constant nullchar_scr is defined to be equal to the string terminator in the scr encoding ('{nullchar}'scr).

You can override the values for nullchar and nullchar_scr by defining preprocessor features NULLCHAR and NULLCHAR_SCR respectively.

Warning: If you define UTF-16 or UTF-32 to be you default or screen encoding, you will encounter several problems:

  • nullchar and nullchar_scr will still be bytes, equal to zero.
  • the string module in the Millfork standard library will not work correctly

Length-prefixed strings (Pascal strings)

You can also prepend p to the name of the encoding to make the string length-prefixed.

The length is measured in bytes and doesn't include the zero terminator, if present. In all encodings except for UTF-16 and UTF-32 the prefix takes one byte, which means that length-prefixed strings cannot be longer than 255 bytes.

In case of UTF-16, the length prefix contains the number of code units, so the number of bytes divided by two, which allows for strings of practically unlimited length. The length is stored as two bytes and is always little endian, even in case of the utf16be encoding or a big-endian processor.

In case of UTF-32, the length prefix contains the number of Unicode codepoints, so the number of bytes divided by four. The length is stored as four bytes and is always little endian, even in case of the utf32be encoding or a big-endian processor.

    "this is a Pascal string" pascii
    "this is also a Pascal string"p
    "this is a zero-terminated Pascal string"pz

Note: A string that's both length-prefixed and zero-terminated does not count as a normal zero-terminated string! To pass it to a function that expects a zero-terminated string, add 1 (or, in case of UTF-16, 2, or UTF-32, 4):

pointer p
p = "test"pz
// putstrz(p)  // won't work correctly
putstrz(p + 1) // ok

Escape sequences and miscellaneous compatibility issues

Most characters between the quotes are interpreted literally. To allow characters that cannot be inserted normally, each encoding may define escape sequences. Every encoding is guaranteed to support at least {q} for double quote and {apos} for single quote/apostrophe.

The number of bytes used to represent given characters may differ from the number of the characters. For example, the petjp, msx_jp and jis encodings represent ポ as two separate characters, and therefore two bytes.

For the list of all text encodings and escape sequences, see this page.

In some encodings, multiple characters are mapped to the same byte value, for compatibility with multiple variants.

If the characters in the literal cannot be encoded in particular encoding, an error is raised. However, if the command-line option -flenient-encoding is used, then literals using default and scr encodings replace unsupported characters with supported ones, skip unsupported escape sequences, and a warning is issued. For example, if -flenient-encoding is enabled, then a literal "£¥↑ž©ß" is equivalent to:

  • "£Y↑z(C)ss" if the default encoding is pet

  • "£Y↑z©ss" if the default encoding is bbc

  • "?Y^z(C)ss" if the default encoding is ascii

  • "?Y^ž(C)ss" if the default encoding is iso_yu

  • "?Y^z(C)ß" if the default encoding is iso_de

  • "?¥^z(C)ss" if the default encoding is jisx

  • "£¥^z(C)β" if the default encoding is msx_intl

Note that the final length of the string may vary.

Character literals

Character literals are surrounded by single quotes and optionally followed by the name of the encoding:

'x' ascii
'W'

Character literals have to be separated from preceding operators with whitespace:

a='a'    // wrong 
a = 'a'  // ok 

From the type system point of view, they are constants of type byte.

If the character cannot be represented as one byte, an error is raised.

For the list of all text encodings and escape sequences, see this page.

If the characters in the literal cannot be encoded in particular encoding, an error is raised. However, if the command-line option -flenient-encoding is used, then literals using default and scr encodings replace unsupported characters with supported ones. If the replacement is one character long, only a warning is issued, otherwise an error is raised.

Struct constructors

You can create a constant of a given struct type by listing constant values of fields as arguments:

struct point { word x, word y }
point(5,6)

Array initializers

An array is initialized with either:

  • (only byte arrays) a string literal

  • (only byte arrays) a file expression

  • a for-style expression

  • (only byte arrays) a format, followed by an array initializer:

    • @word_le: for every term of the array initializer, emit two bytes, first being the low byte of the value, second being the high byte:
      @word_le [$1122] is equivalent to [$22, $11]

    • @word_be like the above, but opposite:
      @word_be [$1122] is equivalent to [$11, $22]

    • @word: equivalent to @word_le on little-endian architectures and @word_be on big-endian architectures

    • @long, @long_le, @long_be: similar, but with four bytes
      @long_le [$11223344] is equivalent to [$44, $33, $22, $11]
      @long_be [$11223344] is equivalent to [$11, $22, $33, $44]

    • @struct: every term of the initializer is interpreted as a struct constructor (see below) and treated as a list of bytes with no padding
      @struct [s(1, 2)] is equivalent to [1, 2] when struct s {byte x, byte y} is defined
      @struct [s2(1, 2), s2(3, 4)] is equivalent to [1, 0, 2, 0, 3, 0, 4, 0] on little-endian machines when struct s2 {word x, word y} is defined

  • a list of literals and/or other array initializers, surrounded by brackets:

      array a = [1, 2]
      array b = "----" scr
      array c = ["hello world!" ascii, 13]
      array d = file("d.bin")
      array d1 = file("d.bin", 128)
      array e = file("d.bin", 128, 256)
      array f = for x,0,until,8 [x * 3 + 5]  // equivalent to [5, 8, 11, 14, 17, 20, 23, 26]
      array(point) g = [point(2,3), point(5,6)]
      array(point) i = for x,0,until,100 [point(x, x+1)]
    

Trailing commas ([1, 2,]) are not allowed.

String literals are laid out in the arrays as-is, flat. To have an array of pointers to strings, wrap each string in pointer(...):

// a.length = 12; identical to [$48, $45, $4C, $4C, $4F, 0, $57, $4F, $52, $4C, $44, 0]
array a = [ "hello"z, "world"z ] 
// b.length = 2
array(pointer) b = [ pointer("hello"z), pointer("world"z) ]

The parameters for file are: file path, optional start offset, optional length (if only two parameters are present, then the second one is assumed to be the start offset). The file expression is expanded at the compile time to an array of bytes equal to the bytes contained in the file. If the start offset is present, then that many bytes at the start of the file are skipped. If the length is present, then only that many bytes are taken, otherwise, all bytes until the end of the file are taken.

The for-style expression has a variable, a starting index, a direction, a final index, and a parameterizable array initializer. The initializer is repeated for every value of the variable in the given range.

Struct constructors look like a function call, where the callee name is the name of the struct type and the parameters are the values of fields in the order of declaration.
Fields of arithmetic, pointer and enum types are declared using normal expressions.
Fields of struct types are declared using struct constructors. Fields of union types cannot be declared.

What might be useful is the fact that the compiler allows for certain built-in functions in constant expressions only:

  • sin(x, n) returns n·sin(xπ/128)

  • cos(x, n) returns n·cos(xπ/128)

  • tan(x, n) returns n·tan(xπ/128)

  • min(x,...) returns the smallest argument

  • max(x,...) returns the largest argument