Commit Graph

172 Commits

Author SHA1 Message Date
Stephen Heumann
dc305a86b2 Add flag to suppress printing of put-back tokens with #pragma expand.
This is currently used in a couple places in the designated initializer code (solving the problem with #pragma expand in the last commit). It could probably be used elsewhere too, but for now it is not.
2022-11-28 21:22:56 -06:00
Stephen Heumann
b6d3dfb075 Designated initializers for arrays, part 1.
This can parse designated initializers for arrays, but does not create proper initializer records for them.
2022-11-26 15:22:58 -06:00
Stephen Heumann
3f450bdb80 Support "inline" function definitions without static or extern.
This is a minimal implementation that does not actually inline anything, but it is intended to implement the semantics defined by the C99 and later standards.

One complication is that a declaration that appears somewhere after the function body may create an external definition for a function that appeared to be an inline definition when it was defined. To support this while preserving ORCA/C's general one-pass compilation strategy, we generate code even for inline definitions, but treat them as private and add the prefix "~inline~" to the name. If they are "un-inlined" based on a later declaration, we generate a stub with external linkage that just jumps to the apparently-inline function.
2022-11-19 23:04:22 -06:00
Stephen Heumann
ab368d442a Allow \ as an "other character" preprocessing token.
This still has a few issues. A \ token may not be followed by u or U (because this triggers UCN processing). We should scan through the whole possible UCN until we can confirm whether it is actually a UCN, but that would require more lookahead. Also, \ is not handled correctly in stringization (it should form escape sequences).
2022-11-08 20:46:48 -06:00
Stephen Heumann
9cc72c8845 Support "other character" preprocessing tokens.
This implements the catch-all category for preprocessing tokens for "each non-white-space character that cannot be one of the above" (C17 section 6.4). These may appear in skipped code, or in macros or macro parameters if they are never expanded or are stringized during macro processing. The affected characters are $, @, `, and many extended characters.

It is still an error if these tokens are used in contexts where they remain present after preprocessing. If #pragma ignore bit 0 is clear, these characters are also reported as errors in skipped code or preprocessor constructs.
2022-11-08 18:58:50 -06:00
Stephen Heumann
f31b5ea1e6 Allow "extern inline" functions.
A function declared "inline" with an explicit "extern" storage class has the same semantics as if "inline" was omitted. (It is not an inline definition as defined in the C standards.) The "inline" specifier suggests that the function should be inlined, but it is legal to just ignore it, as we already do for "static inline" functions.

Also add a test for the inline function specifier.
2022-10-29 19:43:57 -05:00
Stephen Heumann
f54d0e1854 Require that main have no function specifiers.
This enforces a constraint in the C standards (for a hosted environment).
2022-10-29 18:36:51 -05:00
Stephen Heumann
e5428b21d2 Do not skip over the character after _Pragma(...). 2022-10-29 15:55:44 -05:00
Stephen Heumann
4702df9aac Support Unicode strings and some escape sequences in _Pragma.
This still works by "reconstructing" the string literal text, rather than just using what was in the source code. This is not what the standards specify and can result in slightly different behavior in some corner cases, but for realistic cases it is probably fine.
2022-10-25 22:47:22 -05:00
Stephen Heumann
e63d827049 Do not do macro expansion on preprocessor directive names.
According to the C standards (C17 section 6.10.3 p8), they should not be subject to macro replacement.

A similar change also applies to the "STDC" in #pragma STDC ... (but we still allow macros for other pragmas, which is allowed as part of the implementation-defined behavior of #pragma).

Here is an example affected by this issue:

#define ifdef ifndef
#ifdef foobar
#error "foobar defined?"
#else
int main(void) {}
#endif
2022-10-25 22:40:20 -05:00
Stephen Heumann
e0b27db652 Do not try to interpret non-identifier tokens as pragma names.
This could access arbitrary memory locations, and could theoretically cause misbehavior including falsely recognizing the token as a pragma or accessing a softswitch/IO location.
2022-10-25 22:26:30 -05:00
Stephen Heumann
81353a9f8a Always interpret the digit sequence in #line as decimal.
This is what the standards call for.
2022-10-23 13:47:59 -05:00
Stephen Heumann
65ec29ee3e Use 32-bit representation for line numbers.
C99 and later specify that line numbers set via #line can be up to 2147483647, so they need to be represented as (at least) a 32-bit value.
2022-10-22 21:46:12 -05:00
Stephen Heumann
760c932fea Initial implementation of _Pragma (C99).
This works for typical cases, but does not yet handle Unicode strings, very long strings, or certain escape sequences.
2022-10-22 17:08:54 -05:00
Stephen Heumann
946c6c1d55 Always end preprocessor expression processing at end of line.
In certain error cases, tokens from subsequent lines could get treated as part of a preprocessor expression, causing subsequent code to be essentially ignored and producing strange error messages.

Here is an example (with an error) affected by this:

#pragma optimize 0 0
int main(void) {}
2022-10-21 18:51:53 -05:00
Stephen Heumann
bdf212ec6b Remove support for separate . . . as equivalent to a ... token.
The scanner has been updated so that ... should always get recognized as a single token, so this is no longer necessary as a workaround. Any code that actually uses separate . . .  is non-standard and will need to be changed.
2022-10-19 18:14:14 -05:00
Stephen Heumann
6d8ca42734 Parse the _Thread_local storage-class specifier.
This does not really do anything, because ORCA/C does not support multithreading, but the C11 and later standards indicate it should be allowed anyway.
2022-10-18 21:01:26 -05:00
Stephen Heumann
afe40c0f67 Prevent spurious errors about structs containing function pointers.
If a struct contained a function pointer with a prototyped parameter list, processing the parameters could reset the declaredTagOrEnumConst flag, potentially leading to a spurious error, as in this example:

struct S {
	int (*f)(int);
};

This also gives a better error for structs declared as containing functions.
2022-10-16 19:57:14 -05:00
Stephen Heumann
a864954353 Use "declarator expected" error messages when appropriate.
Previously, some of these cases would report "identifier expected."
2022-10-16 18:45:06 -05:00
Stephen Heumann
44a1ba5205 Print floating constants with more precision in #pragma expand output.
Finite numbers should now be printed with sufficient precision to produce the same value as the original constant in the relevant type.
2022-10-15 22:20:22 -05:00
Stephen Heumann
83ac0ecebf Add a function to peek at the next character.
This is necessary to correctly handle line continuations in a few places:
* Between an initial . and the subsequent digit in a floating constant
* Between the third and fourth characters of a %:%: digraph
* Between the second and third dots of a ... token

Previously, these would not be tokenized correctly, leading to spurious errors in the first and second cases above.

Here is a sample program illustrating the problem:

int printf(const char * restrict, ..\
\
??/
.);
int main(void) {
        double d = .??/
\
??/
\
1234;
        printf("%f\n", d);
}
2022-10-15 21:42:02 -05:00
Stephen Heumann
b8b7dc2c2b Remove code that treats # as an illegal character in most places.
C90 had constraints requiring # and ## tokens to only appear in preprocessing directives, but C99 and later removed those constraints, so this code is no longer necessary when targeting current languages versions. (It would be necessary in a "strict C90" mode, if that was ever implemented.)

The main practical effect of this is that # and ## tokens can be passed as parameters to macros, provided the macro either ignores or stringizes that parameter. # and ## tokens still have no role in the grammar of the C language after preprocessing, so they will be an unexpected token and produce some kind of error if they appear anywhere.

This also contains a change to ensure that a line containing one or more illegal characters (e.g. $) and then a # is not treated as a preprocessing directive.
2022-10-13 18:35:26 -05:00
Stephen Heumann
4fe9c90942 Parse ... as a single punctuator token.
This accords with its definition in the C standards. For the time being, the old form of three separate tokens is still accepted too, because the ... token may not be scanned correctly in the obscure case where there is a line continuation between the second and third dots.

One observable effect of this is that there are no longer spaces between the dots in #pragma expand output.
2022-10-10 18:06:01 -05:00
Stephen Heumann
f263066f61 Give an error for declarations that do not declare anything.
This enforces the constraint from C17 section 6.7 p2 that declarations "shall declare at least a declarator (other than the parameters of a function or the members of a structure or union), a tag, or the members of an enumeration."

Somewhat relaxed rules are used for enums in the default loose type checking mode, similar to what GCC and Clang do.
2022-10-09 22:03:06 -05:00
Stephen Heumann
05ecf5eef3 Add option to use the declared type for float/double/comp params.
This differs from the usual ORCA/C behavior of treating all floating-point parameters as extended. With the option enabled, they will still be passed in the extended format, but will be converted to their declared type at the start of the function. This is needed for strict standards conformance, because you should be able to take the address of a parameter and get a usable pointer to its declared type. The difference in types can also affect the behavior of _Generic expressions.

The implementation of this is based on ORCA/Pascal, which already did the same thing (unconditionally) with real/double/comp parameters.
2022-09-18 21:16:46 -05:00
Stephen Heumann
95ad02f0b9 Detect various errors in macro definitions.
These changes detect violations of several constraints in C17 section 6.10.3 and subsections.
2022-07-28 20:49:22 -05:00
Stephen Heumann
6e3fca8b82 Implement strict type checking for enum types.
If strict type checking is enabled, this will prohibit redefinition of enums, like:

enum E {a,b,c};
enum E {x,y,z};

It also prohibits use of an "enum E" type specifier if the enum has not been previously declared (with its constants).

These things were historically supported by ORCA/C, but they are prohibited by constraints in section 6.7.2.3 of C99 and later. (The C90 wording was different and less clear, but I think they were not intended to be valid there either.)
2022-07-19 20:35:44 -05:00
Stephen Heumann
8406921147 Parse command-line macros more consistently with macros in code.
This makes a macro defined on the command line like -Dfoo=-1 consist of two tokens, the same as it would if defined in code. (Previously, it was just one token.)

This also somewhat expands the set of macros accepted on the command line. A prefix of +, -, *, &, ~, or ! (the one-character unary operators) can now be used ahead of any identifier, number, or string. Empty macro definitions like -Dfoo= are also permitted.
2022-06-15 21:52:35 -05:00
Stephen Heumann
3c2b492618 Add support for compound literals within functions.
The basic approach is to generate a single expression tree containing the code for the initialization plus the reference to the compound literal (or its address). The various subexpressions are joined together with pc_bno pcodes, similar to the code generated for the comma operator. The initializer expressions are placed in a balanced binary tree, so that it is not excessively deep.

Note: Common subexpression elimination has poor performance for very large trees. This is not specific to compound literals, but compound literals for relatively large arrays can run into this issue. It will eventually complete and generate a correct program, but it may be quite slow. To avoid this, turn off CSE.
2022-06-08 21:34:12 -05:00
Stephen Heumann
58771ec71c Do not do macro expansion after each ## operator is evaluated.
It should only be done after all the ## operators in the macro have been evaluated, potentially merging together several tokens via successive ## operators.

Here is an example illustrating the problem:

#define merge(a,b,c) a##b##c
#define foobar
#define foobarbaz a
int merge(foo,bar,baz) = 42;
int main(void) {
        return a;
}
2022-05-24 22:38:56 -05:00
Stephen Heumann
deca73d233 Properly expand macros that have the same name as a keyword or typedef.
If such macros were used within other macros, they would generally not be expanded, due to the order in which operations were evaluated during preprocessing.

This is actually an issue that was fixed by the changes from ORCA/C 2.1.0 to 2.1.1 B3, but then broken again by commit d0b4b75970.

Here is an example with the name of a keyword:

#define X long int
#define long
X x;
int main(void) {
        return sizeof(x); /* should be sizeof(int) */
}

Here is an example with the name of a typedef:

typedef short T;
#define T long
#define X T
X x;
int main(void) {
        return sizeof(x); /* should be sizeof(long) */
}
2022-05-24 22:22:37 -05:00
Stephen Heumann
21f266c5df Require use of digraphs in macro redefinitions to match the original.
This is part of the general requirement that macro redefinitions be "identical" as defined in the standard.

This affects code like:

#define x [
#define x <:
2022-04-05 19:47:22 -05:00
Stephen Heumann
a1d57c4db3 Allow ORCA/C-specific keywords to be disabled via a new pragma.
This allows those tokens (asm, comp, extended, pascal, and segment) to be used as identifiers, consistent with the C standards.

A new pragma (#pragma extensions) is introduced to control this. It might also be used for other things in the future.
2022-03-26 18:45:47 -05:00
Stephen Heumann
b2edeb4ad1 Properly stringize tokens that start with a trigraph.
This did not work correctly before, because such tokens were recorded as starting with the third character of the trigraph.

Here is an example affected by this:

#define mkstr(a) # a
#include <stdio.h>
int main(void) {
        puts(mkstr(??!));
        puts(mkstr(??!??!));
        puts(mkstr('??<'));
        puts(mkstr(+??!));
        puts(mkstr(+??'));
}
2022-03-25 18:10:13 -05:00
Stephen Heumann
f531f38463 Use suffixes on numeric constants in #pragma expand output.
A suffix will now be printed on any integer constant with a type other than int, or any floating constant with a type other than double. This ensures that all constants have the correct types, and also serves as documentation of the types.
2022-03-01 19:46:14 -06:00
Stephen Heumann
182cf66754 Properly stringize tokens with line continuations or non-initial trigraphs.
Previously, continuations or trigraphs would be included in the string as-is, which should not be the case because they are (conceptually) processed in earlier compilation phases. Initial trigraphs still do not get stringized properly, because the token starting position is not recorded correctly for them.

This fixes code like the following:

#define mkstr(a) # a
#include <stdio.h>
int main(void) {
        puts(mkstr(a\
bc));
        puts(mkstr(qr\
));
        puts(mkstr(\
xy));
        puts(mkstr(12??/
34));
        puts(mkstr('??<'));
}
2022-03-01 19:01:11 -06:00
Stephen Heumann
fec7b57ec2 Generate a string representation of tokens merged with ##.
This is necessary for correct behavior if such tokens are subsequently stringized with #. Previously, only the first half of the token would be produced.

Here is an example demonstrating the issue:

#define mkstr(a) # a
#define in_between(a) mkstr(a)
#define joinstr(a,b) in_between(a ## b)
#include <stdio.h>
int main(void) {
        puts(joinstr(123,456));
        puts(joinstr(abc,def));
        puts(joinstr(dou,ble));
        puts(joinstr(+,=));
        puts(joinstr(:,>));
}
2022-02-22 18:48:34 -06:00
Stephen Heumann
6cfe8cc886 Remove an unused string representation of macro tokens.
The string representation of macro tokens is needed for some preprocessor operations, but we get this in other ways (e.g. based on tokenStart/tokenEnd).
2022-02-21 18:39:39 -06:00
Stephen Heumann
8f27b8abdb Print any ## tokens in #pragma expand output.
Note that ## will not currently be recognized as a token in some contexts, leading to it not being printed.
2022-02-20 20:53:37 -06:00
Stephen Heumann
bf7a6fa5db Use separate functions for merging tokens with ## and merging adjacent strings.
These are conceptually separate operations occurring in different phases of the translation process. This change means that ## can no longer merge string constants: such operations will give an error about an illegal token. Cases like this are technically undefined behavior, so the old behavior could have been permitted, but it is clearer and more consistent with other compilers to treat this as an error.
2022-02-20 20:16:08 -06:00
Stephen Heumann
26e1bfc253 Allow generation of digraphs via ## token merging. 2022-02-20 18:57:03 -06:00
Stephen Heumann
2b062a8392 Make ## token merging on character constants give an error.
This ultimately should be supported, but that will be more work. For now, we just set the string representation to '?', which will usually give an error when merged. (Previously, whatever was at memory location 0 would be treated as the string representation of the token. Frequently this would just be an empty string, leading to no error but incorrect results.)
2022-02-20 16:19:00 -06:00
Stephen Heumann
da978932bf Save string representation of macros defined on command line.
This is necessary for correct operation of the # and ## preprocessor operators on the tokens from such macros.

Integers with a sign character still have the non-standard property of being treated as a single token, so they cannot be used with ##, but in most cases such uses will now give an error.
2022-02-20 15:35:49 -06:00
Stephen Heumann
aabbadb34b Terminate header generation if #warning is encountered.
This is necessary to ensure that the warning message is printed on subsequent compiles.
2022-02-19 14:06:15 -06:00
Stephen Heumann
a73dce103b Terminate PCH generation if an #append is encountered.
If the appended file was another C file and that file contained an #include, this would create an invalid record in the sym file. It would record memory from the buffer holding the original file to the buffer holding the appended file. In general, these are not contiguous, so superfluous data from other parts of memory would be included in the sym file. This record would normally just be treated as invalid on subsequent compiles, but it could theoretically be very large (depending on the memory layout) and might contain sensitive data from other parts of memory.
2022-02-19 14:05:07 -06:00
Stephen Heumann
f2d6625300 Save #pragma path directives in sym files.
They were not being saved, which would result in ORCA/C not searching the proper paths when looking for an include file after the sym file had ended. Here is an example showing the problem:

#pragma path "include"
#include <stdio.h>
int k = 50;
#include "n.h" /* will not find include:n.h */
2022-02-15 21:27:35 -06:00
Stephen Heumann
3893db1346 Make sure #pragma expand is properly applied in all cases.
There were various places where the flag for macro expansions was saved, set to false, and then later restored. If #pragma expand was used within those areas, it would not be properly applied. Here is an example showing that problem:

void f(void
#pragma expand 1
) {}

This could also affect some uses of #pragma expand within precompiled headers, e.g.:

#pragma expand 1
#include "a.h"
#undef foobar
#include "b.h"
...

Also, add a note saying that code in precompiled headers will not be expanded. (This has always been the case, but was not clearly documented.)
2022-02-15 20:50:02 -06:00
Stephen Heumann
c96cf4f1dd Do not save predefined and command-line macros in the sym file.
Previously, these might or might not be saved (based on the contents of uninitialized memory), but in many cases they were. This was unnecessary, since these macros are automatically defined when the scanner is initialized. Reading them from the sym file could result in duplicate copies of them in the macro list. This is usually harmless, but might result in #undefs of macros from the command line not working properly.
2022-02-13 20:17:33 -06:00
Stephen Heumann
b493dcb1da Add lint check to require whitespace after names of object-like macros.
This is a requirement added in C99, so it is added as part of the C99 syntax checks.

This affects definitions like:

#define foo;
2022-02-13 19:44:56 -06:00
Stephen Heumann
c169c2bf92 Fully prohibit redefinition of predefined macros.
Code like the following was previously being allowed:

#define __STDC__ /* no tokens */
2022-02-13 18:10:45 -06:00