Commit Graph

182 Commits

Author SHA1 Message Date
Stephen Heumann a6ef872513 Add debugging option to detect illegal use of null pointers.
This adds debugging code to detect null pointer dereferences, as well as pointer arithmetic on null pointers (which is also undefined behavior, and can lead to later dereferences of the resulting pointers).

Note that ORCA/Pascal can already detect null pointer dereferences as part of its more general range-checking code. This implementation for ORCA/C will report the same error as ORCA/Pascal ("Subrange exceeded"). However, it does not include any of the other forms of range checking that ORCA/Pascal does, and (unlike in ORCA/Pascal) it is controlled by a separate flag from stack overflow checking.
2023-02-12 18:56:02 -06:00
Stephen Heumann 03fc7a43b9 Give an error for expressions with incomplete struct/union types.
These are erroneous, in situations where the expression is used for its value. For function return types, this violates a constraint (C17 6.5.2.2 p1), so a diagnostic is required. We also now diagnose this issue for identifier expressions or unary * (indirection) expressions. These cases cause undefined behavior per C17 6.3.2.1 p2, so a diagnostic is not required, but it is nice to give one.
2023-01-09 21:58:53 -06:00
Stephen Heumann 61a2cd1e5e Consistently report "compiler error" for unrecognized error codes.
The old approach of calling Error while in the middle of writing error messages did not work reliably.
2023-01-09 18:46:33 -06:00
Stephen Heumann 245dd0a3f4 Add lint check for implicit conversions that change a constant's value.
This occurs when the constant value is out of range of the type being assigned to. This is likely indicative of an error, or of code that assumes types have larger ranges than they do in ORCA/C (e.g. 32-bit int).

This intentionally does not report cases where a value is assigned to a signed type but is within the range of the corresponding unsigned type, or vice versa. These may be done intentionally, e.g. setting an unsigned value to "-1" or setting a signed value using a hex constant with the high bit set. Also, only conversions to 8-bit or 16-bit integer types are currently checked.
2023-01-03 18:57:32 -06:00
Stephen Heumann fe62f70d51 Add lint option to check for unused variables. 2022-12-12 21:47:32 -06:00
Stephen Heumann 32975b720f Allow native code peephole opt to be used when stack repair is enabled.
I think the reason this was originally disallowed is that the old code sequence for stack repair code (in ORCA/C 2.1.0) ended with TYA. If this was followed by STA dp or STA abs, the native code peephole optimizer (prior to commit 7364e2d2d3) would have turned the combination into a STY instruction. That is invalid if the value in A is needed. This could come up, e.g., when assigning the return value from a function to two different variables.

This is no longer an issue, because the current code sequence for stack repair code no longer ends in TYA and is not susceptible to the same kind of invalid optimization. So it is no longer necessary to disable the native code peephole optimizer when using stack repair code (either for all calls or just varargs calls).
2022-12-10 20:34:00 -06:00
Stephen Heumann e71fe5d785 Treat unary + as an actual operator, not a no-op.
This is necessary both to detect errors (using unary + on non-arithmetic types) and to correctly perform the integer promotions when unary + is used (which can be detected with sizeof or _Generic).
2022-12-09 19:03:38 -06:00
Stephen Heumann bb1bd176f4 Add a command-line option to select the C standard to use.
This provides a more straightforward way to place the compiler in a "strict conformance" mode. This could essentially be achieved by setting several pragma options, but having a single setting is simpler. "Compatibility modes" for older standards can also be selected, although these actually continue to enable most C17 features (since they are unlikely to cause compatibility problems for older code).
2022-12-07 21:35:15 -06:00
Stephen Heumann 6857913daa Make the object buffer dynamically resizable.
It will now grow as needed to accommodate large segments, subject to the constraints of available memory. In practice, this mostly affects the size of initialized static arrays that can be used.

This also removes any limit apart from memory size on how large the object representation produced by a "compile to memory" can be, and cleans up error reporting regarding size limits.
2022-12-06 21:49:20 -06:00
Stephen Heumann c06d78bb5e Add __STDC_VERSION__ macro.
With the addition of designated initializers, ORCA/C now supports all the major mandatory language features added between C90 and C17, apart from those made optional by C11. There are still various small areas of nonconformance and a number of missing library functions, but at this point it is reasonable for ORCA/C to report itself as being a C17 implementation.
2022-12-04 22:25:02 -06:00
Stephen Heumann dc305a86b2 Add flag to suppress printing of put-back tokens with #pragma expand.
This is currently used in a couple places in the designated initializer code (solving the problem with #pragma expand in the last commit). It could probably be used elsewhere too, but for now it is not.
2022-11-28 21:22:56 -06:00
Stephen Heumann b6d3dfb075 Designated initializers for arrays, part 1.
This can parse designated initializers for arrays, but does not create proper initializer records for them.
2022-11-26 15:22:58 -06:00
Stephen Heumann 3f450bdb80 Support "inline" function definitions without static or extern.
This is a minimal implementation that does not actually inline anything, but it is intended to implement the semantics defined by the C99 and later standards.

One complication is that a declaration that appears somewhere after the function body may create an external definition for a function that appeared to be an inline definition when it was defined. To support this while preserving ORCA/C's general one-pass compilation strategy, we generate code even for inline definitions, but treat them as private and add the prefix "~inline~" to the name. If they are "un-inlined" based on a later declaration, we generate a stub with external linkage that just jumps to the apparently-inline function.
2022-11-19 23:04:22 -06:00
Stephen Heumann ab368d442a Allow \ as an "other character" preprocessing token.
This still has a few issues. A \ token may not be followed by u or U (because this triggers UCN processing). We should scan through the whole possible UCN until we can confirm whether it is actually a UCN, but that would require more lookahead. Also, \ is not handled correctly in stringization (it should form escape sequences).
2022-11-08 20:46:48 -06:00
Stephen Heumann 9cc72c8845 Support "other character" preprocessing tokens.
This implements the catch-all category for preprocessing tokens for "each non-white-space character that cannot be one of the above" (C17 section 6.4). These may appear in skipped code, or in macros or macro parameters if they are never expanded or are stringized during macro processing. The affected characters are $, @, `, and many extended characters.

It is still an error if these tokens are used in contexts where they remain present after preprocessing. If #pragma ignore bit 0 is clear, these characters are also reported as errors in skipped code or preprocessor constructs.
2022-11-08 18:58:50 -06:00
Stephen Heumann f31b5ea1e6 Allow "extern inline" functions.
A function declared "inline" with an explicit "extern" storage class has the same semantics as if "inline" was omitted. (It is not an inline definition as defined in the C standards.) The "inline" specifier suggests that the function should be inlined, but it is legal to just ignore it, as we already do for "static inline" functions.

Also add a test for the inline function specifier.
2022-10-29 19:43:57 -05:00
Stephen Heumann f54d0e1854 Require that main have no function specifiers.
This enforces a constraint in the C standards (for a hosted environment).
2022-10-29 18:36:51 -05:00
Stephen Heumann e5428b21d2 Do not skip over the character after _Pragma(...). 2022-10-29 15:55:44 -05:00
Stephen Heumann 4702df9aac Support Unicode strings and some escape sequences in _Pragma.
This still works by "reconstructing" the string literal text, rather than just using what was in the source code. This is not what the standards specify and can result in slightly different behavior in some corner cases, but for realistic cases it is probably fine.
2022-10-25 22:47:22 -05:00
Stephen Heumann e63d827049 Do not do macro expansion on preprocessor directive names.
According to the C standards (C17 section 6.10.3 p8), they should not be subject to macro replacement.

A similar change also applies to the "STDC" in #pragma STDC ... (but we still allow macros for other pragmas, which is allowed as part of the implementation-defined behavior of #pragma).

Here is an example affected by this issue:

#define ifdef ifndef
#ifdef foobar
#error "foobar defined?"
#else
int main(void) {}
#endif
2022-10-25 22:40:20 -05:00
Stephen Heumann e0b27db652 Do not try to interpret non-identifier tokens as pragma names.
This could access arbitrary memory locations, and could theoretically cause misbehavior including falsely recognizing the token as a pragma or accessing a softswitch/IO location.
2022-10-25 22:26:30 -05:00
Stephen Heumann 81353a9f8a Always interpret the digit sequence in #line as decimal.
This is what the standards call for.
2022-10-23 13:47:59 -05:00
Stephen Heumann 65ec29ee3e Use 32-bit representation for line numbers.
C99 and later specify that line numbers set via #line can be up to 2147483647, so they need to be represented as (at least) a 32-bit value.
2022-10-22 21:46:12 -05:00
Stephen Heumann 760c932fea Initial implementation of _Pragma (C99).
This works for typical cases, but does not yet handle Unicode strings, very long strings, or certain escape sequences.
2022-10-22 17:08:54 -05:00
Stephen Heumann 946c6c1d55 Always end preprocessor expression processing at end of line.
In certain error cases, tokens from subsequent lines could get treated as part of a preprocessor expression, causing subsequent code to be essentially ignored and producing strange error messages.

Here is an example (with an error) affected by this:

#pragma optimize 0 0
int main(void) {}
2022-10-21 18:51:53 -05:00
Stephen Heumann bdf212ec6b Remove support for separate . . . as equivalent to a ... token.
The scanner has been updated so that ... should always get recognized as a single token, so this is no longer necessary as a workaround. Any code that actually uses separate . . .  is non-standard and will need to be changed.
2022-10-19 18:14:14 -05:00
Stephen Heumann 6d8ca42734 Parse the _Thread_local storage-class specifier.
This does not really do anything, because ORCA/C does not support multithreading, but the C11 and later standards indicate it should be allowed anyway.
2022-10-18 21:01:26 -05:00
Stephen Heumann afe40c0f67 Prevent spurious errors about structs containing function pointers.
If a struct contained a function pointer with a prototyped parameter list, processing the parameters could reset the declaredTagOrEnumConst flag, potentially leading to a spurious error, as in this example:

struct S {
	int (*f)(int);
};

This also gives a better error for structs declared as containing functions.
2022-10-16 19:57:14 -05:00
Stephen Heumann a864954353 Use "declarator expected" error messages when appropriate.
Previously, some of these cases would report "identifier expected."
2022-10-16 18:45:06 -05:00
Stephen Heumann 44a1ba5205 Print floating constants with more precision in #pragma expand output.
Finite numbers should now be printed with sufficient precision to produce the same value as the original constant in the relevant type.
2022-10-15 22:20:22 -05:00
Stephen Heumann 83ac0ecebf Add a function to peek at the next character.
This is necessary to correctly handle line continuations in a few places:
* Between an initial . and the subsequent digit in a floating constant
* Between the third and fourth characters of a %:%: digraph
* Between the second and third dots of a ... token

Previously, these would not be tokenized correctly, leading to spurious errors in the first and second cases above.

Here is a sample program illustrating the problem:

int printf(const char * restrict, ..\
\
??/
.);
int main(void) {
        double d = .??/
\
??/
\
1234;
        printf("%f\n", d);
}
2022-10-15 21:42:02 -05:00
Stephen Heumann b8b7dc2c2b Remove code that treats # as an illegal character in most places.
C90 had constraints requiring # and ## tokens to only appear in preprocessing directives, but C99 and later removed those constraints, so this code is no longer necessary when targeting current languages versions. (It would be necessary in a "strict C90" mode, if that was ever implemented.)

The main practical effect of this is that # and ## tokens can be passed as parameters to macros, provided the macro either ignores or stringizes that parameter. # and ## tokens still have no role in the grammar of the C language after preprocessing, so they will be an unexpected token and produce some kind of error if they appear anywhere.

This also contains a change to ensure that a line containing one or more illegal characters (e.g. $) and then a # is not treated as a preprocessing directive.
2022-10-13 18:35:26 -05:00
Stephen Heumann 4fe9c90942 Parse ... as a single punctuator token.
This accords with its definition in the C standards. For the time being, the old form of three separate tokens is still accepted too, because the ... token may not be scanned correctly in the obscure case where there is a line continuation between the second and third dots.

One observable effect of this is that there are no longer spaces between the dots in #pragma expand output.
2022-10-10 18:06:01 -05:00
Stephen Heumann f263066f61 Give an error for declarations that do not declare anything.
This enforces the constraint from C17 section 6.7 p2 that declarations "shall declare at least a declarator (other than the parameters of a function or the members of a structure or union), a tag, or the members of an enumeration."

Somewhat relaxed rules are used for enums in the default loose type checking mode, similar to what GCC and Clang do.
2022-10-09 22:03:06 -05:00
Stephen Heumann 05ecf5eef3 Add option to use the declared type for float/double/comp params.
This differs from the usual ORCA/C behavior of treating all floating-point parameters as extended. With the option enabled, they will still be passed in the extended format, but will be converted to their declared type at the start of the function. This is needed for strict standards conformance, because you should be able to take the address of a parameter and get a usable pointer to its declared type. The difference in types can also affect the behavior of _Generic expressions.

The implementation of this is based on ORCA/Pascal, which already did the same thing (unconditionally) with real/double/comp parameters.
2022-09-18 21:16:46 -05:00
Stephen Heumann 95ad02f0b9 Detect various errors in macro definitions.
These changes detect violations of several constraints in C17 section 6.10.3 and subsections.
2022-07-28 20:49:22 -05:00
Stephen Heumann 6e3fca8b82 Implement strict type checking for enum types.
If strict type checking is enabled, this will prohibit redefinition of enums, like:

enum E {a,b,c};
enum E {x,y,z};

It also prohibits use of an "enum E" type specifier if the enum has not been previously declared (with its constants).

These things were historically supported by ORCA/C, but they are prohibited by constraints in section 6.7.2.3 of C99 and later. (The C90 wording was different and less clear, but I think they were not intended to be valid there either.)
2022-07-19 20:35:44 -05:00
Stephen Heumann 8406921147 Parse command-line macros more consistently with macros in code.
This makes a macro defined on the command line like -Dfoo=-1 consist of two tokens, the same as it would if defined in code. (Previously, it was just one token.)

This also somewhat expands the set of macros accepted on the command line. A prefix of +, -, *, &, ~, or ! (the one-character unary operators) can now be used ahead of any identifier, number, or string. Empty macro definitions like -Dfoo= are also permitted.
2022-06-15 21:52:35 -05:00
Stephen Heumann 3c2b492618 Add support for compound literals within functions.
The basic approach is to generate a single expression tree containing the code for the initialization plus the reference to the compound literal (or its address). The various subexpressions are joined together with pc_bno pcodes, similar to the code generated for the comma operator. The initializer expressions are placed in a balanced binary tree, so that it is not excessively deep.

Note: Common subexpression elimination has poor performance for very large trees. This is not specific to compound literals, but compound literals for relatively large arrays can run into this issue. It will eventually complete and generate a correct program, but it may be quite slow. To avoid this, turn off CSE.
2022-06-08 21:34:12 -05:00
Stephen Heumann 58771ec71c Do not do macro expansion after each ## operator is evaluated.
It should only be done after all the ## operators in the macro have been evaluated, potentially merging together several tokens via successive ## operators.

Here is an example illustrating the problem:

#define merge(a,b,c) a##b##c
#define foobar
#define foobarbaz a
int merge(foo,bar,baz) = 42;
int main(void) {
        return a;
}
2022-05-24 22:38:56 -05:00
Stephen Heumann deca73d233 Properly expand macros that have the same name as a keyword or typedef.
If such macros were used within other macros, they would generally not be expanded, due to the order in which operations were evaluated during preprocessing.

This is actually an issue that was fixed by the changes from ORCA/C 2.1.0 to 2.1.1 B3, but then broken again by commit d0b4b75970.

Here is an example with the name of a keyword:

#define X long int
#define long
X x;
int main(void) {
        return sizeof(x); /* should be sizeof(int) */
}

Here is an example with the name of a typedef:

typedef short T;
#define T long
#define X T
X x;
int main(void) {
        return sizeof(x); /* should be sizeof(long) */
}
2022-05-24 22:22:37 -05:00
Stephen Heumann 21f266c5df Require use of digraphs in macro redefinitions to match the original.
This is part of the general requirement that macro redefinitions be "identical" as defined in the standard.

This affects code like:

#define x [
#define x <:
2022-04-05 19:47:22 -05:00
Stephen Heumann a1d57c4db3 Allow ORCA/C-specific keywords to be disabled via a new pragma.
This allows those tokens (asm, comp, extended, pascal, and segment) to be used as identifiers, consistent with the C standards.

A new pragma (#pragma extensions) is introduced to control this. It might also be used for other things in the future.
2022-03-26 18:45:47 -05:00
Stephen Heumann b2edeb4ad1 Properly stringize tokens that start with a trigraph.
This did not work correctly before, because such tokens were recorded as starting with the third character of the trigraph.

Here is an example affected by this:

#define mkstr(a) # a
#include <stdio.h>
int main(void) {
        puts(mkstr(??!));
        puts(mkstr(??!??!));
        puts(mkstr('??<'));
        puts(mkstr(+??!));
        puts(mkstr(+??'));
}
2022-03-25 18:10:13 -05:00
Stephen Heumann f531f38463 Use suffixes on numeric constants in #pragma expand output.
A suffix will now be printed on any integer constant with a type other than int, or any floating constant with a type other than double. This ensures that all constants have the correct types, and also serves as documentation of the types.
2022-03-01 19:46:14 -06:00
Stephen Heumann 182cf66754 Properly stringize tokens with line continuations or non-initial trigraphs.
Previously, continuations or trigraphs would be included in the string as-is, which should not be the case because they are (conceptually) processed in earlier compilation phases. Initial trigraphs still do not get stringized properly, because the token starting position is not recorded correctly for them.

This fixes code like the following:

#define mkstr(a) # a
#include <stdio.h>
int main(void) {
        puts(mkstr(a\
bc));
        puts(mkstr(qr\
));
        puts(mkstr(\
xy));
        puts(mkstr(12??/
34));
        puts(mkstr('??<'));
}
2022-03-01 19:01:11 -06:00
Stephen Heumann fec7b57ec2 Generate a string representation of tokens merged with ##.
This is necessary for correct behavior if such tokens are subsequently stringized with #. Previously, only the first half of the token would be produced.

Here is an example demonstrating the issue:

#define mkstr(a) # a
#define in_between(a) mkstr(a)
#define joinstr(a,b) in_between(a ## b)
#include <stdio.h>
int main(void) {
        puts(joinstr(123,456));
        puts(joinstr(abc,def));
        puts(joinstr(dou,ble));
        puts(joinstr(+,=));
        puts(joinstr(:,>));
}
2022-02-22 18:48:34 -06:00
Stephen Heumann 6cfe8cc886 Remove an unused string representation of macro tokens.
The string representation of macro tokens is needed for some preprocessor operations, but we get this in other ways (e.g. based on tokenStart/tokenEnd).
2022-02-21 18:39:39 -06:00
Stephen Heumann 8f27b8abdb Print any ## tokens in #pragma expand output.
Note that ## will not currently be recognized as a token in some contexts, leading to it not being printed.
2022-02-20 20:53:37 -06:00
Stephen Heumann bf7a6fa5db Use separate functions for merging tokens with ## and merging adjacent strings.
These are conceptually separate operations occurring in different phases of the translation process. This change means that ## can no longer merge string constants: such operations will give an error about an illegal token. Cases like this are technically undefined behavior, so the old behavior could have been permitted, but it is clearer and more consistent with other compilers to treat this as an error.
2022-02-20 20:16:08 -06:00