This is preparatory to supporting designated initializers.
Any struct/union type with an anonymous member now forces .sym file generation to end, since we do not have a scheme for serializing this information in a .sym file. It would be possible to do so, but for now we just avoid this situation for simplicity.
This is a minimal implementation that does not actually inline anything, but it is intended to implement the semantics defined by the C99 and later standards.
One complication is that a declaration that appears somewhere after the function body may create an external definition for a function that appeared to be an inline definition when it was defined. To support this while preserving ORCA/C's general one-pass compilation strategy, we generate code even for inline definitions, but treat them as private and add the prefix "~inline~" to the name. If they are "un-inlined" based on a later declaration, we generate a stub with external linkage that just jumps to the apparently-inline function.
This still has a few issues. A \ token may not be followed by u or U (because this triggers UCN processing). We should scan through the whole possible UCN until we can confirm whether it is actually a UCN, but that would require more lookahead. Also, \ is not handled correctly in stringization (it should form escape sequences).
This implements the catch-all category for preprocessing tokens for "each non-white-space character that cannot be one of the above" (C17 section 6.4). These may appear in skipped code, or in macros or macro parameters if they are never expanded or are stringized during macro processing. The affected characters are $, @, `, and many extended characters.
It is still an error if these tokens are used in contexts where they remain present after preprocessing. If #pragma ignore bit 0 is clear, these characters are also reported as errors in skipped code or preprocessor constructs.
If the extern declaration refers to a global variable/function for which a declaration is already visible, the inner declaration should have the composite type (and it is an error if the types are incompatible).
This affects programs like the following:
static char a[60] = {5};
int main(void) {
extern char a[];
return sizeof(a)+a[0]; /* should return 65 */
}
Function declarations within a block are now entered within its symbol table rather than moved to the global one. Several error checks are also added or tightened.
This fixes at least one bug: if a function declared within a block had the same name as a variable in an outer scope, the symbol table entry for that variable could be corrupted, leading to spurious errors or incorrect code generation. This example program illustrates the problem:
/* This should compile without errors and return 2 */
int f(void) {return 1;}
int g(void) {return 2;}
int main(void) {
int (*f)(void) = g;
{
int f(void);
}
f = g;
return f();
}
Errors now detected include:
*Duplicate declarations of a static variable within a block (with the second one initialized)
*Duplicate declarations of the same variable as static and non-static
*Declaration of the same identifier as a typedef and a variable (at file scope)
This detects errors in the following cases that were previously missed:
* A function declaration and definition being part of the same overall declaration, e.g.:
void f(void), g(void) {}
* A function declaration (not definition) with no declaration specifiers, e.g.:
f(void);
(Function definitions with no declaration specifiers continue to be accepted by default, consistent with C90 rules.)
Previously, it generally just used the later type (except for function types where only the earlier one included a prototype). One effect of this is that if a global array is first declared with a size and then redeclared without one, the size information is lost, causing the proper space not to be allocated.
See C17 section 6.2.7 p4.
Here is an example affected by the array issue (dump the object file to see the size allocated):
int foo[50];
int foo[];
This seemed to be aimed at supporting lazy allocation of symbol tables. That could be a useful optimization, but the code that existed was incomplete and did not do anything useful. That or similar code could be reintroduced as part of a full implementation of lazy allocation, if it is ever done.
A function declared "inline" with an explicit "extern" storage class has the same semantics as if "inline" was omitted. (It is not an inline definition as defined in the C standards.) The "inline" specifier suggests that the function should be inlined, but it is legal to just ignore it, as we already do for "static inline" functions.
Also add a test for the inline function specifier.
This still works by "reconstructing" the string literal text, rather than just using what was in the source code. This is not what the standards specify and can result in slightly different behavior in some corner cases, but for realistic cases it is probably fine.
According to the C standards (C17 section 6.10.3 p8), they should not be subject to macro replacement.
A similar change also applies to the "STDC" in #pragma STDC ... (but we still allow macros for other pragmas, which is allowed as part of the implementation-defined behavior of #pragma).
Here is an example affected by this issue:
#define ifdef ifndef
#ifdef foobar
#error "foobar defined?"
#else
int main(void) {}
#endif
This could access arbitrary memory locations, and could theoretically cause misbehavior including falsely recognizing the token as a pragma or accessing a softswitch/IO location.
In certain error cases, tokens from subsequent lines could get treated as part of a preprocessor expression, causing subsequent code to be essentially ignored and producing strange error messages.
Here is an example (with an error) affected by this:
#pragma optimize 0 0
int main(void) {}
The scanner has been updated so that ... should always get recognized as a single token, so this is no longer necessary as a workaround. Any code that actually uses separate . . . is non-standard and will need to be changed.
This does not really do anything, because ORCA/C does not support multithreading, but the C11 and later standards indicate it should be allowed anyway.
The main changes made to most tests are:
*Declarations always include explicit types, not relying on implicit int. The declaration of main in most test programs is changed to be "int main (void) {...}", adding an explicit return type and a prototype. (There are still some non-prototyped functions, though.)
*Functions are always declared before use, either by including a header or by providing a declaration for the specific function. The latter approach is usually used for printf, to avoid requiring ORCA/C to process stdio.h when compiling every test case (which might make test runs noticeably slower).
*Make all return statements in non-void functions (e.g. main) return a value.
*Avoid some instances of undefined behavior and type errors in printf and scanf calls.
Several miscellaneous bugs are also fixed.
There are still a couple test cases that intentionally rely on the C89 behavior, to ensure it still works.
If a struct contained a function pointer with a prototyped parameter list, processing the parameters could reset the declaredTagOrEnumConst flag, potentially leading to a spurious error, as in this example:
struct S {
int (*f)(int);
};
This also gives a better error for structs declared as containing functions.
Note that this implementation allows anonymous structures and unions to participate in initialization. That is, you can have a braced initializer list corresponding to an anonymous structure or union. Also, anonymous structures within unions follow the initialization rules for structures (and vice versa).
I think the better interpretation of the standard text is that anonymous structures and unions cannot participate in initialization as such, and instead their members are treated as members of the containing structure or union for purposes of initialization. However, all other compilers I am aware of allow anonymous structures and unions to participate in initialization, so I have implemented it that way too.
This is necessary to correctly handle line continuations in a few places:
* Between an initial . and the subsequent digit in a floating constant
* Between the third and fourth characters of a %:%: digraph
* Between the second and third dots of a ... token
Previously, these would not be tokenized correctly, leading to spurious errors in the first and second cases above.
Here is a sample program illustrating the problem:
int printf(const char * restrict, ..\
\
??/
.);
int main(void) {
double d = .??/
\
??/
\
1234;
printf("%f\n", d);
}
They are supposed to be macros, according to the C standards. This ordinarily doesn't matter, but it can be detected by #ifdef, as in the following program:
#include <stdio.h>
#ifdef stdin
int main(void) {
puts("stdin is a macro");
}
#endif
Previously, it included some instances that violate the standard constraint that a declaration must declare a declarator, a tag, or an enum constant. As of commit f263066f61, this constraint is now enforced, so those cases would (properly) give errors.
C90 had constraints requiring # and ## tokens to only appear in preprocessing directives, but C99 and later removed those constraints, so this code is no longer necessary when targeting current languages versions. (It would be necessary in a "strict C90" mode, if that was ever implemented.)
The main practical effect of this is that # and ## tokens can be passed as parameters to macros, provided the macro either ignores or stringizes that parameter. # and ## tokens still have no role in the grammar of the C language after preprocessing, so they will be an unexpected token and produce some kind of error if they appear anywhere.
This also contains a change to ensure that a line containing one or more illegal characters (e.g. $) and then a # is not treated as a preprocessing directive.
The branch range calculation treated dcl directives as taking 2 bytes rather than 4, which could result in out-of-range branches. These could result in linker errors (for forward branches) or silently generating wrong code (for backward branches).
This patch now treats dcb, dcw, and dcl as separate directives in the native-code layer, so the appropriate length can be calculated for each.
Here is an example of code affected by this:
int main(int argc, char **argv) {
top:
if (!argc) { /* this caused a linker error */
asm {
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
dcl 0
}
goto top; /* this generated bad code with no error */
}
}
Previously, the assembly-level optimizations applied to code in asm statements. In many cases, this was fine (and could even do useful optimizations), but occasionally the optimizations could be invalid. This was especially the case if the assembly involved tricky things like self-modifying code.
To avoid these problems, this patch makes the assembly optimizers ignore code from asm statements, so it is always emitted as-is, without any changes.
This fixes#34.
If one step of peephole optimization produced code that can be further optimized with more peephole optimizations, that additional optimization was not always done. This makes sure the additional optimization is done in several such cases.
This was particularly likely to affect functions containing asm blocks (because CheckLabels would never trigger rescanning in them), but could also occur in other cases.
Here is an example affected by this (generating inefficient code to load a[1]):
#pragma optimize 1
int a[10];
void f(int x) {}
int main(int argc, char **argv) {
if (argc) return 0;
f(a[1]);
}
This accords with its definition in the C standards. For the time being, the old form of three separate tokens is still accepted too, because the ... token may not be scanned correctly in the obscure case where there is a line continuation between the second and third dots.
One observable effect of this is that there are no longer spaces between the dots in #pragma expand output.
This enforces the constraint from C17 section 6.7 p2 that declarations "shall declare at least a declarator (other than the parameters of a function or the members of a structure or union), a tag, or the members of an enumeration."
Somewhat relaxed rules are used for enums in the default loose type checking mode, similar to what GCC and Clang do.
A declaration of this exact form always declares the tag T within the current scope, and as such makes this "struct T" a distinct type from any other "struct T" type in an outer scope. (Similarly for unions.)
See C17 section 6.7.2.3 p7 (and corresponding places in all other C standards).
Here is an example of a program affected by this:
struct S {char a;};
int main(void) {
struct S;
struct S *sp;
struct S {long b;} s;
sp = &s;
sp->b = sizeof(*sp);
return s.b;
}