Currently, this covers the following features, which should not cause compatibility problems:
-Recognize :: as a punctuator
-Allow one-argument _Static_assert
-Let variadic macro invocations omit final comma for empty varargs
-Define va_start() such that the second parameter is not required
-Allow UCNs less than \u00A0 in string literals and character constants
The corresponding Unicode characters (U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT and U+02C7 CARON) have the XID_Start and XID_Continue properties, so they are allowable in identifiers per C23 rules. We will just allow them in all language modes, since C99 through C17 permit the use of "other implementation-defined characters".
As of C23, UCNs within string literals or character constants can contain any valid Unicode code point, including ASCII characters or control characters.
The validity of UCNs within identifiers is now defined based on the XID_Start and XID_Continue Unicode properties. A helper program is used to generate tables of the allowed characters based on a Unicode data file. These can be updated for future Unicode versions by re-running the helper program using the updated Unicode data files.
For the moment, it behaves as expected with regard to token merging and stringization, but otherwise it doesn't do anything. (It can be used in attributes, but those aren't implemented yet.)
These are now treated equivalently to the old versions that start with _.
Note that thread_localsy and alignassy are canonicalized to _Thread_localsy and _Alignassy when recorded as declaration modifiers.
This is allowed based on the C standard syntax, but it previously gave a spurious error in ORCA/C, because the parenthesized type name at the beginning of the compound literal was parsed as the complete operand to sizeof.
Here is an example program affected by this:
int main(void) {
    return sizeof (char[]){1,2,3}; // should return 3
}
These are tokens that follow the syntax for a preprocessing number, but not for an integer or floating constant after preprocessing. They are now allowed within the preprocessing phases of the compiler. They are not legal after preprocessing, but they may be used as operands of the # and ## preprocessor operators to produce legal tokens.
The issue was that if a 64-bit value was being loaded via one pointer and stored via another, the load and store parts could both be using y for their indexing, but they would clash with each other, potentially leading to loads coming from the wrong place.
Here are some examples that illustrate the problem:
/* example 1 */
int main(void) {
    struct {
        char c[16];
        long long x;
    } s = {.x = 0x1234567890abcdef}, *sp = &s;
    long long ll, *llp = &ll;
    *llp = sp->x;
    return ll != s.x; // should return 0
}
/* example 2 */
int main(void) {
    struct {
        char c[16];
        long long x;
    } s = {.x = 0x1234567890abcdef}, *sp = &s;
    long long ll, *llp = &ll;
    unsigned i = 0;
    *llp = sp[i].x;
    return ll != s.x; // should return 0
}
/* example 3 */
int main(void) {
    long long x[2] = {0, 0x1234567890abcdef}, *xp = x;
    long long ll, *llp = &ll;
    unsigned i = 1;
    *llp = xp[i];
    return ll != x[1]; // should return 0
}
The code was not properly adding in the offset of the 64-bit value from the pointed-to location, so the wrong memory location would be accessed. This affected indirect accesses to non-initial structure members, when used as operands to certain operations.
Here is an example showing the problem:
#include <stdio.h>
long long x = 123456;
struct S {
    long long a;
    long long b;
} s = {0, 123456};
int main(void) {
    struct S *sp = &s;
    if (sp->b != x) {
        puts("error");
    }
}
They were not being properly recognized as structs/unions, so they were being passed by address rather than by value as they should be.
Here is an example affected by this:
struct S {int a,b,c,d;};
int f(struct S s) {
    return s.a + s.b + s.c + s.d;
}
int main(void) {
    const struct S s = {1,2,3,4};
    return f(s);
}
The optimization applies to code sequences like:
        dec abs
        lda abs
        beq ...
where the dec and lda were supposed to refer to the same location.
There were two problems with this optimization as written:
-It considered the dec and lda to refer to the same location even if they were actually references to different elements of the same array.
-It did not work in the case where the A register value was needed in subsequent code.
The first of these was already an issue in previous ORCA/C releases, as in the following example:
#pragma optimize -1
int x[2] = {0,0};
int main(void) {
    --x[0];
    if (x[1] != 0)
        return 123;
    return 0; /* should return 0 */
}
I do not believe the second problem was triggered by any code sequences generated in previous releases of ORCA/C, but it can be triggered after commit 4c402fc88, e.g. by the following example:
#pragma optimize -1
int x = 1;
int main(void) {
    int y = 123;
    --x;
    return x == 0; /* should return 1 */
}
Since the circumstances where this peephole optimization was triggered validly are pretty obscure, just disabling it should have a minimal impact on the generated code.
There were a couple of issues here:
*If the type name contained a semicolon (for struct/union member declarations), a spurious error would be reported.
*Tags or enumeration constants declared in the type name should be in scope within the loop, but were not.
These both stemmed from the way the parser handled the third expression, which was to save the tokens from it and re-inject them at the end of the loop. To get the scope issues right, the expression really needs to be evaluated at the point where it occurs, so we now do that. To enable that while still placing the code at the end of the loop, a mechanism to remove and re-insert sections of generated code is introduced.
Here is an example illustrating the issues:
int main(void) {
    int i, j, x;
    for (i = 0; i < 123; i += sizeof(struct {int a;}))
        for (j = 0; j < 123; j += sizeof(enum E {A,B,C}))
            x = i + j + A;
}
This affects certain places where code like the following could be generated:
        bCC  lab2
lab1    brl  ...
lab2    ...
If lab1 is no longer referenced due to previous optimizations, it can be removed. This then allows the bCC+brl combination to be shortened to a single conditional branch, if the target is close enough.
This introduces a flag for tracking and potentially removing labels that are only used as the target of one branch. This could be used more widely, but currently it is only used for the specific code sequences shown above. Using it in other places could potentially open up possibilities for invalid native-code optimizations that were previously blocked due to the presence of the label.
This generates slightly better code for indexing a global/static char array with a signed 16-bit index and a positive offset, e.g. a[i+1].
Here is an example that is affected:
#pragma memorymodel 1
#pragma optimize -1
char a[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
int main(int argc, char *argv[]) {
    return a[argc+2];
}
Specifically, this affects the case where a macro argument ends with the name of a function-like macro that takes zero parameters. When that argument is initially expanded, the macro should not be expanded, even if parentheses appear later in the macro invocation it is part of or in the subsequent program code. This is the case because the C standards specify that "The argument's preprocessing tokens are completely macro replaced before being substituted as if they formed the rest of the preprocessing file with no other preprocessing tokens being available." (The macro may still be expanded at a later stage, but that depends on other rules that determine whether the expansion is suppressed.) The logic for this was already present for macros taking one or more arguments; this extends it to apply to function-like macros taking zero parameters as well.
I'm not sure that this makes any practical difference while cycles of mutually-referential macros still aren't handled correctly (issue #48), but if that were fixed then there would be some cases that depend on this behavior.
Previously, there were a couple of problems:
*If the parameter that was passed an empty argument appeared directly after the ##, the ## would permanently be removed from the macro record, affecting subsequent uses of the macro even if the argument was not empty.
*If the parameter that was passed an empty argument appeared between two ## operators, both would effectively be skipped, so the tokens to the left of the first ## and to the right of the second would not be combined.
This example illustrates both issues (not expected to compile; just check preprocessor output):
#pragma expand 1
#define x(a,b,c) a##b##c
x(1, ,3)
x(a,b,c)
Previously, it was not necessarily set correctly for the newly-generated token. This would result in incorrect behavior if that token was an operand to another ## operator, as in the following example:
#define x(a,b,c) a##b##c
x(1,2,3)
There was code that would attempt to use the cType field of the type record, but this is only valid for scalar types, not pointer types. In the case of a pointer type, the upper two bytes of the pointer would be interpreted as a cType value, and if they happened to have one of the values being tested for, incorrect intermediate code would be generated. The lower two bytes of the pointer would be used as a baseType value; this would most likely result in "compiler error" messages from the code generator, but might cause incorrect code generation with no errors if that value happened to correspond to a real baseType.
Code like the following might cause this error, although it only occurs if pointers have certain values and therefore depends on the memory layout at compile time:
void f(const int **p) {
    (*p)++;
}
This bug was introduced in commit f2a66a524a.
Division by zero produces undefined behavior if it is evaluated, but in general we cannot tell whether a given expression will actually be evaluated at run time, so we should not report this as a compile-time error.
We still report an error for division by zero in constant expressions that need to be evaluated at compile time. We also still produce a lint message about division by zero if the appropriate flag is enabled.