Now that we can create much more exhaustive X86 memory folding tests, this patch adds the missing AVX1/F16C floating point instruction stack foldings we can easily test for including the scalar intrinsics (add, div, max, min, mul, sub), conversions float/int to double, half precision conversions, rounding, dot product and bit test. The patch also adds a couple of obviously missing SSE instructions (more to follow once we have full SSE testing).
Now that scalar folding is working it broke a very old test (2006-10-07-ScalarSSEMiscompile.ll) - this test appears to make no sense as its trying to ensure that a scalar subtraction isn't folded as it 'would zero the top elts of the loaded vector' - this test just appears to be wrong to me.
Differential Revision: http://reviews.llvm.org/D7055
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226513 91177308-0d34-0410-b5e6-96231b3b80d8
No change in this commit, but clang was changed to also produce trivial comdats when
needed.
Original message:
Don't create new comdats in CodeGen.
This patch stops the implicit creation of comdats during codegen.
Clang now sets the comdat explicitly when it is required. With this patch clang and gcc
now produce the same result in pr19848.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226467 91177308-0d34-0410-b5e6-96231b3b80d8
Our PPC64 ELF V2 call lowering logic added r2 as an operand to all direct call
instructions in order to represent the dependency on the TOC base pointer
value. Restricting this to ELF V2, however, does not seem to make sense: calls
under ELF V1 have the same dependence, and indirect calls have an r2 dependence
just as direct ones. Make sure the dependence is noted for all calls under both
ELF V1 and ELF V2.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226432 91177308-0d34-0410-b5e6-96231b3b80d8
Begun adding more exhaustive tests - all floating point instructions should now be either tested or have placeholders. We do seem to have a number of missing instructions, I will add a patch for review once the remaining working instructions are added.
I'll then move on to SSE tests and then the integer instructions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226400 91177308-0d34-0410-b5e6-96231b3b80d8
The default calling convention specified by the PPC64 ELF (V1 and V2) ABI is
designed to work with both prototyped and non-prototyped/varargs functions. As
a result, GPRs and stack space are allocated for every argument, even those
that are passed in floating-point or vector registers.
GlobalOpt::OptimizeFunctions will transform local non-varargs functions (that
do not have their address taken) to use the 'fast' calling convention.
When functions are using the 'fast' calling convention, don't allocate GPRs for
arguments passed in other types of registers, and don't allocate stack space for
arguments passed in registers. Other changes for the fast calling convention
may be added in the future.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226399 91177308-0d34-0410-b5e6-96231b3b80d8
R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it is
reserved for use as an "environment pointer" for compilation models that
require such a thing. We don't, we also don't need a second scratch register,
and because we support only "local" patchpoint call targets, we might as well
let R11 be used for anyregcc patchpoints.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226369 91177308-0d34-0410-b5e6-96231b3b80d8
Loading 2 2x32-bit float vectors into the bottom half of a 256-bit vector
produced suboptimal code in AVX2 mode with certain IR combinations.
In particular, the IR optimizer folded 2f32 + 2f32 -> 4f32, 4f32 + 4f32
(undef) -> 8f32 into a 2f32 + 2f32 -> 8f32, which seems more canonical,
but then mysteriously generated rather bad code; the movq/movhpd combination
didn't match.
The problem lay in the BUILD_VECTOR optimization path. The 2f32 inputs
would get promoted to 4f32 by the type legalizer, eventually resulting
in a BUILD_VECTOR on two 4f32 into an 8f32. The BUILD_VECTOR then, recognizing
these were both half the output size, concatted them and then produced
a shuffle. However, the resulting concat + shuffle was more complex than
it should be; in the case where the upper half of the output is undef, we
probably want to generate shuffle + concat instead.
This enhancement causes the vector_shuffle combine step to recognize this
suboptimal pattern and correct it. I included it there instead of in BUILD_VECTOR
in case the same suboptimal pattern occurs for other reasons.
This results in the optimizer correctly producing the optimal movq + movhpd
sequence for all three variations on this IR, even with AVX2.
I've included a test case.
Radar link: rdar://problem/19287012
Fix for PR 21943.
From: Fiona Glaser <fglaser@apple.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226360 91177308-0d34-0410-b5e6-96231b3b80d8
Similar to the unaligned cases.
Test was generated with update_llc_test_checks.py.
Part of <rdar://problem/17688758>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226296 91177308-0d34-0410-b5e6-96231b3b80d8
This patch disables target specific combine on X86ISD::INSERTPS dag nodes
if optlevel is CodeGenOpt::None.
The backend currently implements a target specific combine rule that converts
a vector load used by an INSERTPS dag node into a scalar load plus a
scalar_to_vector. This allows ISel to select a single INSERTPSrm instead of
two instructions (i.e. a vector load plus INSERTPSrr).
However, the existing target combine rule on INSERTPS nodes only works under
the assumption that ISel will always be able to match an INSERTPSrm. This is
not true in general at -O0, since the backend only allows folding a load into
the memory operand of an instruction if the optimization level is not
CodeGenOpt::None.
In the example below:
//
__m128 test(__m128 a, __m128 *b) {
__m128 c = _mm_insert_ps(a, *b, 1 << 6);
return c;
}
//
Before this patch, at -O0, the backend would have canonicalized the load to 'b'
into a scalar load plus scalar_to_vector. Later on, ISel would have selected an
INSERTPSrr leaving the insertps mask in an inconsistent state:
movss 4(%rdi), %xmm1
insertps $64, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1,2,3].
With this patch, the backend avoids folding the vector load into the operand of
the INSERTPS. The new codegen at -O0 is:
movaps (%rdi), %xmm1
insertps $64, %xmm1, %xmm0 # %xmm1[1],xmm0[1,2,3].
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226277 91177308-0d34-0410-b5e6-96231b3b80d8
The current 'big vectors' stack folded reload testing pattern is very bulky and makes it difficult to test all instructions as big vectors will tend to use only the ymm instruction implementations.
This patch changes the tests to use a nop call that lists explicit xmm registers as sideeffects, with this we can force a partial register spill of the relevant registers and then check that the reload is correctly folded. The asm generated only adds the forced spill, a nop instruction and a couple of extra labels (a fraction of the current approach).
More exhaustive tests will follow shortly, I've added some extra tests (the xmm versions of some of the existing folding tests) as a starting point.
Differential Revision: http://reviews.llvm.org/D6932
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226264 91177308-0d34-0410-b5e6-96231b3b80d8
Bill Schmidt pointed out that some adjustments would be needed to properly
support powerpc64le (using the ELF V2 ABI). For one thing, R11 is not available
as a scratch register, so we need to use R12. R12 is also available under ELF
V1, so to maintain consistency, I flipped the order to make R12 the first
scratch register in the array under both ABIs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226247 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r226173, adding r226038 back.
No change in this commit, but clang was changed to also produce trivial comdats for
costructors, destructors and vtables when needed.
Original message:
Don't create new comdats in CodeGen.
This patch stops the implicit creation of comdats during codegen.
Clang now sets the comdat explicitly when it is required. With this patch clang and gcc
now produce the same result in pr19848.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226242 91177308-0d34-0410-b5e6-96231b3b80d8
Instructions with 1 operand can still use source modifiers,
so make sure we don't print an extra comma afterwards.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226226 91177308-0d34-0410-b5e6-96231b3b80d8
Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the
POWER7, A2 and earlier cores) are really pointers to a function descriptor, a
structure with three pointers: the actual pointer to the code to which to jump,
the pointer to the TOC needed by the callee, and an environment pointer. We
used to chain these loads, and make them opaque to the rest of the optimizer,
so that they'd always occur directly before the call. This is not necessary,
and in fact, highly suboptimal on embedded cores. Once the function pointer is
known, the loads can be performed ahead of time; in fact, they can be hoisted
out of loops.
Now these function descriptors are almost always generated by the linker, and
thus the contents of the descriptors are invariant. As a result, by default,
we'll mark the associated loads as invariant (allowing them to be hoisted out
of loops). I've added a target feature to turn this off, however, just in case
someone needs that option (constructing an on-stack descriptor, casting it to a
function pointer, and then calling it cannot be well-defined C/C++ code, but I
can imagine some JIT-compilation system doing so).
Consider this simple test:
$ cat call.c
typedef void (*fp)();
void bar(fp x) {
for (int i = 0; i < 1600000000; ++i)
x();
}
$ cat main.c
typedef void (*fp)();
void bar(fp x);
void foo() {}
int main() {
bar(foo);
}
On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads
as invariant brings the execution time down to ~8 seconds from ~32 seconds with
the loads in the loop.
The difference on the POWER7 is smaller. Compiling with:
gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]
clang -O3 -mcpu=native call.c main.c : ~5.3 seconds
clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
(looks like we'd benefit from additional loop unrolling here, as a first
guess, because this is faster with the extra loads)
The -mno-invariant-function-descriptors will be added to Clang shortly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226207 91177308-0d34-0410-b5e6-96231b3b80d8
Reapply r226071 with fixes. Two fixes:
1. We need to manually remove the old and create the new 'deaf defs'
associated with physical register definitions when we move the definition of
the physical register from the copy point to the point of the original vreg def.
This problem was picked up by the machinstr verifier, and could trigger a
verification failure on test/CodeGen/X86/2009-02-12-DebugInfoVLA.ll, so I've
turned on the verifier in the tests.
2. When moving the def point of the phys reg up, we need to make sure that it
is neither defined nor read in between the two instructions. We don't, however,
extend the live ranges of phys reg defs to cover uses, so just checking for
live-range overlap between the pair interval and the phys reg aliases won't
pick up reads. As a result, we manually iterate over the range and check for
reads.
A test soon to be committed to the PowerPC backend will test this change.
Original commit message:
[RegisterCoalescer] Remove copies to reserved registers
This allows the RegisterCoalescer to join "non-flipped" range pairs with a
physical destination register -- which allows the RegisterCoalescer to remove
copies like this:
<vreg> = something (maybe a load, for example)
... (things that don't use PHYSREG)
PHYSREG = COPY <vreg>
(with all of the restrictions normally applied by the RegisterCoalescer: having
compatible register classes, etc. )
Previously, the RegisterCoalescer handled only the opposite case (copying
*from* a physical register). I don't handle the problem fully here, but try to
get the common case where there is only one use of <vreg> (the COPY).
An upcoming commit to the PowerPC backend will make this pattern much more
common on PPC64/ELF systems.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226200 91177308-0d34-0410-b5e6-96231b3b80d8
Reverting this while I investigate some bad behavior this is causing. As a
possibly-related issue, adding -verify-machineinstrs to one of the test cases
now fails because of this change:
llc test/CodeGen/X86/2009-02-12-DebugInfoVLA.ll -march=x86-64 -o - -verify-machineinstrs
*** Bad machine code: No instruction at def index ***
- function: foo
- basic block: BB#0 return (0x10007e21f10) [0B;736B)
- liverange: [128r,128d:9)[160r,160d:8)[176r,176d:7)[336r,336d:6)[464r,464d:5)[480r,480d:4)[624r,624d:3)[752r,752d:2)[768r,768d:1)[78
4r,784d:0) 0@784r 1@768r 2@752r 3@624r 4@480r 5@464r 6@336r 7@176r 8@160r 9@128r
- register: %DS
Valno #3 is defined at 624r
*** Bad machine code: Live segment doesn't end at a valid instruction ***
- function: foo
- basic block: BB#0 return (0x10007e21f10) [0B;736B)
- liverange: [128r,128d:9)[160r,160d:8)[176r,176d:7)[336r,336d:6)[464r,464d:5)[480r,480d:4)[624r,624d:3)[752r,752d:2)[768r,768d:1)[78
4r,784d:0) 0@784r 1@768r 2@752r 3@624r 4@480r 5@464r 6@336r 7@176r 8@160r 9@128r
- register: %DS
[624r,624d:3)
LLVM ERROR: Found 2 machine code errors.
where 624r corresponds exactly to the interval combining change:
624B %RSP<def> = COPY %vreg16; GR64:%vreg16
Considering merging %vreg16 with %RSP
RHS = %vreg16 [608r,624r:0) 0@608r
updated: 608B %RSP<def> = MOV64rm <fi#3>, 1, %noreg, 0, %noreg; mem:LD8[%saved_stack.1]
Success: %vreg16 -> %RSP
Result = %RSP
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226086 91177308-0d34-0410-b5e6-96231b3b80d8
This allows the RegisterCoalescer to join "non-flipped" range pairs with a
physical destination register -- which allows the RegisterCoalescer to remove
copies like this:
<vreg> = something (maybe a load, for example)
... (things that don't use PHYSREG)
PHYSREG = COPY <vreg>
(with all of the restrictions normally applied by the RegisterCoalescer: having
compatible register classes, etc. )
Previously, the RegisterCoalescer handled only the opposite case (copying
*from* a physical register). I don't handle the problem fully here, but try to
get the common case where there is only one use of <vreg> (the COPY).
An upcoming commit to the PowerPC backend will make this pattern much more
common on PPC64/ELF systems.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226071 91177308-0d34-0410-b5e6-96231b3b80d8
This commit moves `MDLocation`, finishing off PR21433. There's an
accompanying clang commit for frontend testcases. I'll attach the
testcase upgrade script I used to PR21433 to help out-of-tree
frontends/backends.
This changes the schema for `DebugLoc` and `DILocation` from:
!{i32 3, i32 7, !7, !8}
to:
!MDLocation(line: 3, column: 7, scope: !7, inlinedAt: !8)
Note that empty fields (line/column: 0 and inlinedAt: null) don't get
printed by the assembly writer.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226048 91177308-0d34-0410-b5e6-96231b3b80d8
This patch stops the implicit creation of comdats during codegen.
Clang now sets the comdat explicitly when it is required. With this patch clang and gcc
now produce the same result in pr19848.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226038 91177308-0d34-0410-b5e6-96231b3b80d8
Some benchmarks have shown that this could lead to a potential
performance benefit, and so adding some flags to try to help measure the
difference.
A possible explanation. In diamond-shaped CFGs (A followed by either
B or C both followed by D), putting B and C both in between A and
D leads to the code being less dense than it could be. Always either
B or C have to be skipped increasing the chance of cache misses etc.
Moving either B or C to after D might be beneficial on average.
In the long run, but we should probably do a better job of analyzing the
basic block and branch probabilities to move the correct one of B or
C to after D. But even if we don't use this in the long run, it is
a good baseline for benchmarking.
Original patch authored by Daniel Jasper with test tweaks and a second
flag added by me.
Differential Revision: http://reviews.llvm.org/D6969
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226034 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Kit Barton.
Support for the ICBT instruction is currently present, but limited to
embedded processors. This change adds a new FeatureICBT that can be used
to identify whether the ICBT instruction is available on a specific processor.
Two new tests are added:
* Positive test to ensure the icbt instruction is present when using
-mcpu=pwr8
* Negative test to ensure the icbt instruction is not generated when
using -mcpu=pwr7
Both test cases use the Prefetch opcode in LLVM. They are based on the
ppc64-prefetch.ll test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226033 91177308-0d34-0410-b5e6-96231b3b80d8
This commit refines the pattern for the octeon seq/seqi/sne/snei instructions.
The target register is set to 0 or 1 according to the result of the comparison.
In C, this is something like
rd = (unsigned long)(rs == rt)
This commit adds a zext to bring the result to i64. With this change the
instruction is selected for this type of code. (gcc produces the same code for
the above C code.)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225968 91177308-0d34-0410-b5e6-96231b3b80d8