re-combining shuffles because nothing was available in the wider vector
type.
The key observation (which I've put in the comments for future
maintainers) is that at this point, no further combining is really
possible. So even though these shuffles could trivially be combined, we
need to actually do that as we produce them, because they are produced
this late in the lowering.
This fixes another (huge) part of the Halide vector shuffle regressions.
As it happens, this was already well covered by the tests, but I hadn't
noticed how bad some of these got. The specific patterns that turn
directly into unpckl/h patterns were occurring *many* times in common
vector processing code.
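For reference, the unpckl/h patterns mentioned here are the interleaving
shuffles; a minimal illustration using SSE intrinsics (my own example,
not from the patch):

  #include <xmmintrin.h>

  // Shuffle masks like <0,4,1,5> and <2,6,3,7> should lower directly to
  // unpcklps/unpckhps: interleave the low (or high) halves of two vectors.
  __m128 interleave_lo(__m128 a, __m128 b) {
    return _mm_unpacklo_ps(a, b);  // { a0, b0, a1, b1 }
  }
  __m128 interleave_hi(__m128 a, __m128 b) {
    return _mm_unpackhi_ps(a, b);  // { a2, b2, a3, b3 }
  }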
Sadly, there are still more problems here, but I'm teasing them apart
incrementally, and it looks like this is the core of the problem in the
splitting logic.
There is some chance of regression here; you can see it in the test
changes. Specifically, where we stop forming pshufb in some cases, it is
possible that pshufb was in fact faster. Intel "says" that pshufb is
slower than the instruction sequences replacing it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221852 91177308-0d34-0410-b5e6-96231b3b80d8
Optimize selects of i1 in the presence of 'true' and 'false' operands to simple
logic operations.
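The folds amount to the usual boolean identities; a small self-check (my
illustration, not the committed code):

  #include <cassert>

  // Exhaustively verify the select-of-i1 -> logic-op identities.
  int main() {
    for (int ci = 0; ci < 2; ++ci)
      for (int xi = 0; xi < 2; ++xi) {
        bool c = ci, x = xi;
        assert((c ? true : x) == (c || x));   // select c, true, x  -> or c, x
        assert((c ? x : false) == (c && x));  // select c, x, false -> and c, x
        assert((c ? false : x) == (!c && x)); // select c, false, x -> and !c, x
        assert((c ? x : true) == (!c || x));  // select c, x, true  -> or !c, x
      }
    return 0;
  }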
This fixes rdar://problem/18960150.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221848 91177308-0d34-0410-b5e6-96231b3b80d8
This folds the compare emission into the select emission when possible, so we
can directly use the flags and don't have to emit a separate compare.
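Roughly speaking (a sketch of the idea, not the committed code), the win
shows up in functions like this:

  #include <cstdint>

  // Without the fold, FastISel materializes 'c' into a register and then
  // compares it against zero again before the select; with the fold, the
  // flags set by the original compare feed the conditional select directly.
  int32_t pick_smaller(int32_t a, int32_t b) {
    bool c = a < b;   // one compare sets the flags
    return c ? a : b; // the select reuses those flags; no second compare
  }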
Related to rdar://problem/18960150.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221847 91177308-0d34-0410-b5e6-96231b3b80d8
This is a follow-on to r221706 and r221731, and is discussed in more detail in PR21385.
This patch also loosens the testcase checking for btver2. We know that the "1.0" will be loaded, but
we can't tell exactly when, so replace the CHECK-NEXT specifiers with plain CHECKs. The CHECK-NEXT
sequence relied on a quirk of post-RA-scheduling that may change independently of anything in these tests.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221819 91177308-0d34-0410-b5e6-96231b3b80d8
One of them (__memcpy_chk) was already there; the others were checked
by comparing function names.
Note that the fortified libfuncs are now part of TLI, but are always
available, because they aren't generated, only optimized into the
non-checking versions.
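As a hedged illustration of what "optimized into the non-checking
versions" means (__memcpy_chk is the real C library entry point; the
fold shown is a paraphrase, not the actual SimplifyLibCalls code):

  #include <cstring>

  // Fortified entry point, provided by the C library rather than generated.
  extern "C" void *__memcpy_chk(void *dst, const void *src,
                                size_t len, size_t dstlen);

  void copy_header(char (&dst)[16], const char *src) {
    // A frontend may emit __memcpy_chk(dst, src, 8, 16); since 8 <= 16 is
    // provable at compile time, the call simplifies to plain memcpy.
    __memcpy_chk(dst, src, 8, sizeof dst);
  }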
Differential Revision: http://reviews.llvm.org/D6179
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221817 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Reapply r221772. The old patch broke the bots because the @indvar_32_bit
test was run whether or not NVPTX was enabled.
IndVarSimplify should not widen an indvar if arithmetic on the wider
indvar is more expensive than on the narrower indvar. For instance,
although NVPTX64 treats i64 as a legal type, an ADD on i64 is twice as
expensive as one on i32, because the hardware needs to simulate a 64-bit
integer using two 32-bit integers.
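To make the trade-off concrete, a hypothetical example (mine, not the
test case) of the two loop shapes:

  #include <cstdint>

  // On NVPTX, the 64-bit increment and compare in the widened loop are
  // each simulated with two 32-bit operations, so widening can be a
  // pessimization even though i64 is a legal type.
  int64_t sum_narrow(const int32_t *a, int32_t n) {
    int64_t s = 0;
    for (int32_t i = 0; i < n; ++i)  // narrow (i32) induction variable
      s += a[i];
    return s;
  }

  int64_t sum_wide(const int32_t *a, int32_t n) {
    int64_t s = 0;
    for (int64_t i = 0; i < n; ++i)  // widened (i64) induction variable
      s += a[i];
    return s;
  }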
Split from D6188, and based on D6195 which adds NVPTXTargetTransformInfo.
Fixes PR21148.
Test Plan:
Added @indvar_32_bit, which verifies that we do not widen an indvar if
arithmetic on the wider type is more expensive. This test is run only
when NVPTX is enabled.
Reviewers: jholewinski, eliben, meheff, atrick
Reviewed By: atrick
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D6196
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221799 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Large-model was added first. With the addition of support for multiple PIC
models in LLVM, now add small-model PIC for 32-bit PowerPC, SysV4 ABI. This
generates more optimal code for shared libraries with fewer than about
16380 data objects.
Test Plan: Test cases added or updated
Reviewers: joerg, hfinkel
Reviewed By: hfinkel
Subscribers: jholewinski, mcrosier, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D5399
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221791 91177308-0d34-0410-b5e6-96231b3b80d8
cases from Halide folks. This initial step was extracted from
a prototype change by Clay Wood to try and address regressions found
with Halide and the new vector shuffle lowering.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221779 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
IndVarSimplify should not widen an indvar if arithmetic on the wider
indvar is more expensive than on the narrower indvar. For instance,
although NVPTX64 treats i64 as a legal type, an ADD on i64 is twice as
expensive as one on i32, because the hardware needs to simulate a 64-bit
integer using two 32-bit integers.
Split from D6188, and based on D6195 which adds NVPTXTargetTransformInfo.
Fixes PR21148.
Test Plan:
Added @indvar_32_bit, which verifies that we do not widen an indvar if
arithmetic on the wider type is more expensive.
Reviewers: jholewinski, eliben, meheff, atrick
Reviewed By: atrick
Subscribers: jholewinski, llvm-commits
Differential Revision: http://reviews.llvm.org/D6196
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221772 91177308-0d34-0410-b5e6-96231b3b80d8
This patch enables the vec_vsx_ld and vec_vsx_st intrinsics for
PowerPC, which provide programmer access to the lxvd2x, lxvw4x,
stxvd2x, and stxvw4x instructions.
New LLVM intrinsics are provided to represent these four instructions
in IntrinsicsPowerPC.td. These are patterned after the similar
intrinsics for lvx and stvx (Altivec). In PPCInstrVSX.td, these
intrinsics are tied to the code gen patterns, with additional patterns
to allow plain vanilla loads and stores to still generate these
instructions.
At -O1 and higher the intrinsics are immediately converted to loads
and stores in InstCombineCalls.cpp. This will open up more
optimization opportunities while still allowing the correct
instructions to be generated. (Similar code exists for aligned
Altivec loads and stores.)
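For readers unfamiliar with these instructions, the practical difference
from the Altivec lvx/stvx forms is that the VSX loads and stores
tolerate unaligned addresses. A rough C++ model of the load side
(illustrative only; endian handling omitted):

  #include <cstdint>
  #include <cstring>

  struct V4I { int32_t e[4]; };  // stand-in for a 4 x i32 vector

  // Model of an lxvw4x-style load: a 16-byte vector load that, unlike
  // lvx, does not require the address to be 16-byte aligned.
  V4I load_vsx_w4(const void *p) {
    V4I v;
    std::memcpy(&v, p, sizeof v);  // byte-wise copy: any alignment works
    return v;
  }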
The new intrinsics are added to the code that checks for consecutive
loads and stores in PPCISelLowering.cpp, as well as to
PPCTargetLowering::getTgtMemIntrinsic().
There's a new test to verify the correct instructions are generated.
The loads and stores tend to be reordered, so the test just counts
their number. It runs at -O2, as it's not very effective to test this
at -O0, when many unnecessary loads and stores are generated.
I ended up having to modify vsx-fma-m.ll. It turns out this test case
is slightly unreliable, but I don't know a good way to prevent
problems with it. The xvmaddmdp instructions read and write the same
register, which is one of the multiplicands. Commutativity allows
either to be chosen. If the FMAs are reordered differently than
expected by the test, the register assignment can be different as a
result. Hopefully this doesn't change often.
There is a companion patch for Clang.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221767 91177308-0d34-0410-b5e6-96231b3b80d8
With this patch MCDisassembler::getInstruction takes an ArrayRef<uint8_t>
instead of a MemoryObject.
Even on X86 there is a maximum size an instruction can have. Given
that, it seems far simpler and more efficient to just pass an ArrayRef
to the disassembler, instead of passing a MemoryObject and having it do
a virtual call every time it wants some extra bytes.
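The shape of the change, sketched (simplified; the real interfaces carry
more parameters, such as the output streams):

  #include <cstddef>
  #include <cstdint>

  // Before: the disassembler pulled bytes through a virtual interface,
  // paying an indirect call for every refill.
  struct MemoryObjectLike {
    virtual ~MemoryObjectLike() = default;
    virtual uint64_t readBytes(uint8_t *Buf, uint64_t Size,
                               uint64_t Addr) const = 0;
  };

  // After: the caller passes a plain pointer+length view of the bytes
  // (essentially what ArrayRef<uint8_t> is), so access is direct.
  struct ByteView {
    const uint8_t *Data;
    size_t Length;
    uint8_t operator[](size_t I) const { return Data[I]; }
  };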
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221751 91177308-0d34-0410-b5e6-96231b3b80d8
Instead, we're going to separate metadata from the Value hierarchy. See
PR21532.
This reverts commit r221375.
This reverts commit r221373.
This reverts commit r221359.
This reverts commit r221167.
This reverts commit r221027.
This reverts commit r221024.
This reverts commit r221023.
This reverts commit r220995.
This reverts commit r220994.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221711 91177308-0d34-0410-b5e6-96231b3b80d8
This commit adds a new pass that can inject checks before indirect calls to
make sure that these calls target known locations. It supports three types of
checks and, at compile time, it can take the name of a custom function to call
when an indirect call check fails. The default failure function ignores the
error and continues.
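Conceptually (my sketch; the pass injects IR, and the symbol names here
are illustrative), each checked call site looks like:

  extern "C" void cfi_failure();             // default: ignore and continue
  extern const void *TableStart, *TableEnd;  // jump-table bounds

  using FnPtr = void (*)();

  void call_checked(FnPtr F) {
    // Verify the target lies inside the known jump-instruction table
    // before transferring control.
    const void *P = reinterpret_cast<const void *>(F);
    if (P < TableStart || P >= TableEnd)
      cfi_failure();
    F();
  }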
This pass incidentally moves the function JumpInstrTables::transformType from
private to public and makes it static (with a new argument that specifies the
table type to use); this is so that the CFI code can transform function types
at call sites to determine which jump-instruction table to use for the check at
that site.
Also, this removes support for jumptables in ARM, pending further performance
analysis and discussion.
Review: http://reviews.llvm.org/D4167
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221708 91177308-0d34-0410-b5e6-96231b3b80d8
This is a first step toward generating SSE rcp instructions for reciprocal
calculations when fast-math allows it. This is very similar to the rsqrt
optimization enabled in D5658 (http://reviews.llvm.org/rL220570).
For now, be conservative and only enable this for AMD btver2 where performance
improves significantly both in terms of latency and throughput.
We may never enable this codegen for Intel Core* chips because the divider circuits
are just too fast. On SandyBridge, divss can be as fast as 10 cycles versus the
21-cycle critical path for the rcp + mul + sub + mul + add estimate.
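That estimate sequence is one Newton-Raphson refinement of the hardware
rcp result; a packed-float sketch (mine; the codegen described here is
the scalar divss/rcpss form):

  #include <xmmintrin.h>

  // One Newton-Raphson step on the reciprocal estimate:
  //   x1 = x0 + x0*(1 - a*x0), where x0 = rcp(a)
  // i.e. exactly the rcp + mul + sub + mul + add sequence.
  __m128 fast_recip(__m128 a) {
    __m128 x0 = _mm_rcp_ps(a);                  // rcp: ~12-bit estimate
    __m128 e  = _mm_sub_ps(_mm_set1_ps(1.0f),
                           _mm_mul_ps(a, x0));  // mul + sub: 1 - a*x0
    return _mm_add_ps(x0, _mm_mul_ps(x0, e));   // mul + add: x0 + x0*e
  }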
Follow-on patches may allow configuration of the number of Newton-Raphson refinement
steps, add AVX512 support, and enable the optimization for more chips.
More background here: http://llvm.org/bugs/show_bug.cgi?id=21385
Differential Revision: http://reviews.llvm.org/D6175
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221706 91177308-0d34-0410-b5e6-96231b3b80d8
My original support for the general dynamic and local dynamic TLS
models contained some fairly obtuse hacks to generate calls to
__tls_get_addr when lowering a TargetGlobalAddress. Rather than
generating real calls, special GET_TLS_ADDR nodes were used to wrap
the calls and only reveal them at assembly time. I attempted to
provide correct parameter and return values by chaining CopyToReg and
CopyFromReg nodes onto the GET_TLS_ADDR nodes, but this was also not
fully correct. Problems were seen with two back-to-back stores to TLS
variables, where the call sequences ended up overlapping with unhappy
results. Additionally, since these weren't real calls, the proper
register side effects of a call were not recorded, so clobbered values
were kept live across the calls.
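For context, a hedged model of why these must be real calls (the
tls_index layout is the usual ELF convention; details vary by ABI):

  // General-dynamic TLS access is a genuine function call: it uses the
  // normal argument/return registers and clobbers the usual caller-saved
  // registers, which is exactly what the faked GET_TLS_ADDR nodes
  // failed to record.
  struct tls_index { unsigned long ti_module, ti_offset; };
  extern "C" void *__tls_get_addr(tls_index *);  // real runtime entry point

  thread_local int tls_var;

  int load_tls_var() {
    // Lowered roughly as:
    //   addr = __tls_get_addr(&tls_index_for_tls_var);  // a real call
    //   return *(int *)addr;
    return tls_var;
  }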
The proper thing to do is to lower these into calls in the first
place. This is relatively straightforward; see the changes to
PPCTargetLowering::LowerGlobalTLSAddress() in PPCISelLowering.cpp.
The changes here are standard call lowering, except that we need to
track the fact that these calls will require a relocation. This is
done by adding a machine operand flag of MO_TLSLD or MO_TLSGD to the
TargetGlobalAddress operand that appears earlier in the sequence.
The calls to LowerCallTo() eventually find their way to
LowerCall_64SVR4() or LowerCall_32SVR4(), which call FinishCall(),
which calls PrepareCall(). In PrepareCall(), we detect the calls to
__tls_get_addr and immediately snag the TargetGlobalTLSAddress with
the annotated relocation information. This becomes an extra operand
on the call following the callee, which is expected for nodes of type
tlscall. We change the call opcode to CALL_TLS for this case. Back
in FinishCall(), we change it again to CALL_NOP_TLS for 64-bit only,
since we require a TOC-restore nop following the call for the 64-bit
ABIs.
During selection, patterns in PPCInstrInfo.td and PPCInstr64Bit.td
convert the CALL_TLS nodes into BL_TLS nodes, and convert the
CALL_NOP_TLS nodes into BL8_NOP_TLS nodes. This replaces the code
removed from PPCAsmPrinter.cpp, as the BL_TLS or BL8_NOP_TLS
nodes can now be emitted normally using their patterns and the
associated printTLSCall print method.
Finally, as a result of these changes, all references to get-tls-addr
in its various guises are no longer used, so they have been removed.
There are existing TLS tests to verify the changes haven't messed
anything up. I've added one new test that verifies that the problem
with the original code has been fixed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221703 91177308-0d34-0410-b5e6-96231b3b80d8
The ISel lowering for global TLS access in PIC mode was creating a pseudo
instruction that is later expanded to a call, but the code was not
setting the hasCalls flag in the MachineFrameInfo alongside the adjustsStack
flag. This caused some functions to be mistakenly recognized as leaf functions,
and this in turn affected the decision to eliminate the frame pointer.
With the fix, hasCalls is properly set and the leaf frame pointer is correctly
preserved.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221695 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This patch enables code generation for the MIPS II target. Pre-MIPS32
targets don't have the MUL instruction, so we add the corresponding
pattern, which uses the MULT/MFLO combination to retrieve the
product.
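A small model of the combination (illustrative only):

  #include <cstdint>

  // On MIPS II, MUL $d,$s,$t is unavailable; MULT writes the full 64-bit
  // product into HI/LO and MFLO reads back the low word, which is the
  // 32-bit product MUL would have produced.
  int32_t mul_via_mult_mflo(int32_t s, int32_t t) {
    int64_t hilo = (int64_t)s * (int64_t)t;  // MULT $s, $t
    return (int32_t)(uint32_t)hilo;          // MFLO $d
  }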
This is WIP as we don't support code generation for select nodes due to
the lack of conditional-move instructions.
Reviewers: dsanders
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6150
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221686 91177308-0d34-0410-b5e6-96231b3b80d8
The canonical name when printing assembly is still $29. The reason is that
GAS does not accept "$hwr_ulr" at the moment.
This addresses the comments from r221307, which reverted the original
commit r221299.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221685 91177308-0d34-0410-b5e6-96231b3b80d8
The original commit r221299 was reverted in r221307. I removed the name
"hwr_ulr" ($29) from the original commit because two tests were failing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221681 91177308-0d34-0410-b5e6-96231b3b80d8
This fixes an issue with matching trunc -> assertsext -> zext on x86-64, which would fail to zero the high 32 bits. See PR20494 for details.
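A hedged C++ rendering of the pattern (mine; see the PR for the actual IR):

  #include <cstdint>

  // trunc -> assertsext -> zext: even though the truncated value is known
  // to be a sign-extension, the final zext must still clear the high
  // 32 bits; eliding it is only sound when the value is non-negative.
  uint64_t trunc_assert_zext(int64_t x) {
    int32_t t = (int32_t)x;        // trunc to i32
    return (uint64_t)(uint32_t)t;  // zext back to i64: high bits must be 0
  }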
Recommitting - This time, with a hopefully working test.
Differential Revision: http://reviews.llvm.org/D6128
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221672 91177308-0d34-0410-b5e6-96231b3b80d8
AVX2 is available.
According to IACA, the new lowering has a throughput of 8 cycles instead of 13
with the previous one.
Although this lowering kicks in on some SPEC benchmarks, the performance
improvement was within the noise.
Correctness testing has been done for the whole range of uint32_t with the
following program:
  // Assumes clang's ext_vector_type extensions (uint4, float4), plus
  // <stdio.h>, <stdint.h>, and <xmmintrin.h> for _mm_movemask_ps.
  uint4 v = (uint4){0, 1, 2, 3};
  uint32_t i;
  // Check correctness over the entire range for uint4 -> float4 conversion.
  for (i = 0; i < 1U << (32 - 2); i++) {
    float4 t = test(v);
    float4 c = correct(v);
    if (0xf != _mm_movemask_ps(t == c)) {
      printf("Error @ %vx: %vf vs. %vf\n", v, c, t);
      return -1;
    }
    v += 4;
  }
Where "correct" is the old lowering and "test" the new one.
The patch adds a test case for the two custom-lowered instructions.
It also modifies the vector cost model, which is why cast.ll and uitofp.ll are
modified.
2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7
instructions instead of 4 (3 more constant loads).
rdar://problem/18153096
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221657 91177308-0d34-0410-b5e6-96231b3b80d8
In the case where we optimize an integer extend away and replace it
directly with the source register, we also have to clear all kill flags
at all of its uses. This is necessary because the original IR
instruction might be trivially dead, but we replaced it with a nop at
the MI level.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221628 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
It currently only implements hasBranchDivergence, and will be extended
in later diffs.
Split from D6188.
Test Plan: make check-all
Reviewers: jholewinski
Reviewed By: jholewinski
Subscribers: llvm-commits, meheff, eliben, jholewinski
Differential Revision: http://reviews.llvm.org/D6195
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221619 91177308-0d34-0410-b5e6-96231b3b80d8
This fixes a few cases of:
* Wrong variable name style.
* Lines longer than 80 columns.
* Repeated names in comments.
* clang-format of the above.
This makes the next patch a lot easier to read.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221615 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
... and after all that refactoring, it's possible to distinguish
softfloat floating-point values from integers, so this patch no longer
breaks softfloat to do it.
Remove direct handling of i32's in the N32/N64 ABI by promoting them to
i64. This more closely reflects the ABI documentation and also fixes
problems with stack arguments on big-endian targets.
We now rely on signext/zeroext annotations (already generated by clang) and
the Assert[SZ]ext nodes to avoid the introduction of unnecessary sign/zero
extends.
It was not possible to convert three tests to use signext/zeroext. These tests
are bswap.ll, ctlz-v.ll, and cttz-v.ll. It's not possible to put signext on a
vector type, so we just accept the sign extends here for now. These tests don't
pass the vectors the same way clang does (clang puts multiple elements in the
same argument, these map 1 element to 1 argument) so we don't need to worry too
much about it.
With this patch, all known N32/N64 bugs should be fixed and we now pass the
first 10,000 tests generated by ABITest.py.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6117
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221534 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
One of the calls to AllocateStack (the one in LowerCall) doesn't look
like it should be there, but it was there before, and removing it breaks
the frame size calculation.
Reviewers: vmedic, theraven
Reviewed By: theraven
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6116
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221529 91177308-0d34-0410-b5e6-96231b3b80d8