Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222632 91177308-0d34-0410-b5e6-96231b3b80d8
Before this patch, the DAGCombiner only tried to convert build_vector dag nodes
into shuffles if all operands were either extract_vector_elt or undef.
This patch improves that logic and teaches the DAGCombiner how to deal with
build_vector dag nodes where one or more operands are zero. A build_vector
dag node with some zero operands is turned into a shuffle only if the resulting
shuffle mask is legal for the target.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222536 91177308-0d34-0410-b5e6-96231b3b80d8
This patch simplifies the logic that combines a pair of shuffle nodes into
a single shuffle if there is a legal mask. Also added comments to better
describe the algorithm. No functional change intended.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222522 91177308-0d34-0410-b5e6-96231b3b80d8
This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - pre-SSSE3 is only enabled on i8 and i16 vector targets where it is a more definite performance gain.
I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr.
Differential Revision: http://reviews.llvm.org/D5699
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222340 91177308-0d34-0410-b5e6-96231b3b80d8
This is to be consistent with StringSet and ultimately with the standard
library's associative container insert function.
This lead to updating SmallSet::insert to return pair<iterator, bool>,
and then to update SmallPtrSet::insert to return pair<iterator, bool>,
and then to update all the existing users of those functions...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222334 91177308-0d34-0410-b5e6-96231b3b80d8
Some optimisations in DAGCombiner cause miscompilations for targets that use
TargetLowering::UndefinedBooleanContent, because they assume that the results
of a SELECT_CC node are boolean values, and can be safely ANDed, ORed and
XORed. These optimisations are only valid for targets that use
ZeroOrOneBooleanContent or ZeroOrNegativeOneBooleanContent.
This is a follow-up to D6210/r221693.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222123 91177308-0d34-0410-b5e6-96231b3b80d8
Indices into the table are stored in each MCRegisterClass instead of a pointer. A new method, getRegClassName, is added to MCRegisterInfo and TargetRegisterInfo to lookup the string in the table.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222118 91177308-0d34-0410-b5e6-96231b3b80d8
This patch teaches the DAGCombiner how to combine shuffles according to rules:
shuffle(shuffle(A, Undef, M0), B, M1) -> shuffle(B, A, M2)
shuffle(shuffle(A, B, M0), B, M1) -> shuffle(B, A, M2)
shuffle(shuffle(A, B, M0), A, M1) -> shuffle(B, A, M2)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222090 91177308-0d34-0410-b5e6-96231b3b80d8
Instead, we're going to separate metadata from the Value hierarchy. See
PR21532.
This reverts commit r221375.
This reverts commit r221373.
This reverts commit r221359.
This reverts commit r221167.
This reverts commit r221027.
This reverts commit r221024.
This reverts commit r221023.
This reverts commit r220995.
This reverts commit r220994.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221711 91177308-0d34-0410-b5e6-96231b3b80d8
What would happen before that commit is that the SDDbgValues associated with
a deallocated SDNode would be marked Invalidated, but SDDbgInfo would keep
a map entry keyed by the SDNode pointer pointing to this list of invalidated
SDDbgNodes. As the memory gets reused, the list might get wrongly associated
with another new SDNode. As the SDDbgValues are cloned when they are transfered,
this can lead to an exponential number of SDDbgValues being produced during
DAGCombine like in http://llvm.org/bugs/show_bug.cgi?id=20893
Note that the previous behavior wasn't really buggy as the invalidation made
sure that the SDDbgValues won't be used. This commit can be considered a
memory optimization and as such is really hard to validate in a unit-test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221709 91177308-0d34-0410-b5e6-96231b3b80d8
LLVM replaces the SelectionDAG pattern (xor (set_cc cc x y) 1) with
(set_cc !cc x y), which is only correct when the xor has type i1.
Instead, we should check that the constant operand to the xor is all
ones.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221693 91177308-0d34-0410-b5e6-96231b3b80d8
This patch improves the folding of vector AND nodes into blend operations for
targets that feature SSE4.1. A vector AND node where one of the operands is
a constant build_vector with elements that are either zero or all-ones can be
converted into a blend.
This allows for example to simplify the following code:
define <4 x i32> @test(<4 x i32> %A, <4 x i32> %B) {
%1 = and <4 x i32> %A, <i32 0, i32 0, i32 0, i32 -1>
%2 = and <4 x i32> %B, <i32 -1, i32 -1, i32 -1, i32 0>
%3 = or <4 x i32> %1, %2
ret <4 x i32> %3
}
Before this patch llc (-mcpu=corei7) generated:
andps LCPI1_0(%rip), %xmm0, %xmm0
andps LCPI1_1(%rip), %xmm1, %xmm1
orps %xmm1, %xmm0, %xmm0
retq
With this patch we generate a single 'vpblendw'.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221343 91177308-0d34-0410-b5e6-96231b3b80d8
call DAGCombiner. But we ran into a case (on Windows) where the
calling convention causes argument lowering to bail out of fast-isel,
and we end up in CodeGenAndEmitDAG() which does run DAGCombiner.
So, we need to make DAGCombiner check for 'optnone' after all.
Commit includes the test that found this, plus another one that got
missed in the original optnone work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221168 91177308-0d34-0410-b5e6-96231b3b80d8
Change `Instruction::getMetadata()` to return `Value` as part of
PR21433.
Update most callers to use `Instruction::getMDNode()`, which wraps the
result in a `cast_or_null<MDNode>`.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221024 91177308-0d34-0410-b5e6-96231b3b80d8
r212242 introduced a legalizer hook, originally to let AArch64 widen
v1i{32,16,8} rather than scalarize, because the legalizer expected, when
scalarizing the result of a conversion operation, to already have
scalarized the operands. On AArch64, v1i64 is legal, so that commit
ensured operations such as v1i32 = trunc v1i64 wouldn't assert.
It did that by choosing to widen v1 types whenever possible. However,
v1i1 types, for which there's no legal widened type, would still trigger
the assert.
This commit fixes that, by only scalarizing a trunc's result when the
operand has already been scalarized, and introducing an extract_elt
otherwise.
This is similar to r205625.
Fixes PR20777.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220937 91177308-0d34-0410-b5e6-96231b3b80d8
Earlier this summer I fixed an issue where we were incorrectly combining
multiple loads that had different constraints such alignment, invariance,
temporality, etc. Apparently in one case I made copt paste error and swapped
alignment and invariance.
Tests included.
rdar://18816719
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220933 91177308-0d34-0410-b5e6-96231b3b80d8
This is a first step for generating SSE rsqrt instructions for
reciprocal square root calcs when fast-math is allowed.
For now, be conservative and only enable this for AMD btver2
where performance improves significantly - for example, 29%
on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c
(if we convert the data type to single-precision float).
This patch adds a two constant version of the Newton-Raphson
refinement algorithm to DAGCombiner that can be selected by any target
via a parameter returned by getRsqrtEstimate()..
See PR20900 for more details:
http://llvm.org/bugs/show_bug.cgi?id=20900
Differential Revision: http://reviews.llvm.org/D5658
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220570 91177308-0d34-0410-b5e6-96231b3b80d8
This adds support for legalization of instructions of the form:
[fp_conv] <1 x i1> %op to <1 x double>
where fp_conv is one of fpto[us]i, [us]itofp. This used to assert
because they were simply missing from the vector operand scalarizer.
A similar problem arose in r190830, with trunc instead.
Fixes PR20778.
Differential Revision: http://reviews.llvm.org/D5810
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220533 91177308-0d34-0410-b5e6-96231b3b80d8
x86's CMPXCHG -> EFLAGS consumer wasn't being recorded as a real EFLAGS
dependency because it was represented by a pair of CopyFromReg(EFLAGS) ->
CopyToReg(EFLAGS) nodes. ScheduleDAG was expecting the source to be an
implicit-def on the instruction, where the result numbers in the DAG and the
Uses list in TableGen matched up precisely.
The Copy notation seems much more robust, so this patch extends ScheduleDAG
rather than refactoring x86.
Should fix PR20376.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220529 91177308-0d34-0410-b5e6-96231b3b80d8
Our metadata scheme lazily assigns IDs to string metadata, but we have a mechanism to preassign them as well. Using a preassigned ID is helpful since we get compile time type checking, and avoid some (minimal) string construction and comparison. This change adds enum value for three existing metadata types:
+ MD_nontemporal = 9, // "nontemporal"
+ MD_mem_parallel_loop_access = 10, // "llvm.mem.parallel_loop_access"
+ MD_nonnull = 11 // "nonnull"
I went through an updated various uses as well. I made no attempt to get all uses; I focused on the ones which were easily grepable and easily to translate. For example, there were several items in LoopInfo.cpp I chose not to update.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220248 91177308-0d34-0410-b5e6-96231b3b80d8
TL;DR: Indexing maps with [] creates missing entries.
The long version:
When selecting lifetime intrinsics, we index the *static* alloca map with the AllocaInst we find for that lifetime. Trouble is, we don't first check to see if this is a dynamic alloca.
On the attached example, this causes a dynamic alloca to create an entry in the static map, and returns 0 (the default) as the frame index for that lifetime. 0 was used for the frame index of the stack protector, which given that it now has a lifetime, is coloured, and merged with other stack slots.
PEI would later trigger an assert because it expects the stack protector to not be dead.
This fix ensures that we only get frame indices for static allocas, ie, those in the map. Dynamic ones are effectively dropped, which is suboptimal, but at least isn't completely broken.
rdar://problem/18672951
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220099 91177308-0d34-0410-b5e6-96231b3b80d8
v2: use dyn_cast
fixup comments
v3: use cast
Reviewed-by: Matt Arsenault <arsenm2@gmail.com>
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220044 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Backends can use setInsertFencesForAtomic to signal to the middle-end that
montonic is the only memory ordering they can accept for
stores/loads/rmws/cmpxchg. The code lowering those accesses with a stronger
ordering to fences + monotonic accesses is currently living in
SelectionDAGBuilder.cpp. In this patch I propose moving this logic out of it
for several reasons:
- There is lots of redundancy to avoid: extremely similar logic already
exists in AtomicExpand.
- The current code in SelectionDAGBuilder does not use any target-hooks, it
does the same transformation for every backend that requires it
- As a result it is plain *unsound*, as it was apparently designed for ARM.
It happens to mostly work for the other targets because they are extremely
conservative, but Power for example had to switch to AtomicExpand to be
able to use lwsync safely (see r218331).
- Because it produces IR-level fences, it cannot be made sound ! This is noted
in the C++11 standard (section 29.3, page 1140):
```
Fences cannot, in general, be used to restore sequential consistency for atomic
operations with weaker ordering semantics.
```
It can also be seen by the following example (called IRIW in the litterature):
```
atomic<int> x = y = 0;
int r1, r2, r3, r4;
Thread 0:
x.store(1);
Thread 1:
y.store(1);
Thread 2:
r1 = x.load();
r2 = y.load();
Thread 3:
r3 = y.load();
r4 = x.load();
```
r1 = r3 = 1 and r2 = r4 = 0 is impossible as long as the accesses are all seq_cst.
But if they are lowered to monotonic accesses, no amount of fences can prevent it..
This patch does three things (I could cut it into parts, but then some of them
would not be tested/testable, please tell me if you would prefer that):
- it provides a default implementation for emitLeadingFence/emitTrailingFence in
terms of IR-level fences, that mimic the original logic of SelectionDAGBuilder.
As we saw above, this is unsound, but the best that can be done without knowing
the targets well (and there is a comment warning about this risk).
- it then switches Mips/Sparc/XCore to use AtomicExpand, relying on this default
implementation (that exactly replicates the logic of SelectionDAGBuilder, so no
functional change)
- it finally erase this logic from SelectionDAGBuilder as it is dead-code.
Ideally, each target would define its own override for emitLeading/TrailingFence
using target-specific fences, but I do not know the Sparc/Mips/XCore memory model
well enough to do this, and they appear to be dealing fine with the ARM-inspired
default expansion for now (probably because they are overly conservative, as
Power was). If anyone wants to compile fences more agressively on these
platforms, the long comment should make it clear why he should first override
emitLeading/TrailingFence.
Test Plan: make check-all, no functional change
Reviewers: jfb, t.p.northover
Subscribers: aemerson, llvm-commits
Differential Revision: http://reviews.llvm.org/D5474
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219957 91177308-0d34-0410-b5e6-96231b3b80d8
This patch changes the fast-math implementation for calculating sqrt(x) from:
y = 1 / (1 / sqrt(x))
to:
y = x * (1 / sqrt(x))
This has 2 benefits: less code / faster code and one less estimate instruction
that may lose precision.
The only target that will be affected (until http://reviews.llvm.org/D5658 is approved)
is PPC. The difference in codegen for PPC is 2 less flops for a single-precision sqrtf
or vector sqrtf and 4 less flops for a double-precision sqrt.
We also eliminate a constant load and extra register usage.
Differential Revision: http://reviews.llvm.org/D5682
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219445 91177308-0d34-0410-b5e6-96231b3b80d8