Increasingly I don't want to mix the integer and floating point tests,
especially with AVX where they are handled quite differently.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218233 91177308-0d34-0410-b5e6-96231b3b80d8
a more sane approach to AVX2 support.
Fundamentally, there is no useful way to lower integer vectors in AVX.
None. We always end up with a VINSERTF128 in the end, so we might as
well eagerly switch to the floating point domain and do everything
there. This cleans up lots of weird and unlikely to be correct
differences between integer and floating point shuffles when we only
have AVX1.
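For a concrete picture (my own illustration, not a test from the patch), even a simple in-lane 256-bit integer shuffle like this one has no dedicated AVX1 integer lowering and ends up in the floating point domain anyway:

```
define <8 x i32> @v8i32_in_lane(<8 x i32> %a) {
  ; Reverse each 128-bit lane; with only AVX1 this goes through
  ; floating point domain instructions (e.g. vpermilps) rather than
  ; any 256-bit integer shuffle.
  %s = shufflevector <8 x i32> %a, <8 x i32> undef,
       <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4>
  ret <8 x i32> %s
}
```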
The other nice consequence is that by doing things this way we will make
it much easier to write the integer lowering routines as we won't need
to duplicate the logic to check for AVX vs. AVX2 in each one -- if we
actually try to lower a 256-bit vector as an integer vector, we have
AVX2 and can rely on it. I think this will make the code much simpler
and more comprehensible.
Currently, I've disabled *all* support for AVX2 so that we always fall
back to AVX. This keeps everything working rather than asserting. That
will go away with the subsequent series of patches that provide
a baseline AVX2 implementation.
Please note, I'm going to implement AVX2 *without access to hardware*.
That means I cannot correctness test this path. I will be relying on
those with access to AVX2 hardware to do correctness testing and fix
bugs here, but as a courtesy I'm trying to sketch out the framework for
the new-style vector shuffle lowering in the context of the AVX2 ISA.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218228 91177308-0d34-0410-b5e6-96231b3b80d8
input v8f32 shuffles which are not 128-bit lane crossing but have
different shuffle patterns in the low and high lanes. This removes most
of the extract/insert traffic that was unnecessary and is particularly
good at lowering cases where only one of the two lanes is shuffled at
all.
I've also added a collection of test cases with undef lanes because this
lowering is somewhat more sensitive to undef lanes than others.
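Something like the following sketch (illustrative IR, not one of the new test cases verbatim) is the shape being targeted: no 128-bit lane crossing, different patterns in the two lanes, and an undef lane element thrown in:

```
define <8 x float> @per_lane(<8 x float> %a) {
  ; low lane uses pattern <1,0,3,2>, high lane uses <4,4,undef,6>;
  ; no element crosses a 128-bit lane boundary.
  %s = shufflevector <8 x float> %a, <8 x float> undef,
       <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 4, i32 4, i32 undef, i32 6>
  ret <8 x float> %s
}
```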
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218226 91177308-0d34-0410-b5e6-96231b3b80d8
in the high and low 128-bit lanes of a v8f32 vector.
No functionality change yet, but wanted to set up the baseline for my
next patch which will make these quite a bit better. =]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218224 91177308-0d34-0410-b5e6-96231b3b80d8
lowering when it can use a symmetric SHUFPS across both 128-bit lanes.
This required making the SHUFPS lowering tolerant of other vector types,
and adjusting our canonicalization to canonicalize harder.
This is the last of the clever uses of symmetry I've thought of for
v8f32. The rest of the tricks I'm aware of here are to work around
asymmetry in the mask.
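A sketch of the kind of mask this catches (my example, not from the test suite): the same SHUFPS pattern repeated in both lanes, picking two elements from each input per lane:

```
define <8 x float> @shufps_symmetric(<8 x float> %a, <8 x float> %b) {
  ; per lane: two elements from %a then two from %b, with the same
  ; pattern in both lanes, so one SHUFPS immediate covers the vector.
  %s = shufflevector <8 x float> %a, <8 x float> %b,
       <8 x i32> <i32 0, i32 2, i32 8, i32 10, i32 4, i32 6, i32 12, i32 14>
  ret <8 x float> %s
}
```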
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218216 91177308-0d34-0410-b5e6-96231b3b80d8
of a single element into a zero vector for v4f64 and v4i64 in AVX.
Ironically, there is less to see here because xor+blend is so crazy fast
that we can't really beat that to zero the high 128-bit lane.
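The pattern in question looks roughly like this (illustrative IR):

```
define <4 x double> @insert_into_zero(double %d) {
  ; a single element into lane zero of a zero vector; xor + blend
  ; already handles zeroing the high 128-bit lane very well.
  %v = insertelement <4 x double> zeroinitializer, double %d, i32 0
  ret <4 x double> %v
}
```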
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218214 91177308-0d34-0410-b5e6-96231b3b80d8
UNPCKHPS with AVX vectors by recognizing those patterns when they are
repeated for both 128-bit lanes.
With this, we now generate the exact same (really nice) code for
Quentin's avx_test_case.ll which was the most significant regression
reported for the new shuffle lowering. In fact, I'm out of specific test
cases for AVX lowering; the rest were AVX2, I think. However, there are
a bunch of pretty obvious remaining things to improve with AVX...
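For reference, a repeated-lane UNPCKLPS pattern looks like this (my sketch):

```
define <8 x float> @unpcklps_both_lanes(<8 x float> %a, <8 x float> %b) {
  ; interleave the low halves of each 128-bit lane: a0,b0,a1,b1 in the
  ; low lane and a4,b4,a5,b5 in the high lane -- one VUNPCKLPS.
  %s = shufflevector <8 x float> %a, <8 x float> %b,
       <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 4, i32 12, i32 5, i32 13>
  ret <8 x float> %s
}
```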
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218213 91177308-0d34-0410-b5e6-96231b3b80d8
important bits of cleverness: to detect and lower repeated shuffle
patterns between the two 128-bit lanes with a single instruction.
This patch just teaches it how to lower single-input shuffles that fit
this model using VPERMILPS. =] There is more that needs to happen here.
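A single-input example of the model (illustrative, not from the tests):

```
define <8 x float> @vpermilps_like(<8 x float> %a) {
  ; the pattern <3,1,0,2> repeated in both lanes maps to a single
  ; VPERMILPS with an immediate.
  %s = shufflevector <8 x float> %a, <8 x float> undef,
       <8 x i32> <i32 3, i32 1, i32 0, i32 2, i32 7, i32 5, i32 4, i32 6>
  ret <8 x float> %s
}
```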
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218211 91177308-0d34-0410-b5e6-96231b3b80d8
generating the test cases to format things more consistently and
actually catch all the operand sequences that should be elided in favor
of the asm comments. No actual changes here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218210 91177308-0d34-0410-b5e6-96231b3b80d8
VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows
down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the
reciprocal throughput on Ivy Bridge CPUs, much as it does everywhere
for the 128-bit variant. There isn't a downside, so just eagerly use this
instruction when it suffices.
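The shuffles in question are the position-preserving two-input kind, e.g. (my sketch):

```
define <4 x double> @blendpd_like(<4 x double> %a, <4 x double> %b) {
  ; every element keeps its position and just picks a source, so this
  ; is a VBLENDPD rather than a VSHUFPD.
  %s = shufflevector <4 x double> %a, <4 x double> %b,
       <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x double> %s
}
```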
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218208 91177308-0d34-0410-b5e6-96231b3b80d8
This expands the integer cases to cover the fact that AVX2 moves these
lane-crossing shuffles into the integer domain. It also adds proper
support for AVX2 run lines and the "ALL" group when it doesn't matter.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218206 91177308-0d34-0410-b5e6-96231b3b80d8
actual support for complex AVX shuffling tricks. We can do independent
blends of the low and high 128-bit lanes of an AVX vector, so shuffle
the inputs into place and then do the blend at 256 bits. This will in
many cases remove one blend instruction.
The next step is to permute the low and high halves in-place rather than
extracting them and re-inserting them.
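As an illustration of the new strategy (IR of my own devising):

```
define <8 x float> @shuffle_then_blend(<8 x float> %a, <8 x float> %b) {
  ; %a's low lane needs an in-lane swap first; afterwards every result
  ; element is in position, so one 256-bit VBLENDPS finishes the job.
  %s = shufflevector <8 x float> %a, <8 x float> %b,
       <8 x i32> <i32 1, i32 0, i32 10, i32 11, i32 4, i32 5, i32 14, i32 15>
  ret <8 x float> %s
}
```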
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218202 91177308-0d34-0410-b5e6-96231b3b80d8
under AVX.
This really just documents the current state of the world. I'm going to
try to flesh it out to cover any test cases I plan to improve prior to
improving them so that the delta made by changes is actually visible to
code reviewers.
This is made easier by the fact that I now have a script to automate the
process of producing test cases including the check lines. =]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218199 91177308-0d34-0410-b5e6-96231b3b80d8
single-input shuffles with doubles. This allows them to fold memory
operands into the shuffle, etc. This is just the analog to the v4f32
case in my prior commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218193 91177308-0d34-0410-b5e6-96231b3b80d8
instruction for single-vector floating point shuffles. This in turn
allows the shuffles to fold a load into the instruction which is one of
the common regressions hit with the new shuffle lowering.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218190 91177308-0d34-0410-b5e6-96231b3b80d8
duplication of check lines. The idea is to have broad sets of
compilation modes that will frequently diverge without having to always
and immediately explode to the precise ISA feature set.
While this already helps due to VEX encoded differences, it will help
much more as I teach the new shuffle lowering about more of the new VEX
encoded instructions which can still be used to implement 128-bit
shuffles.
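The RUN lines follow this general shape (the exact flags here are illustrative):

```
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 \
; RUN:     | FileCheck %s --check-prefix=ALL --check-prefix=SSE2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx \
; RUN:     | FileCheck %s --check-prefix=ALL --check-prefix=AVX1
```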
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218188 91177308-0d34-0410-b5e6-96231b3b80d8
tricky case of single-element insertion into the zero lane of a zero
vector.
We can't just use the same pattern here as we do in every other vector
type because the general insertion logic can handle insertion into the
non-zero lane of the vector. However, in SSE4.1 with v4f32 vectors we
have INSERTPS, which is a much better choice than the generic one for such
lowerings. But INSERTPS can do lots of other lowerings as well, so
factoring its logic into the general insertion logic doesn't work very
well. We also can't just extract the core common part of the general
insertion logic that is faster (forming VZEXT_MOVL synthetic nodes that
lower to MOVSS when they can) because VZEXT_MOVL is often *faster* than
a blend while INSERTPS is slower! So instead we place a restrictive
condition on attempting to use the generic insertion logic, narrowing it
to those cases where VZEXT_MOVL won't need a shuffle afterward and thus
will do better than INSERTPS. Then we try blending. Then we go back to
INSERTPS.
This still doesn't generate perfect code for some silly reasons that can
be fixed by tweaking the td files for lowering VZEXT_MOVL to use
XORPS+BLENDPS when available rather than XORPS+MOVSS when the input ends
up in a register rather than a load from memory -- BLENDPSrr has twice
the reciprocal throughput of MOVSSrr. Don't you love this ISA?
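The tricky case itself is tiny (sketch):

```
define <4 x float> @insert_lane0_of_zero(float %f) {
  ; a single element into the zero lane of a zero vector; depending on
  ; whether %f comes from a register or from memory, this wants
  ; XORPS+BLENDPS, XORPS+MOVSS, or INSERTPS.
  %v = insertelement <4 x float> zeroinitializer, float %f, i32 0
  ret <4 x float> %v
}
```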
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218177 91177308-0d34-0410-b5e6-96231b3b80d8
floating point types and use it for both v2f64 and v2i64 single-element
insertion lowering.
This fixes the last non-AVX performance regression test case I've gotten
hold of for the new vector shuffle lowering. There is an obvious analogous
lowering for v4f32 that I'll add in a follow-up patch (because with
INSERTPS, v4f32 requires special treatment). After that, it's AVX stuff.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218175 91177308-0d34-0410-b5e6-96231b3b80d8
When looking through sign/zero-extensions, the code would always assume there
was such an extension instruction and use the wrong operand for the address.
There was also a minor issue in the handling of 'AND' instructions: I
accidentally used a 'cast' instead of a 'dyn_cast'.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218161 91177308-0d34-0410-b5e6-96231b3b80d8
lowering to support both anyext and zext and to custom lower for many
different microarchitectures.
Using this allows us to get *exactly* the right code for zext and anyext
shuffles in all the vector sizes. For v16i8, the improvement is *huge*.
The new SSE2 test case is one I refused to add before this because it was
sooooo many instructions.
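The zext-ish shuffles in question interleave with zeros, along these lines (my sketch):

```
define <16 x i8> @zext_ish(<16 x i8> %a) {
  ; interleave the low 8 bytes of %a with zero bytes -- the bit
  ; pattern of a v8i8 -> v8i16 zero extension.
  %s = shufflevector <16 x i8> %a, <16 x i8> zeroinitializer,
       <16 x i32> <i32 0, i32 16, i32 1, i32 16, i32 2, i32 16, i32 3, i32 16,
                   i32 4, i32 16, i32 5, i32 16, i32 6, i32 16, i32 7, i32 16>
  ret <16 x i8> %s
}
```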
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218143 91177308-0d34-0410-b5e6-96231b3b80d8
The heuristic used by DAGCombine to form FMAs checks that the FMUL has only one
use, but this is overly conservative on some systems. Specifically, if the FMA
and the FADD have the same latency (and the FMA does not compete for resources
with the FMUL any more than the FADD does), there is no need for the
restriction, and furthermore, forming the FMA leaving the FMUL can still allow
for higher overall throughput and decreased critical-path length.
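A sketch of the situation (not from the patch): the fmul below has a second use, so the one-use check would block fusion even though forming an FMA for %s can shorten the critical path:

```
define double @two_uses(double %a, double %b, double %c, double %d) {
  %m = fmul fast double %a, %b
  %s = fadd fast double %m, %c   ; candidate for fma(%a, %b, %c)
  %u = fsub fast double %m, %d   ; %m stays alive for this use
  %r = fmul fast double %s, %u
  ret double %r
}
```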
Here we add a new TLI callback, enableAggressiveFMAFusion, false by default, to
elide the hasOneUse check. This is enabled for PowerPC by default, as most
PowerPC systems will benefit.
Patch by Olivier Sallenave, thanks!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218120 91177308-0d34-0410-b5e6-96231b3b80d8
to undef lanes as well as defined widenable lanes. This dramatically
improves the lowering we use for undef-shuffles in a zext-ish pattern
for SSE2.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218115 91177308-0d34-0410-b5e6-96231b3b80d8
Not sure why I only did SSSE3 here. Also, I've left out some of the SSE2
ones because the shuffles are so absurd it's not worth transcribing
them. Will try to fix them to be sane and then check them.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218114 91177308-0d34-0410-b5e6-96231b3b80d8
shuffles that are zext-ing.
Not a lot to see here; the undef lane variant is better handled with
pshufd, but this improves the actual zext pattern.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218112 91177308-0d34-0410-b5e6-96231b3b80d8
to the new vector shuffle lowering code.
This allows us to emit PMOVZX variants consistently for patterns where
it is a viable lowering. This instruction is both fast and allows us to
fold loads into it. This only hooks the new lowering up for i16 and i8
element widths, mostly so I could manage the change to the tests. I'll
add the i32 one next, although it is significantly less interesting.
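For i16 elements, the pattern being matched looks like this (illustrative):

```
define <8 x i16> @pmovzx_ish(<8 x i16> %a) {
  ; interleave the low 4 words of %a with zero words -- a natural
  ; fit for PMOVZXWD when it is available.
  %s = shufflevector <8 x i16> %a, <8 x i16> zeroinitializer,
       <8 x i32> <i32 0, i32 8, i32 1, i32 8, i32 2, i32 8, i32 3, i32 8>
  ret <8 x i16> %s
}
```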
One thing to note is that we already had some tests for these patterns
but those tests had far less horrible instructions. The problem is that
those tests weren't checking the strict start and end of the instruction
sequence. =[ As a consequence something changed in the lowering making
us generate *TERRIBLE* code for these patterns in SSE2 through SSSE3.
I've consolidated all of the tests and spelled out the madness that we
currently emit for these shuffles. I'm going to try to figure out what
has gone wrong here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218102 91177308-0d34-0410-b5e6-96231b3b80d8
With this optimization, we will not always insert a zext for values crossing
basic blocks; instead, we insert a sext when the users of a value that
crosses a basic block prefer a signed predicate.
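Roughly (a hand-written sketch, not a test from the patch):

```
define i1 @prefers_sext(i8 %x, i1 %c) {
entry:
  br i1 %c, label %bb, label %exit
bb:
  ; the cross-block user is a signed comparison, so extending
  ; %x with sext is the better choice here.
  %e = sext i8 %x to i32
  %cmp = icmp slt i32 %e, 100
  ret i1 %cmp
exit:
  ret i1 false
}
```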
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218101 91177308-0d34-0410-b5e6-96231b3b80d8
The fix is slightly different than the x86 one (see r216117) because the number of values
attached to a return can vary even for a single returned value (e.g., f64 yields
two returned values).
<rdar://problem/18352998>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218076 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This patch was originally in D5304 (I could not find a way to reopen that revision).
It was accepted and committed, but it broke the build bots because the overloading of
the constructor of ArrayRef for braced initializer lists is not supported by all
toolchains. I then reverted it, and propose this fixed version that uses a plain
C array instead in makeDMB (that array is then converted implicitly to an
ArrayRef, but that is not behind an ifdef). Could someone confirm whether
initializer lists for plain C arrays are supported by every toolchain used
to build llvm? Otherwise I can just initialize the array in the old way:
args[0] = ...; .. ; args[5] = ...;
Below is the description of the original patch:
```
I had only tested this code for ARMv7 and ARMv8. This patch adds several
fallback paths if the processor does not support dmb ish:
- dmb sy on a Cortex-M with support for dmb
- mcr p15, #0, r0, c7, c10, #5 on ARMv6 (a special instruction equivalent to a DMB)
These fallback paths were chosen based on the code for fence seq_cst.
Thanks to luqmana for having noticed this bug.
```
Test Plan: Added more cases to atomic-load-store.ll + make check-all
Reviewers: jfb, t.p.northover, luqmana
Subscribers: llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D5386
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218066 91177308-0d34-0410-b5e6-96231b3b80d8
There is no purpose in using it for single-input shuffles as
pshufd is just as fast and doesn't tie the two operands. This removes
a substantial amount of wrong-domain blend operations in SSSE3 mode. It
also completes the usage of PALIGNR for integer shuffles and addresses
one of the test cases Quentin hit with the new vector shuffle lowering.
There is still the question of whether and when to use this for floating
point shuffles. It is faster than shufps or shufpd but in the integer
domain. I don't yet really have a good heuristic here for when to use
this instruction for floating point vectors.
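The two-input integer shuffles where PALIGNR does pay off look like this (my example):

```
define <16 x i8> @palignr_ish(<16 x i8> %a, <16 x i8> %b) {
  ; a contiguous 16-byte window across the concatenation of the two
  ; inputs -- exactly what PALIGNR with an immediate of 4 produces.
  %s = shufflevector <16 x i8> %a, <16 x i8> %b,
       <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11,
                   i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
  ret <16 x i8> %s
}
```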
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218038 91177308-0d34-0410-b5e6-96231b3b80d8
When folding the intrinsic flag into the branch or select, we also have to
consider whether the intrinsic got simplified, because that changes the
flag we have to check for.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218034 91177308-0d34-0410-b5e6-96231b3b80d8
Small optimization in 'simplifyAddress'. When the offset cannot be encoded in
the load/store instruction, then we need to materialize the address manually.
The add instruction can encode a wider range of immediates than the load/store
instructions. This change tries to fold the offset into the add instruction
first before materializing the offset in a register.
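For instance, in 3.5-era IR (the offset value is illustrative):

```
define i8 @big_offset(i8* %p) {
  ; 8192 doesn't fit the ldrb immediate offset field, but the AArch64
  ; add immediate can encode it (#2, lsl #12), so fold the offset into
  ; the add instead of materializing it in a spare register first.
  %q = getelementptr i8* %p, i64 8192
  %v = load i8* %q
  ret i8 %v
}
```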
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218031 91177308-0d34-0410-b5e6-96231b3b80d8
The 'AND' instruction could be used to mask out the lower 32 bits of a register.
If this is done inside an address computation we might be able to fold the
instruction into the memory instruction itself.
  and  x1, x1, #0xffffffff
  ldrb x0, [x0, x1]
    --->
  ldrb x0, [x0, w1, uxtw]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218030 91177308-0d34-0410-b5e6-96231b3b80d8