AVX-512.
There is no interesting logic yet. Everything ends up eventually
delegating to the generic code to split the vector and shuffle the
halves. Interestingly, that logic does a significantly better job of
lowering all of these types than the generic vector expansion code does.
Mostly, it lets most of the cases fall back to nice AVX2 code rather
than all the way back to SSE code paths.
Step 2 of basic AVX-512 support in the new vector shuffle lowering. Next
up will be to incrementally add direct support for the basic instruction
set to each type (adding tests first).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218585 91177308-0d34-0410-b5e6-96231b3b80d8
assertion, making the name generic, and improving the documentation.
Step 1 in adding very primitive support for AVX-512. No functionality
changed yet.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218584 91177308-0d34-0410-b5e6-96231b3b80d8
vectors.
Someone will need to build the AVX512 lowering, which should follow
AVX1 and AVX2 *very* closely for AVX512F and AVX512BW resp. I've added
a dummy test which is a port of the v8f32 and v8i32 tests from AVX and
AVX2 to v8f64 and v8i64 tests for AVX512F and AVX512BW. Hopefully this
is enough information for someone to implement proper lowering here. If
not, I'll be happy to help, but right now the AVX-512 support isn't
a priority for me.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218583 91177308-0d34-0410-b5e6-96231b3b80d8
lowerings.
This was hopelessly broken. First, the x86 backend wants '-1' to be the
element value representing true in a boolean vector, and second the
operand order for VSELECT is backwards from the actual x86 instructions.
To make matters worse, the backend is just using '-1' as the true value
to get the high bit to be set. It doesn't actually symbolically map the
'-1' to anything. But on x86 this isn't quite how it works: there *only*
the high bit is relevant. As a consequence weird non-'-1' values like
0x80 actually "work" once you flip the operands to be backwards.
Anyways, thanks to Hal for helping me sort out what these *should* be.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218582 91177308-0d34-0410-b5e6-96231b3b80d8
new vector shuffle target DAG combines -- it helps to actually test for
the value you want rather than just using an integer in a boolean
context.
Have I mentioned that I loathe implicit conversions recently? :: sigh ::
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218576 91177308-0d34-0410-b5e6-96231b3b80d8
of widening masks.
We can't widen a zeroing mask unless both elements that would be merged
are either zeroed or undef. This is the only way to widen a mask if it
has a zeroed element.
Also clean up the code here by ordering the checks in a more logical way
and by using the symoblic values for undef and zero. I'm actually torn
on using the symbolic values because the existing code is littered with
the assumption that -1 is undef, and moreover that entries '< 0' are the
special entries. While that works with the values given to these
constants, using the symbolic constants actually makes it a bit more
opaque why this is the case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218575 91177308-0d34-0410-b5e6-96231b3b80d8
I spotted this by inspection when debugging something else, so I have no
test case what-so-ever, and am not even sure it is possible to
realistically trigger the bug. But this is what was intended here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218565 91177308-0d34-0410-b5e6-96231b3b80d8
and in the target shuffle combining when trying to widen vector
elements.
Previously only one of these was correct, and we didn't correctly
propagate zeroing target shuffle masks (which have a different sentinel
value from undef in non- target shuffle masks now). This isn't just
a missed optimization, this caused us to drop zeroing shuffles on the
floor and miscompile code. The added test case is one example of that.
There are other fixes to the test suite as a consequence of this as well
as restoring the undef elements in some of the masks that were lost when
I brought sanity to the actual *value* of the undef and zero sentinels.
I've also just cleaned up some of the PSHUFD and PSHUFLW and PSHUFHW
combining code, but that code really needs to go. It was a nice initial
attempt, but it isn't very principled and the recursive shuffle combiner
is much more powerful.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218562 91177308-0d34-0410-b5e6-96231b3b80d8
that managed to elude all of my fuzz testing historically. =/
Something changed to allow this code path to actually be exercised and
it was doing bad things. It is especially heavily exercised by the
patterns that emerge when doing AVX shuffles that end up lowered through
the 128-bit code path.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218540 91177308-0d34-0410-b5e6-96231b3b80d8
layer of tie-breaking sorting, it really helps to check that you're in
a tie first. =] Otherwise the whole thing cycles infinitely. Test case
added, another one found through fuzz testing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218523 91177308-0d34-0410-b5e6-96231b3b80d8
AVX support.
New test cases included. Note that none of the existing test cases
covered these buggy code paths. =/ Also, it is clear from this that
SHUFPS and SHUFPD are the most bug prone shuffle instructions in x86. =[
These were all detected by fuzz-testing. (I <3 fuzz testing.)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218522 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
I originally tried doing this specifically for X86 in the backend in D5091,
but it was rather brittle and generally running too late to be general.
Furthermore, other targets may want to implement similar optimizations.
So I reimplemented it at the IR-level, fitting it into AtomicExpandPass
as it interacts with that pass (which could not be cleanly done before
at the backend level).
This optimization relies on a new target hook, which is only used by X86
for now, as the correctness of the optimization on other targets remains
an open question. If it is found correct on other targets, it should be
trivial to enable for them.
Details of the optimization are discussed in D5091.
Test Plan: make check-all + a new test
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5422
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218455 91177308-0d34-0410-b5e6-96231b3b80d8
v4f64 and v8f32 shuffles when they are lane-crossing. We have fully
general lane-crossing permutation functions in AVX2 that make this easy.
Part of this also changes exactly when and how these vectors are split
up when we don't have AVX2. This isn't always a win but it usually is
a win, so on the balance I think its better. The primary regressions are
all things that just need to be fixed anyways such as modeling when
a blend can be completely accomplished via VINSERTF128, etc.
Also, this highlights one of the few remaining big features: we do
a really poor job of inserting elements into AVX registers efficiently.
This completes almost all of the big tricks I have in mind for AVX2. The
only things left that I plan to add:
1) element insertion smarts
2) palignr and other fairly specialized lowerings when they happen to
apply
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218449 91177308-0d34-0410-b5e6-96231b3b80d8
256-bit vectors with lane-crossing.
Rather than immediately decomposing to 128-bit vectors, try flipping the
256-bit vector lanes, shuffling them and blending them together. This
reduces our worst case shuffle by a pretty significant margin across the
board.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218446 91177308-0d34-0410-b5e6-96231b3b80d8
lowering where it only used the mask of the low 128-bit lane rather than
the entire mask.
This allows the new lowering to correctly match the unpack patterns for
v8i32 vectors.
For reference, the reason that we check for the the entire mask rather
than checking the repeated mask is because the repeated masks don't
abide by all of the invariants of normal masks. As a consequence, it is
safer to use the full mask with functions like the generic equivalence
test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218442 91177308-0d34-0410-b5e6-96231b3b80d8
reduce the amount of checking we do here.
The first realization is that only non-crossing cases between 128-bit
lanes are handled by almost the entire function. It makes more sense to
handle the crossing cases first.
THe second is that until we actually are going to generate fancy shared
lowering strategies that use the repeated semantics of the v8i16
lowering, we should waste time checking for repeated masks. It is
simplest to directly test for the entire unpck masks anyways, so we
gained nothing from this.
This also matches the structure of v32i8 more closely.
No functionality changed here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218441 91177308-0d34-0410-b5e6-96231b3b80d8
lowering.
This completes the basic AVX2 feature support, but there are still some
improvements I'd like to do to really get the last mile of performance
here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218440 91177308-0d34-0410-b5e6-96231b3b80d8
for this now.
Should prevent folks from running afoul of this and not knowing why
their code won't instruction select the way I just did...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218436 91177308-0d34-0410-b5e6-96231b3b80d8
missing test cases for it.
Unsurprisingly, without test cases, there were bugs here. Surprisingly,
this bug wasn't caught at compile time. Yep, there is an X86ISD::BLENDV.
It isn't wired to anything. Oops. I'll fix than next.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218434 91177308-0d34-0410-b5e6-96231b3b80d8
lowering.
This also implements the fancy blend lowering for v16i16 using AVX2 and
teaches the X86 backend to print shuffle masks for 256-bit PSHUFB
and PBLENDW instructions. It also makes the mask decoding correct for
PBLENDW instructions. The yaks, they are legion.
Tests are updated accordingly. There are some missing tests for the
VBLENDVB lowering, but I'll add those in a follow-up as this commit has
accumulated enough cruft already.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218430 91177308-0d34-0410-b5e6-96231b3b80d8
into unblended shuffles and a blend.
This is the consistent fallback for the lowering paths that have fast
blend operations available, and its getting quite repetitive.
No functionality changed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218399 91177308-0d34-0410-b5e6-96231b3b80d8
the native AVX2 instructions.
Note that the test case is really frustrating here because VPERMD
requires the mask to be in the register input and we don't produce
a comment looking through that to the constant pool. I'm going to
attempt to improve this in a subsequent commit, but not sure if I will
succeed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218347 91177308-0d34-0410-b5e6-96231b3b80d8
detection. It was incorrectly handling undef lanes by actually treating
an undef lane in the first 128-bit lane as a *numeric* shuffle value.
Fortunately, this almost always DTRT and disabled detecting repeated
patterns. But not always. =/ This patch introduces a much more
principled approach and fixes the miscompiles I spotted by inspection
previously.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218346 91177308-0d34-0410-b5e6-96231b3b80d8
shuffles using the AVX2 instructions. This is the first step of cutting
in real AVX2 support.
Note that I have spotted at least one bug in the test cases already, but
I suspect it was already present and just is getting surfaced. Will
investigate next.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218338 91177308-0d34-0410-b5e6-96231b3b80d8
add VPBLENDD to the InstPrinter's comment generation so we get nice
comments everywhere.
Now that we have the nice comments, I can see the bug introduced by
a silly typo in the commit that enabled VPBLENDD, and have fixed it. Yay
tests that are easy to inspect.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218335 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
AtomicExpand already had logic for expanding wide loads and stores on LL/SC
architectures, and for expanding wide stores on CmpXchg architectures, but
not for wide loads on CmpXchg architectures. This patch fills this hole,
and makes use of this new feature in the X86 backend.
Only one functionnal change: we now lose the SynchScope attribute.
It is regrettable, but I have another patch that I will submit soon that will
solve this for all of AtomicExpand (it seemed better to split it apart as it
is a different concern).
Test Plan: make check-all (lots of tests for this functionality already exist)
Reviewers: jfb
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D5404
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218332 91177308-0d34-0410-b5e6-96231b3b80d8
VPBLENDD where appropriate even on 128-bit vectors.
According to Agner's tables, this instruction is significantly higher
throughput (can execute on any port) on Haswell chips so we should
aggressively try to form it when available.
Sadly, this loses our delightful shuffle comments. I'll add those back
for VPBLENDD next.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218322 91177308-0d34-0410-b5e6-96231b3b80d8
undef in the shuffle mask. This shows up when we're printing comments
during lowering and we still have an IR-level constant hanging around
that models undef.
A nice consequence of this is *much* prettier test cases where the undef
lanes actually show up as undef rather than as a particular set of
values. This also allows us to print shuffle comments in cases that use
undef such as the recently added variable VPERMILPS lowering. Now those
test cases have nice shuffle comments attached with their details.
The shuffle lowering for PSHUFB has been augmented to use undef, and the
shuffle combining has been augmented to comprehend it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218301 91177308-0d34-0410-b5e6-96231b3b80d8
trick that I missed.
VPERMILPS has a non-immediate memory operand mode that allows it to do
asymetric shuffles in the two 128-bit lanes. Use this rather than two
shuffles and a blend.
However, it turns out the variable shuffle path to VPERMILPS (and
VPERMILPD, although that one offers no functional differenc from the
immediate operand other than variability) wasn't even plumbed through
codegen. Do such plumbing so that we can reasonably emit
a variable-masked VPERMILP instruction. Also plumb basic comment parsing
and printing through so that the tests are reasonable.
There are still a few tests which don't show the shuffle pattern. These
are tests with undef lanes. I'll teach the shuffle decoding and printing
to handle undef mask entries in a follow-up. I've looked at the masks
and they seem reasonable.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218300 91177308-0d34-0410-b5e6-96231b3b80d8
td pattern). Currently we only model the immediate operand variation of
VPERMILPS and VPERMILPD, we should make that clear in the pseudos used.
Will be adding support for the variable mask variant in my next commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218282 91177308-0d34-0410-b5e6-96231b3b80d8
We generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors.
This patch should preserve all existing behavior with regular optimization levels,
but also use splats whenever possible when optimizing for *size* on any CPU with AVX or AVX2.
The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save
at least 8 bytes (up to 31 bytes) of constant pool data.
Differential Revision: http://reviews.llvm.org/D5347
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218263 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Update segmented-stacks*.ll tests with x32 target case and make
corresponding changes to make them pass.
Test Plan: tests updated with x32 target
Reviewers: nadav, rafael, dschuff
Subscribers: llvm-commits, zinovy.nis
Differential Revision: http://reviews.llvm.org/D5245
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218247 91177308-0d34-0410-b5e6-96231b3b80d8
a more sane approach to AVX2 support.
Fundamentally, there is no useful way to lower integer vectors in AVX.
None. We always end up with a VINSERTF128 in the end, so we might as
well eagerly switch to the floating point domain and do everything
there. This cleans up lots of weird and unlikely to be correct
differences between integer and floating point shuffles when we only
have AVX1.
The other nice consequence is that by doing things this way we will make
it much easier to write the integer lowering routines as we won't need
to duplicate the logic to check for AVX vs. AVX2 in each one -- if we
actually try to lower a 256-bit vector as an integer vector, we have
AVX2 and can rely on it. I think this will make the code much simpler
and more comprehensible.
Currently, I've disabled *all* support for AVX2 so that we always fall
back to AVX. This keeps everything working rather than asserting. That
will go away with the subsequent series of patches that provide
a baseline AVX2 implementation.
Please note, I'm going to implement AVX2 *without access to hardware*.
That means I cannot correctness test this path. I will be relying on
those with access to AVX2 hardware to do correctness testing and fix
bugs here, but as a courtesy I'm trying to sketch out the framework for
the new-style vector shuffle lowering in the context of the AVX2 ISA.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218228 91177308-0d34-0410-b5e6-96231b3b80d8
input v8f32 shuffles which are not 128-bit lane crossing but have
different shuffle patterns in the low and high lanes. This removes most
of the extract/insert traffic that was unnecessary and is particularly
good at lowering cases where only one of the two lanes is shuffled at
all.
I've also added a collection of test cases with undef lanes because this
lowering is somewhat more sensitive to undef lanes than others.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218226 91177308-0d34-0410-b5e6-96231b3b80d8
the new vector shuffle lowering no longer needs to check both symmetric
forms of UNPCK patterns for v4f64.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218217 91177308-0d34-0410-b5e6-96231b3b80d8
lowering when it can use a symmetric SHUFPS across both 128-bit lanes.
This required making the SHUFPS lowering tolerant of other vector types,
and adjusting our canonicalization to canonicalize harder.
This is the last of the clever uses of symmetry I've thought of for
v8f32. The rest of the tricks I'm aware of here are to work around
assymetry in the mask.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218216 91177308-0d34-0410-b5e6-96231b3b80d8
a generic vector shuffle mask into a helper that isn't specific to the
other things that influence which choice is made or the specific types
used with the instruction.
No functionality changed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218215 91177308-0d34-0410-b5e6-96231b3b80d8
of a single element into a zero vector for v4f64 and v4i64 in AVX.
Ironically, there is less to see here because xor+blend is so crazy fast
that we can't really beat that to zero the high 128-bit lane.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218214 91177308-0d34-0410-b5e6-96231b3b80d8
UNPCKHPS with AVX vectors by recognizing those patterns when they are
repeated for both 128-bit lanes.
With this, we now generate the exact same (really nice) code for
Quentin's avx_test_case.ll which was the most significant regression
reported for the new shuffle lowering. In fact, I'm out of specific test
cases for AVX lowering, the rest were AVX2 I think. However, there are
a bunch of pretty obvious remaining things to improve with AVX...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218213 91177308-0d34-0410-b5e6-96231b3b80d8
important bits of cleverness: to detect and lower repeated shuffle
patterns between the two 128-bit lanes with a single instruction.
This patch just teaches it how to lower single-input shuffles that fit
this model using VPERMILPS. =] There is more that needs to happen here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218211 91177308-0d34-0410-b5e6-96231b3b80d8
v8f32 shuffles in the new vector shuffle lowering code.
This is very cheap to do and makes it much more clear that anything more
expensive but overlapping with this lowering should be selected
afterward (for example using AVX2's VPERMPS). However, no functionality
changed here as without this code we would fall through to create no-op
shuffles of each input and a blend. =]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218209 91177308-0d34-0410-b5e6-96231b3b80d8
VBLENDPD over using VSHUFPD. While the 256-bit variant of VBLENDPD slows
down to the same speed as VSHUFPD on Sandy Bridge CPUs, it has twice the
reciprocal throughput on Ivy Bridge CPUs much like it does everywhere
for 128-bits. There isn't a downside, so just eagerly use this
instruction when it suffices.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218208 91177308-0d34-0410-b5e6-96231b3b80d8
awkward conditions. The readability improvement of this will be even
more important as I generalize it to handle more types.
No functionality changed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218205 91177308-0d34-0410-b5e6-96231b3b80d8
128-bit lane crossings, not 'half' crossings. This came up in code
review ages ago, but I hadn't really addresesd it. Also added some
documentation for the helper.
No functionality changed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218203 91177308-0d34-0410-b5e6-96231b3b80d8
actual support for complex AVX shuffling tricks. We can do independent
blends of the low and high 128-bit lanes of an avx vector, so shuffle
the inputs into place and then do the blend at 256 bits. This will in
many cases remove one blend instruction.
The next step is to permute the low and high halves in-place rather than
extracting them and re-inserting them.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218202 91177308-0d34-0410-b5e6-96231b3b80d8