the new shuffle lowering and an implementation for v4 shuffles.
This allows us to handle non-half-crossing shuffles directly for v4
shuffles, both integer and floating point. This currently misses places
where we could perform the blend via UNPCK instructions, but otherwise
generates equally good or better code for the test cases included to the
existing vector shuffle lowering. There are a few cases that are
entertainingly better. ;]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215702 91177308-0d34-0410-b5e6-96231b3b80d8
target-specific shuffl DAG combines.
We were recognizing the paired shuffles backwards. This code needs to be
replaced anyways as we have the same functionality elsewhere, but I'll
do the refactoring in a follow-up, this is the minimal fix to the
behavior.
In addition to fixing miscompiles with the new vector shuffle lowering,
it also causes the canonicalization to kick in much better, selecting
the smaller encoding variants in lots of places in the new AVX path.
This still isn't quite ideal as we don't need both the shufpd and the
punpck instructions, but that'll get fixed in a follow-up patch.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215690 91177308-0d34-0410-b5e6-96231b3b80d8
broken logic for merging shuffle masks in the face of SM_SentinelZero
mask operands.
While these are '-1' they don't mean 'undef' the way '-1' means in the
pre-legalized shuffle masks. Instead, they mean that the shuffle
operation is forcibly zeroing that lane. Reflect this and explicitly
handle it in a bunch of places. In one place the effect is equivalent
but much more clear. In the rest it was really weirdly broken.
Also, rewrite the entire merging thing to be a more directy operation
with a single loop and just doing math to map the indices through the
various masks.
Also add a bunch of asserts to try to make in extremely clear what the
different masks can possibly look like.
Finally, add some comments to clarify that we're merging shuffle masks
*up* here rather than *down* as we do everywhere else, and thus the
logic is quite confusing.
Thanks to several different people for sending test cases, and for
Robert Khasanov for an initial attempt at fixing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215687 91177308-0d34-0410-b5e6-96231b3b80d8
FastEmit_i won't always succeed to materialize an i32 constant and just fail.
This would trigger a fall-back to SelectionDAG, which is really not necessary.
This fix will first fall-back to a constant pool load to materialize the constant
before giving up for good.
This fixes <rdar://problem/18022633>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215682 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts:
r215595 "[FastISel][X86] Add large code model support for materializing floating-point constants."
r215594 "[FastISel][X86] Use XOR to materialize the "0" value."
r215593 "[FastISel][X86] Emit more efficient instructions for integer constant materialization."
r215591 "[FastISel][AArch64] Make use of the zero register when possible."
r215588 "[FastISel] Let the target decide first if it wants to materialize a constant."
r215582 "[FastISel][AArch64] Cleanup constant materialization code. NFCI."
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215673 91177308-0d34-0410-b5e6-96231b3b80d8
This patch allows a vector fneg of a bitcasted integer value to be optimized in the same way that we already optimize a scalar fneg. If the integer variable is a constant, we can precompute the result and not require any logic ops.
This patch is very similar to a fabs patch committed at r214892.
Differential Revision: http://reviews.llvm.org/D4852
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215646 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This is done by removing some hardcoded registers like $at or expecting a single digit register to be selected.
Contains work done by Matheus Almeida.
Reviewers: matheusalmeida, dsanders
Reviewed By: dsanders
Subscribers: tomatabacu
Differential Revision: http://reviews.llvm.org/D4227
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215640 91177308-0d34-0410-b5e6-96231b3b80d8
lowering scheme.
Currently, this just directly bails to the fallback path of splitting
the 256-bit vector into two 128-bit vectors, operating there, and then
joining the results back together. While the results are far from
perfect, they are *shockingly* good for what we're doing here. I'll be
layering the rest of the functionality on top of this piece by piece and
updating tests as I go.
Note that 256-bit vectors in this mode are still somewhat WIP. While
I think the code paths that I'm adding here are clean and good-to-go,
there are still a lot of 128-bit assumptions that I'll need to stomp out
as I march through the functional spread here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215637 91177308-0d34-0410-b5e6-96231b3b80d8
input node after manually adding it to the worklist and using CombineTo.
Once we use CombineTo the input node may have been deleted. Despite this
being *completely confusing* and somewhat broken, the only way to
"correctly" return from a DAG combine after potentially deleting the
input node is to return *that exact node*....
But really, this code should just never have used CombineTo. It won't do
what it wants (returning the node as mentioned above just causes the
combine to infloop). The correct way to combine away a casted load to
a load of the correct type is to RAUW the chain directly and then return
the loaded value to replace the actual value node.
I managed to find this with the vector shuffle fuzzer even though it
clearly has nothing at all to do with vector shuffles and rather those
happen to trigger a load of a constant pool that hits this combine *just
right*. I've included the test as it is small and a nice stress test
that the infrastructure isn't asserting.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215622 91177308-0d34-0410-b5e6-96231b3b80d8
combining by replacing it with something else but not re-process the
node afterward to remove it.
In a truly remarkable stroke of bad luck, this would (in the test case
attached) end up getting some other node combined into it without ever
getting re-processed. By adding it back on to the worklist, in addition
to deleting the dead nodes more quickly we also ensure that if it
*stops* being dead for any reason it makes it back through the
legalizer. Without this, the test case will end up failing during
instruction selection due to an and node with a type we don't have an
instruction pattern for.
It took many million runs of the shuffle fuzz tester to find this.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215611 91177308-0d34-0410-b5e6-96231b3b80d8
Certain functions such as objc_autoreleaseReturnValue have to be called as
tail-calls even at -O0. Since normal fast-isel doesn't emit calls as tail calls,
we have to fall back to SelectionDAG to select calls that are marked as tail.
<rdar://problem/17991614>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215600 91177308-0d34-0410-b5e6-96231b3b80d8
FastISel didn't take much advantage of the different addressing modes available
to it on AArch64. This commit allows the ComputeAddress method to recognize more
addressing modes that allows shifts and sign-/zero-extensions to be folded into
the memory operation itself.
For Example:
lsl x1, x1, #3 --> ldr x0, [x0, x1, lsl #3]
ldr x0, [x0, x1]
sxtw x1, w1
lsl x1, x1, #3 --> ldr x0, [x0, x1, sxtw #3]
ldr x0, [x0, x1]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215597 91177308-0d34-0410-b5e6-96231b3b80d8
In the large code model for X86 floating-point constants are placed in the
constant pool and materialized by loading from it. Since the constant pool
could be far away, a PC relative load might not work. Therefore we first
materialize the address of the constant pool with a movabsq and then load
from there the floating-point value.
Fixes <rdar://problem/17674628>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215595 91177308-0d34-0410-b5e6-96231b3b80d8
This mostly affects the i64 value type, which always resulted in an 15byte
mobavsq instruction to materialize any constant. The custom code checks the
value of the immediate and tries to use a different and smaller mov
instruction when possible.
This fixes <rdar://problem/17420988>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215593 91177308-0d34-0410-b5e6-96231b3b80d8
This change materializes now the value "0" from the zero register.
The zero register can be folded by several instruction, so no
materialization is need at all.
Fixes <rdar://problem/17924413>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215591 91177308-0d34-0410-b5e6-96231b3b80d8
This changes the order in which FastISel tries to materialize a constant.
Originally it would try to use a simple target-independent approach, which
can lead to the generation of inefficient code.
On X86 this would result in the use of movabsq to materialize any 64bit
integer constant - even for simple and small values such as 0 and 1. Also
some very funny floating-point materialization could be observed too.
On AArch64 it would materialize the constant 0 in a register even the
architecture has an actual "zero" register.
On ARM it would generate unnecessary mov instructions or not use mvn.
This change simply changes the order and always asks the target first if it
likes to materialize the constant. This doesn't fix all the issues
mentioned above, but it enables the targets to implement such
optimizations.
Related to <rdar://problem/17420988>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215588 91177308-0d34-0410-b5e6-96231b3b80d8
This change is also in preparation for a future change to make sure that
the constant materialization uses MOVT/MOVW when available and not a load
from the constant pool.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215584 91177308-0d34-0410-b5e6-96231b3b80d8
This for some reason fixes v1i64 kernel arguments on pre-SI. This
currently breaks some other cases in the kernel-args.ll test for R600,
but I'm not particularly confident in the new output. VTX_READ_* are not
used for some of the scalarized cases, and the code reading from the
constant buffer doesn't make much sense to me.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215564 91177308-0d34-0410-b5e6-96231b3b80d8
This patch improves the existing algorithm in DAGCombiner that
attempts to fold shuffles according to rule:
shuffle(shuffle(x, y, M1), undef, M2) -> shuffle(y, undef, M3)
Before this change, there were cases where the DAGCombiner conservatively
avoided folding shuffles even if the resulting mask would have been legal.
That is because the algorithm wrongly assumed that commuting
an illegal shuffle mask would always produce an illegal mask.
With this change, we now correctly compute the commuted shuffle mask before
calling method 'isShuffleMaskLegal' on it.
On X86, this improves for example the codegen for the following function:
define <4 x i32> @test(<4 x i32> %A, <4 x i32> %B) {
%1 = shufflevector <4 x i32> %B, <4 x i32> %A, <4 x i32> <i32 1, i32 2, i32 6, i32 7>
%2 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 2, i32 3>
ret <4 x i32> %2
}
Before this change the X86 backend (-mcpu=corei7) generated
the following assembly code for function @test:
shufps $-23, %xmm0, %xmm1 # xmm1 = xmm1[1,2],xmm0[2,3]
movhlps %xmm1, %xmm1 # xmm1 = xmm1[1,1]
movaps %xmm1, %xmm0
Now we produce:
movhlps %xmm0, %xmm0 # xmm0 = xmm0[1,1]
Added extra test cases in combine-vec-shuffle-2.ll to verify that we correctly
fold according to the above-mentioned rule.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215555 91177308-0d34-0410-b5e6-96231b3b80d8
Added avx512_movnt_vl multiclass for handling 256/128-bit forms of instruction.
Added encoding and lowering tests.
Reviewed by Elena Demikhovsky <elena.demikhovsky@intel.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215536 91177308-0d34-0410-b5e6-96231b3b80d8
one pesky test case correctly.
This test case caused the old code to infloop occilating between solving
the low-half and the high-half. The 'side balancing' part of
single-input v8 shuffle lowering didn't handle the one pattern which can
cause it to occilate. Fortunately the fuzz testing found this case.
Unfortuately it was *terrible* to handle. I'm really sorry for the
amount and density of the code here, I'd love suggestions on how to
simplify it. I feel like there *must* be a simpler form here, but after
a lot of days I've not found it. This is the only one I've found that
even works. I've added the one pesky test case along with some nice
comments explaining the core problem that we have to solve here.
So far this has survived approximately 32k test cases. More strenuous
fuzzing commencing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215519 91177308-0d34-0410-b5e6-96231b3b80d8
This implements PPCTargetLowering::getTgtMemIntrinsic for Altivec load/store
intrinsics. As with the construction of the MachineMemOperands for the
intrinsic calls used for unaligned load/store lowering, the only slight
complication is that we need to represent a larger memory range than the
loaded/stored value-type size (because the address is rounded down to an
aligned address, and we need to conservatively represent the entire possible
range of the actual access). This required adding an extra size field to
TargetLowering::IntrinsicInfo, and this was done in a way that required no
modifications to other targets (the size defaults to the store size of the
provided memory data type).
This fixes test/CodeGen/PowerPC/unal-altivec-wint.ll (so it can be un-XFAILed).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215512 91177308-0d34-0410-b5e6-96231b3b80d8
Unfortunately, our use of the SDNode class hierarchy for INTRINSIC_W_CHAIN and
INTRINSIC_VOID nodes is somewhat broken right now. These nodes sometimes are
used for memory intrinsics (those with MachineMemOperands), and sometimes not.
When not, the nodes are not created as instances of MemIntrinsicSDNode, but
rather created as some other subclass of SDNode using DAG::getNode. When they
are memory intrinsics, they are created using DAG::getMemIntrinsicNode as
instances of MemIntrinsicSDNode. MemIntrinsicSDNode is a subclass of
MemSDNode, but prior to r214452, we had a non-self-consistent setup whereby
MemIntrinsicSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID would
return true but MemSDNode::classof on INTRINSIC_W_CHAIN and INTRINSIC_VOID
would return false. In r214452, MemSDNode::classof was changed to return true
for INTRINSIC_W_CHAIN and INTRINSIC_VOID, which is now self-consistent. The
problem is that neither the pre-r214452 logic and the post-r214452 logic are
really right. The truth is that not all INTRINSIC_W_CHAIN and INTRINSIC_VOID
nodes are instances of MemIntrinsicSDNode (or MemSDNode for that matter), and
the return value from classof needs to reflect that. This was broken before
r214452 (because MemIntrinsicSDNode::classof always returned true), and was
broken afterward (because MemSDNode::classof also always returned true), and
will now be correct.
The minimal solution is to grab one of the SubclassData bits (there is one left
for MemIntrinsicSDNode nodes) and use it to store whether or not a particular
INTRINSIC_W_CHAIN or INTRINSIC_VOID is really an instance of
MemIntrinsicSDNode or not. Doing this allows both MemIntrinsicSDNode::classof
and MemSDNode::classof to return the correct answer for the underlying object
for both the memory-intrinsic and non-memory-intrinsic cases.
This fixes the problem that r214452 created in the SelectionDAGDumper (thanks
to Matt Arsenault for pointing it out).
Because PowerPC does not implement getTgtMemIntrinsic, this change breaks
test/CodeGen/PowerPC/unal-altivec-wint.ll. I've XFAILed it for now, and will
fix it in a follow-up commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215511 91177308-0d34-0410-b5e6-96231b3b80d8
I think that this will scale better in most cases than adding a Pat<> for each
mapping from the intrinsic DAG to the intruction (i.e. rri, rrik, rrikz). We
can just lower to the SDNode and have the resulting DAG be matches by the DAG
patterns.
Alternatively (long term), we could keep the Pat<>s but generate them via the
new AVX512_masking multiclass. The difficulty is that in order to formulate
that we would have to concatenate DAGs. Currently this is only supported if
the operators of the input DAGs are identical.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215473 91177308-0d34-0410-b5e6-96231b3b80d8
v2: drop enum keyword
use correct extension mode
don't bother computing the sign in unsinged case
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215462 91177308-0d34-0410-b5e6-96231b3b80d8
v2: add tests
rename LowerSDIV24 to LowerSDIVREM24
handle the rem part in this function
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215460 91177308-0d34-0410-b5e6-96231b3b80d8
The combiner ignored DBG nodes when checking
the uses of a virtual register.
It combined a sequence like
%vreg1 = madd %vreg2, %vreg3,...
DBG_VALUE (%vreg1 ...)
%vreg4 = add %vreg1,...
to
%vreg4 = madd %vreg2, %vreg3
leaving behind a dangling DBG_VALUE with
a definition. This triggered an assertion
in the MachineTraceMetrics.cpp module.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215431 91177308-0d34-0410-b5e6-96231b3b80d8
This saves us from having to copy a 64-bit 0 value into VGPRs for
BUFFER_* instruction which only have a 12-bit immediate offset.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215399 91177308-0d34-0410-b5e6-96231b3b80d8
There are no variable values like registers encoded in the low 32 bits of MUBUF
instructions, so it is relatively easy to check these bits, and it will
help prevent us from introducing encoding bugs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215397 91177308-0d34-0410-b5e6-96231b3b80d8
This bit was left uninitialized, which was causing some random failures
of piglit tests.
NOTE: This is a candidate for the 3.5 branch.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215396 91177308-0d34-0410-b5e6-96231b3b80d8
For many Thumb-1 register register instructions, setting the CPSR is not
permitted inside an IT block. We would not correctly flag those instructions.
The previous change to identify this scenario was insufficient as it did not
actually catch all the instances. The current list is formed by manual
inspection of the ARMv6M ARM.
The change to the Thumb2 IT block test is due to the fact that the new more
stringent checking of the MIs results in the If Conversion pass being prevented
from executing (since not all the instructions in the BB are predicable). This
results in code gen changes.
Thanks to Tim Northover for pointing out that the previous patch was
insufficient and hinting that the use of the v6M ARM would be much easier to use
than the v7 or v8!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215382 91177308-0d34-0410-b5e6-96231b3b80d8
By default, LLVM uses the "C" calling convention for all runtime
library functions. The half-precision FP conversion functions use the
soft-float calling convention, and are needed for some targets which
use the hard-float convention by default, so must have their calling
convention explicitly set.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215348 91177308-0d34-0410-b5e6-96231b3b80d8
be propagated to all its users, and this propagation could increase the
probability of finding common subexpressions. If the COPY has only one user,
the COPY itself can be removed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215344 91177308-0d34-0410-b5e6-96231b3b80d8
The ARM ARM states that CPSR may not be updated by a MUL in thumb mode. Due to
an ordering of Thumb 2 Size Reduction and If Conversion, we would end up
generating a THUMB MULS inside an IT block.
The If Conversion pass uses the TTI isPredicable method to ensure that it can
transform a Basic Block. However, because we only check for IT handling on
Thumb2 functions, we may miss some cases. Even then, it only validates that the
CPSR is not *live* rather than it is not accessed. This corrects the handling
for that particular case since the same restriction does not hold on the vast
majority of the instructions.
This does prevent the IfConversion optimization from kicking in in certain
cases, but generating correct code is more valuable. Addresses PR20555.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215328 91177308-0d34-0410-b5e6-96231b3b80d8
These tests were using SI-NOT: MOVREL to make sure concat vectors
weren't being lowered to stack loads and stores, but we are using
scratch buffers for the stack now instead of registers, so we need
to add an additional SI-NOT check for scratch buffers.
With this change I was able to uncover one broken test which will
be fixed in a future commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215269 91177308-0d34-0410-b5e6-96231b3b80d8
I accidentally also used INC/DEC for unsigned arithmetic which doesn't work,
because INC/DEC don't set the required flag which is used for the overflow
check.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@215237 91177308-0d34-0410-b5e6-96231b3b80d8