Added target specific combine rules to fold blend intrinsics according
to the following rules:
1) fold(blend A, A, Mask) -> A;
2) fold(blend A, B, <allZeros>) -> A;
3) fold(blend A, B, <allOnes>) -> B.
Added two new tests to verify that the new folding rules work for all
the optimized blend intrinsics.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208895 91177308-0d34-0410-b5e6-96231b3b80d8
For example
tzcntl %edi, %ebx
testl %edi, %edi
je .label
can be rewritten into
tzcntl %edi, %ebx
jb .label
A minor complication is that tzcnt sets CF instead of ZF when the input
is zero, we have to rewrite users of the flags from ZF to CF. Currently
we recognize patterns using lzcnt, tzcnt and popcnt.
Differential Revision: http://reviews.llvm.org/D3454
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208788 91177308-0d34-0410-b5e6-96231b3b80d8
compared to 'AddrMode.BaseReg'. In the case that 'AddrMode.BaseReg' is
nullptr, 'Result' will also be nullptr, so the cast causes an assertion. We
should use dyn_cast_or_null here to check 'Result' is not null and it is an
instruction.
Bug found by Mats Petersson, and I reduced his IR to get a test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208705 91177308-0d34-0410-b5e6-96231b3b80d8
r208453 added support for having sret on the second parameter. In that
change, the code for copying sret into a virtual register was hoisted
into the loop that lowers formal parameters. This caused a "Wrong
topological sorting" assertion failure during scheduling when a
parameter is passed in memory. This change undoes that by creating a
second loop that deals with sret.
I'm worried that this fix is incomplete. I don't fully understand the
dependence issues. However, with this change we produce the same DAGs
we used to produce, so if they are broken, they are just as broken as
they have always been.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208637 91177308-0d34-0410-b5e6-96231b3b80d8
1) Changed gather and scatter intrinsics. Now they are aligned with GCC built-ins. There is no more non-masked form. Masked intrinsic receives -1 if all lanes are executed.
2) I changed the function that works with intrinsics inside X86ISelLowering.cpp. I put all intrinsics in one table. I did it for INTRINSICS_W_CHAIN and plan to put all intrinsics from WO_CHAIN set to the same table in order to avoid the long-long "switch". (I wanted to use static map initialization that allowed by C++11 but I wasn't able to compile it on VS2012).
3) I added gather/scatter prefetch intrinsics.
4) I fixed MRMm encoding for masked instructions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208522 91177308-0d34-0410-b5e6-96231b3b80d8
When lowering build_vector to an insertps, we would still lower it, even
if the source vectors weren't v4x32. This would break on avx if the source
was a v8x32. We now check the type of the source vectors.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208487 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r200561.
This calling convention was an attempt to match the MSVC C++ ABI for
methods that return structures by value. This solution didn't scale,
because it would have required splitting every CC available on Windows
into two: one for methods and one for free functions.
Now that we can put sret on the second arg (r208453), and Clang does
that (r208458), revert this hack.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208459 91177308-0d34-0410-b5e6-96231b3b80d8
MSVC always places the implicit sret parameter after the implicit this
parameter of instance methods. We used to handle this for
x86_thiscallcc by allocating the sret parameter on the stack and leaving
the this pointer in ecx, but that doesn't handle alternative calling
conventions like cdecl, stdcall, fastcall, or the win64 convention.
Instead, change the verifier to allow sret on the second parameter.
This also requires changing the Mips and X86 backends to return the
argument with the sret parameter, instead of assuming that the sret
parameter comes first.
The Sparc backend also returns sret parameters in a register, but I
wasn't able to update it to handle secondary sret parameters. It
currently calls report_fatal_error if you feed it an sret in the second
parameter.
Reviewers: rafael.espindola, majnemer
Differential Revision: http://reviews.llvm.org/D3617
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208453 91177308-0d34-0410-b5e6-96231b3b80d8
This patch teaches the backend how to combine packed SSE2/AVX2 arithmetic shift
intrinsics.
The rules are:
- Always fold a packed arithmetic shift by zero to its first operand;
- Convert a packed arithmetic shift intrinsic dag node into a ISD::SRA only if
the shift count is known to be smaller than the vector element size.
This patch also teaches to function 'getTargetVShiftByConstNode' how fold
target specific vector shifts by zero.
Added two new tests to verify that the DAGCombiner is able to fold
sequences of SSE2/AVX2 packed arithmetic shift calls.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208342 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.
Also optimize when we move one element in a vector to a different place
in that vector while zeroing out some of the other elements.
Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.
Added some tests for different cases where this optimization triggers.
Reviewers: nadav, delena, craig.topper
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3521
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208271 91177308-0d34-0410-b5e6-96231b3b80d8
Prior to r208252, the FMA 231 family was marked as isCommutable. However the
memory variants of this family are not commutable. Therefore, we did not
implemented the findCommutedOpIndices for those variants and missed that
the default implementation (more or less: commute indices 1 and 2) was
firing behind our back.
As a result, as demonstrated in the test case before the fix, we were
transforming a = b * c + a into a = a * c + b.
I.e., before r208252 we were generating for this test case:
vmovaps %xmm0, %xmm1
vmoss (%rsi), %xmm0
vfmadd231ss (%rdi), %xmm1, %xmm0
Instead of:
vmoss (%rsi), %xmm1
vfmadd231ss (%rdi), %xmm1, %xmm0
<rdar://problem/16800495>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208260 91177308-0d34-0410-b5e6-96231b3b80d8
Before this patch, the backend always emitted a store+load sequence to
bitconvert from f64 to i64 the input operand of a ISD::BITCAST dag node that
performed a bitconvert from type MVT::f64 to type MVT::v2i32. The resulting
i64 node was then used to build a v2i32 vector.
With this patch, the backend now produces a cheaper SCALAR_TO_VECTOR from
MVT::f64 to MVT::v2f64. That SCALAR_TO_VECTOR is then followed by a "free"
bitcast to type MVT::v4i32. The elements of the resulting
v4i32 are then extracted to build a v2i32 vector (which is illegal and
therefore promoted to MVT::v2i64).
This is in general cheaper than emitting a stack store+load sequence
to bitconvert the operand from type f64 to type i64.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208107 91177308-0d34-0410-b5e6-96231b3b80d8
This patch implements the infrastructure to use named register constructs in
programs that need access to specific registers (bare metal, kernels, etc).
So far, only the stack pointer is supported as a technology preview, but as it
is, the intrinsic can already support all non-allocatable registers from any
architecture.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208104 91177308-0d34-0410-b5e6-96231b3b80d8
The Win64 docs are very clear that anything larger than 8 bytes is
passed by reference, and GCC MinGW64 honors that for __modti3 and
friends.
Patch by Jameson Nash!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@208029 91177308-0d34-0410-b5e6-96231b3b80d8
Both MinGW and cygwin (i686) construct export directives without the global
leader prefix. This is mostly due to the fact that they use GNU ld which does
not correctly handle the export directive. This apparently has been been broken
for a while. However, this was recently reported as being broken by
mingwandroid and diorcety of the msys2 project.
Remove the global leader prefix if targeting MinGW or cygwin, otherwise, retain
the global leader prefix. Add an explicit test for cygwin's behaviour of export
directives.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207926 91177308-0d34-0410-b5e6-96231b3b80d8
Given the following C code llvm currently generates suboptimal code for
x86-64:
__m128 bss4( const __m128 *ptr, size_t i, size_t j )
{
float f = ptr[i][j];
return (__m128) { f, f, f, f };
}
=================================================
define <4 x float> @_Z4bss4PKDv4_fmm(<4 x float>* nocapture readonly %ptr, i64 %i, i64 %j) #0 {
%a1 = getelementptr inbounds <4 x float>* %ptr, i64 %i
%a2 = load <4 x float>* %a1, align 16, !tbaa !1
%a3 = trunc i64 %j to i32
%a4 = extractelement <4 x float> %a2, i32 %a3
%a5 = insertelement <4 x float> undef, float %a4, i32 0
%a6 = insertelement <4 x float> %a5, float %a4, i32 1
%a7 = insertelement <4 x float> %a6, float %a4, i32 2
%a8 = insertelement <4 x float> %a7, float %a4, i32 3
ret <4 x float> %a8
}
=================================================
shlq $4, %rsi
addq %rdi, %rsi
movslq %edx, %rax
vbroadcastss (%rsi,%rax,4), %xmm0
retq
=================================================
The movslq is uneeded, but is present because of the trunc to i32 and then
sext back to i64 that the backend adds for vbroadcastss.
We can't remove it because it changes the meaning. The IR that clang
generates is already suboptimal. What clang really should emit is:
%a4 = extractelement <4 x float> %a2, i64 %j
This patch makes that legal. A separate patch will teach clang to do it.
Differential Revision: http://reviews.llvm.org/D3519
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207801 91177308-0d34-0410-b5e6-96231b3b80d8
There is no need to check if we want to hoist the immediate value of an
shift instruction. Simply return TCC_Free right away.
This change is like r206101, but for X86.
rdar://problem/16190769
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207692 91177308-0d34-0410-b5e6-96231b3b80d8
Currently, musttail codegen is relying on sibcall optimization, and
reporting a fatal error if fails. Sibcall optimization fails when stack
arguments need to be modified, which is insufficient for musttail.
The logic for moving arguments in memory safely is already implemented
for GuaranteedTailCallOpt. This change merely arranges for musttail
calls to use it.
No functional change for GuaranteedTailCallOpt.
Reviewers: espindola
Differential Revision: http://reviews.llvm.org/D3493
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207598 91177308-0d34-0410-b5e6-96231b3b80d8
This gets us pretty code for divs of i16 vectors. Turn the existing
intrinsics into the corresponding nodes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207317 91177308-0d34-0410-b5e6-96231b3b80d8
Otherwise the legalizer would just scalarize everything. Support for
mulhi in the targets isn't that great yet so on most targets we get
exactly the same scalarized output. Add a test for x86 vector udiv.
I had to disable the mulhi nodes on ARM because there aren't any patterns
for it. As far as I know ARM has instructions for getting the high part of
a multiply so this should be fixed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207315 91177308-0d34-0410-b5e6-96231b3b80d8
The included test case would return the incorrect results, because the expansion
of an shift with a constant shift amount of 0 would generate undefined behavior.
This is because ExpandShiftByConstant assumes that all shifts by constants with
a value of 0 have already been optimized away. This doesn't happen for opaque
constants and usually this isn't a problem, because opaque constants won't take
this code path - they are not supposed to. In the case that the opaque constant
has to be expanded by the legalizer, the legalizer would drop the opaque flag.
In this case we hit the limitations of ExpandShiftByConstant and create incorrect
code.
This commit fixes the legalizer by not dropping the opaque flag when expanding
opaque constants and adding an assertion to ExpandShiftByConstant to catch this
not supported case in the future.
This fixes <rdar://problem/16718472>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207304 91177308-0d34-0410-b5e6-96231b3b80d8
Scaling factors are not free on X86 because every "complex" addressing mode
breaks the related instruction into 2 allocations instead of 1.
<rdar://problem/16730541>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207301 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
If we're doing a v4f32/v4i32 shuffle on x86 with SSE4.1, we can lower
certain shufflevectors to an insertps instruction:
When most of the shufflevector result's elements come from one vector (and
keep their index), and one element comes from another vector or a memory
operand.
Added tests for insertps optimizations on shufflevector.
Added support and tests for v4i32 vector optimization.
Reviewers: nadav
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3475
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207291 91177308-0d34-0410-b5e6-96231b3b80d8
This is similar to the 'tail' marker, except that it guarantees that
tail call optimization will occur. It also comes with convervative IR
verification rules that ensure that tail call optimization is possible.
Reviewers: nicholas
Differential Revision: http://llvm-reviews.chandlerc.com/D3240
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207143 91177308-0d34-0410-b5e6-96231b3b80d8
This patch:
- Adds two new X86 builtin intrinsics ('int_x86_rdtsc' and
'int_x86_rdtscp') as GCCBuiltin intrinsics;
- Teaches the backend how to lower the two new builtins;
- Introduces a common function to lower READCYCLECOUNTER dag nodes
and the two new rdtsc/rdtscp intrinsics;
- Improves (and extends) the existing x86 test 'rdtsc.ll'; now test 'rdtsc.ll'
correctly verifies that both READCYCLECOUNTER and the two new intrinsics
work fine for both 64bit and 32bit Subtargets.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207127 91177308-0d34-0410-b5e6-96231b3b80d8
This allows us to compile
return (mask & 0x8 ? a : b);
into
testb $8, %dil
cmovnel %edx, %esi
instead of
andl $8, %edi
shrl $3, %edi
cmovnel %edx, %esi
which we formed previously because dag combiner canonicalizes setcc of and into shift.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207088 91177308-0d34-0410-b5e6-96231b3b80d8
Emit the flag to indicate to the assembler that a section contains data if there
is pre-populated data present.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@207028 91177308-0d34-0410-b5e6-96231b3b80d8