Commit Graph

5519 Commits

Author SHA1 Message Date
Robert Khasanov
edf556ec1f [x86] Simplify vector selection if condition value type matches vselect value type and true value is all ones or false value is all zeros.
This transformation worked if selector is produced by SETCC, however SETCC is needed only if we consider to swap operands. So I replaced SETCC check for this case.
Added tests for vselect of <X x i1> values.


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220777 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-28 15:59:40 +00:00
Robert Khasanov
4a52493457 [AVX512] Bring back vector-shuffle lowering support through broadcasts
Ffter commit at rev219046 512-bit broadcasts lowering become non-optimal. Most of tests on broadcasting and embedded broadcasting were changed and they doesn’t produce efficient code.

Example below is from commit changes (it’s the first test from test/CodeGen/X86/avx512-vbroadcast.ll):

 define   <16 x i32> @_inreg16xi32(i32 %a) {
 ; CHECK-LABEL: _inreg16xi32:
 ; CHECK:       ## BB#0:
-; CHECK-NEXT:    vpbroadcastd %edi, %zmm0
+; CHECK-NEXT:    vmovd %edi, %xmm0
+; CHECK-NEXT:    vpbroadcastd %xmm0, %ymm0
+; CHECK-NEXT:    vinserti64x4 $1, %ymm0, %zmm0, %zmm0
 ; CHECK-NEXT:    retq
 %b = insertelement <16 x i32> undef, i32 %a, i32 0
 %c = shufflevector <16 x i32> %b, <16 x i32> undef, <16 x i32> zeroinitializer
 ret <16 x i32> %c
}

Here, 256-bit broadcast was generated instead of 512-bit one.

In this patch
1) I added vector-shuffle lowering through broadcasts
2) Removed asserts and branches likes because this is incorrect
-  assert(Subtarget->hasDQI() && "We can only lower v8i64 with AVX-512-DQI");
3) Fixed lowering tests


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220774 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-28 12:28:51 +00:00
Reid Kleckner
d5de327da0 X86: Implement the vectorcall calling convention
This is a Microsoft calling convention that supports both x86 and x86_64
subtargets. It passes vector and floating point arguments in XMM0-XMM5,
and passes them indirectly once they are consumed.

Homogenous vector aggregates of up to four elements can be passed in
sequential vector registers, but this part is not implemented in LLVM
and will be handled in Clang.

On 32-bit x86, it is similar to fastcall in that it uses ecx:edx as
integer register parameters and is callee cleanup. On x86_64, it
delegates to the normal win64 calling convention.

Reviewers: majnemer

Differential Revision: http://reviews.llvm.org/D5943

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220745 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-28 01:29:26 +00:00
Pete Cooper
68aeef61f4 Fix a stackmap bug introduced in r220710.
For a call to not return in to the stackmap shadow, the shadow must end with the call.

To do this, we must insert any required nops *before* the call, and not after it.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220728 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-27 22:38:45 +00:00
Pete Cooper
7476f9c513 Stackmap shadows should consider call returns a branch target.
To avoid emitting too many nops, a stackmap shadow can include emitted instructions in the shadow, but these must not include branch targets.

A return from a call should count as a branch target as patching over the instructions after the call would lead to incorrect behaviour for threads currently making that call, when they return.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220710 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-27 19:40:35 +00:00
Simon Pilgrim
44efa200e2 [X86][SSE] Bitcast assertion in XFormVExtractWithShuffleIntoLoad
Minor patch to fix an issue in XFormVExtractWithShuffleIntoLoad where a load is unary shuffled, then bitcast (to a type with the same number of elements) before extracting an element.

An undef was created for the second shuffle operand using the original (post-bitcasted) vector type instead of the pre-bitcasted type like the rest of the shuffle node - this was then causing an assertion on the different types later on inside SelectionDAG::getVectorShuffle.

Differential Revision: http://reviews.llvm.org/D5917



git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220592 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-24 21:04:41 +00:00
Sanjay Patel
51fa1bcef3 Allow AVX vrsqrtps generation.
This is a follow-on to r220570 that allows a 256-bit (v8f32)
version of vrsqrtps to be generated.



git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220579 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-24 17:59:18 +00:00
Sanjay Patel
a46f06efe2 Use rsqrt (X86) to speed up reciprocal square root calcs
This is a first step for generating SSE rsqrt instructions for
reciprocal square root calcs when fast-math is allowed.

For now, be conservative and only enable this for AMD btver2
where performance improves significantly - for example, 29%
on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c
(if we convert the data type to single-precision float).

This patch adds a two constant version of the Newton-Raphson
refinement algorithm to DAGCombiner that can be selected by any target
via a parameter returned by getRsqrtEstimate()..

See PR20900 for more details:
http://llvm.org/bugs/show_bug.cgi?id=20900

Differential Revision: http://reviews.llvm.org/D5658



git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220570 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-24 17:02:16 +00:00
Tim Northover
c0ecebb32b ScheduleDAG: record PhysReg dependencies represented by CopyFromReg nodes
x86's CMPXCHG -> EFLAGS consumer wasn't being recorded as a real EFLAGS
dependency because it was represented by a pair of CopyFromReg(EFLAGS) ->
CopyToReg(EFLAGS) nodes. ScheduleDAG was expecting the source to be an
implicit-def on the instruction, where the result numbers in the DAG and the
Uses list in TableGen matched up precisely.

The Copy notation seems much more robust, so this patch extends ScheduleDAG
rather than refactoring x86.

Should fix PR20376.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220529 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-23 22:31:48 +00:00
Ahmed Bougacha
4d962b05ca [X86] Improve mul w/ overflow codegen, to MUL8+SETO.
Currently, @llvm.smul.with.overflow.i8 expands to 9 instructions, where
3 are really needed.

This adds X86ISD::UMUL8/SMUL8 SD nodes, and custom lowers them to
MUL8/IMUL8 + SETO.

i8 is a special case because there is no two/three operand variants of
(I)MUL8, so the first operand and return value need to go in AL/AX.

Also, we can't write patterns for these instructions: TableGen refuses
patterns where output operands don't match SDNode results. In this case,
instructions where the output operand is an implicitly defined register.

A related special case (and FIXME) exists for MUL8 (X86InstrArith.td):

  // FIXME: Used for 8-bit mul, ignore result upper 8 bits.
  // This probably ought to be moved to a def : Pat<> if the
  // syntax can be accepted.
  [(set AL, (mul AL, GR8:$src)), (implicit EFLAGS)]

Ideally, these go away with UMUL8, but we still need to improve TableGen
support of implicit operands in patterns.

Before this change:
  movsbl  %sil, %eax
  movsbl  %dil, %ecx
  imull   %eax, %ecx
  movb    %cl, %al
  sarb    $7, %al
  movzbl  %al, %eax
  movzbl  %ch, %esi
  cmpl    %eax, %esi
  setne   %al

After:
  movb    %dil, %al
  imulb   %sil
  seto    %al

Also, remove a made-redundant testcase for PR19858, and enable more FastISel
ALU-overflow tests for SelectionDAG too.

Differential Revision: http://reviews.llvm.org/D5809


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220516 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-23 21:55:31 +00:00
Reid Kleckner
e852ac7772 Revert "Don't count inreg params when mangling fastcall functions"
This reverts commit r214981.

I'm not sure what I was thinking when I wrote this. Testing with MSVC
shows that this function is mangled to '@f@8':
  int __fastcall f(int a, int b);

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220492 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-23 17:50:42 +00:00
Matt Arsenault
015776f38c Add minnum / maxnum codegen
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220342 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-21 23:01:01 +00:00
Rafael Espindola
45968c54e9 Fix a bit of confusion about .set and produce more readable assembly.
Every target we support has support for assembly that looks like

a = b - c
.long a

What is special about MachO is that the above combination suppresses the
production of a relocation.

With this change we avoid producing the intermediary labels when they don't
add any value.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220256 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-21 01:17:30 +00:00
Quentin Colombet
a37862e2de [X86] Fix a bug in the lowering of the mask of VSELECT.
X86 code to lower VSELECT messed a bit with the bits set in the mask of VSELECT
when it knows it can be lowered into BLEND. Indeed, only the high bits need to be
set for those and it optimizes those accordingly.
However, when the mask is a compile time constant, the lowering will be handled
by the generic optimizer and those modifications will generate bad code in the
generic optimizer.

This patch fixes that by preventing the optimization if the VSELECT will be
handled by the generic optimizer.

<rdar://problem/18675020>


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220242 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-20 23:13:30 +00:00
Simon Pilgrim
0d1978b813 [X86] Memory folding for commutative instructions (updated)
This patch improves support for commutative instructions in the x86 memory folding implementation by attempting to fold a commuted version of the instruction if the original folding fails - if that folding fails as well the instruction is 're-commuted' back to its original order before returning.

Updated version of r219584 (reverted in r219595) - the commutation attempt now explicitly ensures that neither of the commuted source operands are tied to the destination operand / register, which was the source of all the regressions that occurred with the original patch attempt.

Added additional regression test case provided by Joerg Sonnenberger.

Differential Revision: http://reviews.llvm.org/D5818



git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220239 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-20 22:14:22 +00:00
Pete Cooper
185a992dcf Check for dynamic alloca's when selecting lifetime intrinsics.
TL;DR: Indexing maps with [] creates missing entries.

The long version:

When selecting lifetime intrinsics, we index the *static* alloca map with the AllocaInst we find for that lifetime.  Trouble is, we don't first check to see if this is a dynamic alloca.

On the attached example, this causes a dynamic alloca to create an entry in the static map, and returns 0 (the default) as the frame index for that lifetime.  0 was used for the frame index of the stack protector, which given that it now has a lifetime, is coloured, and merged with other stack slots.

PEI would later trigger an assert because it expects the stack protector to not be dead.

This fix ensures that we only get frame indices for static allocas, ie, those in the map.  Dynamic ones are effectively dropped, which is suboptimal, but at least isn't completely broken.

rdar://problem/18672951

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220099 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-17 22:59:33 +00:00
Juergen Ributzka
32ef68718d [Stackmaps] Enable invoking the patchpoint intrinsic.
Patch by Kevin Modzelewski
Reviewers: atrick, ributzka
Reviewed By: ributzka
Subscribers: llvm-commits, reames

Differential Revision: http://reviews.llvm.org/D5634

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220055 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-17 17:39:00 +00:00
Andrea Di Biagio
5512b50db5 [X86] Fix missed selection of non-temporal store of zero vector.
When the input to a store instruction was a zero vector, the backend
always selected a normal vector store regardless of the non-temporal
hint. This is fixed by this patch.

This fixes PR19370.


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220054 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-17 17:27:06 +00:00
Rafael Espindola
2f8f1d34e3 Delete -std-compile-opts.
These days -std-compile-opts was just a silly alias for -O3.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219951 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-16 20:00:02 +00:00
Adam Nemet
fb9d61a8d6 [AVX512] Add DQ subvector inserts
In AVX512f we support 64x2 and 32x8 inserts via matching them to 32x4 and 64x4
respectively.  These are matched by "Alt" Pat<>'s (Alt stands for alternative
VTs).

Since DQ has native support for these intructions, I peeled off the non-"Alt"
part of the baseclass into vinsert_for_size_no_alt. The DQ instructions are
derived from this multiclass.  The "Alt" Pat<>'s are disabled with DQ.

Fixes <rdar://problem/18426089>

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219874 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-15 23:42:17 +00:00
Adam Nemet
ec7f30662e [AVX512] Add SKX testing to avx512-insert-extract.ll
This is in preparation to adding DQ subvector inserts to this testcase.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219873 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-15 23:42:14 +00:00
Adam Nemet
f9e3a3afa1 [AVX512] Fix test to produce a defined value
We're inserting into a 8 wide vector, so the index should be < 8.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219872 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-15 23:42:11 +00:00
Jingyue Wu
75d77cd179 [MachineSink] Use the real post dominator tree
Summary:
Fixes a FIXME in MachineSinking. Instead of using the simple heuristics in
isPostDominatedBy, use the real MachinePostDominatorTree and MachineLoopInfo.
The old heuristics caused instructions to sink unnecessarily, and might create
register pressure.

This is the second try of the fix. The first one (D4814) caused a performance
regression due to failing to sink instructions out of loops (PR21115). This
patch fixes PR21115 by sinking an instruction from a deeper loop to a shallower
one regardless of whether the target block post-dominates the source.

Thanks Alexey Volkov for reporting PR21115! 

Test Plan:
Added a NVPTX codegen test to verify that our change prevents the backend from
over-sinking. It also shows the unnecessary register pressure caused by
over-sinking.

Added an X86 test to verify we can sink instructions out of loops regardless of
the dominance relationship. This test is reduced from Alexey's test in PR21115.

Updated an affected test in X86.

Also ran SPEC CINT2006 and llvm-test-suite for compilation time and runtime
performance. Results are attached separately in the review thread.

Reviewers: Jiangning, resistor, hfinkel

Reviewed By: hfinkel

Subscribers: hfinkel, bruno, volkalexey, llvm-commits, meheff, eliben, jholewinski

Differential Revision: http://reviews.llvm.org/D5633

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219773 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-15 03:27:43 +00:00
Simon Pilgrim
84a3feea38 [X86][SSE] pslldq/psrldq shuffle mask decodes
Patch to provide shuffle decodes and asm comments for the sse pslldq/psrldq SSE2/AVX2 byte shift instructions.

Differential Revision: http://reviews.llvm.org/D5598


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219738 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-14 22:31:34 +00:00
Filipe Cabecinhas
40251eb0b0 Fix a broadcast related regression on the vector shuffle lowering.
Summary: Test by Robert Lougher!

Reviewers: chandlerc

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5745

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219617 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-13 16:16:16 +00:00
NAKAMURA Takumi
58c0f65bf2 Revert r219584, "[X86] Memory folding for commutative instructions."
It broke i686 selfhosting.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219595 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-13 04:17:34 +00:00
Simon Pilgrim
c00cd8e3c8 [X86] Memory folding for commutative instructions.
This patch improves support for commutative instructions in the x86 memory folding implementation by attempting to fold a commuted version of the instruction if the original folding fails - if that folding fails as well the instruction is 're-commuted' back to its original order before returning.

This mainly helps the stack inliner better fold reloads of 3 (or more) operand instructions (VEX encoded SSE etc.) but by performing this in the lowest foldMemoryOperandImpl implementation it also replaces the X86InstrInfo::optimizeLoadInstr version and is now used by FastISel too.

Differential Revision: http://reviews.llvm.org/D5701


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219584 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-12 10:52:55 +00:00
NAKAMURA Takumi
e9bc1e8263 llvm/test/CodeGen: Some tests don't REQUIRE asserts any more. Remove them.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219581 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-12 06:47:47 +00:00
Hal Finkel
d3aa46a1bc [MiSched] Fix a logic error in tryPressure()
Fixes a logic error in the MachineScheduler found by Steve Montgomery (and
confirmed by Andy). This has gone unfixed for months because the fix has been
found to introduce some small performance regressions. However, Andy has
recommended that, at this point, we fix this to avoid further dependence on the
incorrect behavior (and then follow-up separately on any regressions), and I
agree.

Fixes PR18883.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219512 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-10 17:06:20 +00:00
Adam Nemet
fbd0e464dd [AVX512] Intrinsics for vextract*x4
This adds the Pat<>'s for the intrinsics.  These are necessary because we
don't lower these intrinsics to SDNodes but match them directly.  See the
rational in the previous commit.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219362 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-08 23:25:37 +00:00
Robin Morisset
b79d91ca1c [X86] Don't transform atomic-load-add into an inc/dec when inc/dec is slow
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219357 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-08 23:16:23 +00:00
Robin Morisset
48dfa127d7 [X86] Avoid generating inc/dec when slow for x.atomic_store(1 + x.atomic_load())
Summary:
I had forgotten to check for NotSlowIncDec in the patterns that can generate
inc/dec for the above pattern (added in D4796).
This currently applies to Atom Silvermont, KNL and SKX.

Test Plan: New checks on atomic_mi.ll

Reviewers: jfb, nadav

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5677

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219336 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-08 19:38:18 +00:00
Robert Khasanov
0e3754615e [AVX512] Added intrinsics for 128-, 256- and 512-bit versions of VPCMP/VPCMPU{BWDQ}
Added CMP_MASK_CC intrinsic type.
Added tests for intrinsics.

Patch by Sergey Lisitsyn <sergey.lisitsyn@intel.com>


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219316 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-08 15:49:26 +00:00
Robin Morisset
28127ffb51 [X86] Fix a bug with fetch_add(INT32_MIN)
Summary:
Fix pr21099

The pseudocode of what we were doing (spread through two functions) was:
if (operand.doesNotFitIn32Bits())
  Opc.initializeWithFoo();
if (operand < 0)
  operand = -operand;
if (operand.doesFitIn8Bits())
  Opc.initializeWithBar();
else if (operand.doesFitIn32Bits())
  Opc.initializeWithBlah();
doStuff(Opc);

So for operand == INT32_MIN, Opc was never initialized because the operand changes
from fitting in 32 bits to not fitting, causing the various bugs/error messages
noted by pr21099.

This patch adds an extra test at the beginning for this case, and an
llvm_unreachable to have better error message if the operand ends up
not fitting in 32-bits at the end.

Test Plan: new test + make check

Reviewers: jfb

Subscribers: llvm-commits

Differential Revision: http://reviews.llvm.org/D5655

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219257 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-07 23:53:57 +00:00
Hal Finkel
807111a8f4 [DAGCombine] Remove SIGN_EXTEND-related inf-loop
The patch's author points out that, despite the function's documentation,
getSetCCResultType is only used to get the SETCC result type (with one
here-removed problematic exception). In one case, getSetCCResultType was being
used to get the predicate type to use for a SELECT node, and then
SIGN_EXTENDing (or truncating) to get the input predicate to match that type.
Unfortunately, this was happening inside visitSIGN_EXTEND, and creating new
SIGN_EXTEND nodes was causing an infinite loop. In addition, this behavior was
wrong if a target was not using ZeroOrNegativeOneBooleanContent. Lastly, the
extension/truncation seems unnecessary here: SELECT is defined as:

  Select(COND, TRUEVAL, FALSEVAL). If the type of the boolean COND is not i1
  then the high bits must conform to getBooleanContents.

So here we remove this use of getSetCCResultType and update
getSetCCResultType's documentation to reflect its actual uses.

Patch by deadal nix!

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219141 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-06 20:19:47 +00:00
Chandler Carruth
1d02acb7a0 [x86] Remove the 2-addr-to-3-addr "optimization" from shufps to pshufd.
This trades a (register-renamer-friendly) movaps for a floating point
/ integer domain cross. That is a very bad trade, even on architectures
where domain crossing is relatively fast. On any chip where there is
even a cycle stall, this is a Very Bad Idea. It doesn't even seem likely
to cause a spill to be introduced because the reason for the copy is to
destructively shuffle in place.

Thanks to Ben Kramer for fixing a bug in this code that my new shuffle
lowering exposed and highlighting that perhaps it should just go away.
=]

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219090 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 22:57:31 +00:00
Chandler Carruth
560bddce20 [x86, dag] Teach the DAG combiner to prune inputs toa vector_shuffle
that are unused.

This allows the combiner to delete math feeding shuffles where the math
isn't actually necessary. This improves some of the vperm2x128 tests
that regressed when the vector shuffle lowering started actually
generating vperm instructions rather than forcibly decomposing them.

Sadly, this isn't enough to get this *really* right because we still
form a completely unnecessary permutation. To fix that, we also need to
fold shuffles which just rearrange concatenated or inserted subvectors.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219086 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 19:14:34 +00:00
Benjamin Kramer
88b3a52eec X86: Don't drop half of the mask when converting 2-address shufps into 3-address pshufd.
It's debatable whether this transform is useful at all, but for now make sure
we don't generate invalid asm.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219084 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 16:14:29 +00:00
Elena Demikhovsky
a0cb2c75b0 AVX-512-SKX: Added instruction VPMOVM2B/W/D/Q.
This instruction allows to broadacst mask vector to data vector.


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219083 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 14:11:08 +00:00
Chandler Carruth
30ae72f02c [x86] Fix PR21139, one of the last remaining regressions found in the
new vector shuffle lowering.

This is loosely based on a patch by Marius Wachtler to the PR (thanks!).
I refactored it a bi to use std::count_if and a mutable array ref but
the core idea was exactly right. I also added some direct testing of
this case.

I believe PR21137 is now the only remaining regression.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219081 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 12:07:34 +00:00
Chandler Carruth
a644b090de [x86] Teach the new vector shuffle lowering how to lower 128-bit
shuffles using AVX and AVX2 instructions. This fixes PR21138, one of the
few remaining regressions impacting benchmarks from the new vector
shuffle lowering.

You may note that it "regresses" many of the vperm2x128 test cases --
these were actually "improved" by the naive lowering that the new
shuffle lowering previously did. This regression gave me fits. I had
this patch ready-to-go about an hour after flipping the switch but
wasn't sure how to have the best of both worlds here and thought the
correct solution might be a completely different approach to lowering
these vector shuffles.

I'm now convinced this is the correct lowering and the missed
optimizations shown in vperm2x128 are actually due to missing
target-independent DAG combines. I've even written most of the needed
DAG combine and will submit it shortly, but this part is ready and
should help some real-world benchmarks out.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219079 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-05 11:41:36 +00:00
Chandler Carruth
9e8c2e8568 [x86] Slap a triple on this test since it is poking around at the stack
and calling conventions. Otherwise its too hard to craft a usefully
generic set of assertions.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219047 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-04 04:22:55 +00:00
Chandler Carruth
03a77831cc [x86] Enable the new vector shuffle lowering by default.
Update the entire regression test suite for the new shuffles. Remove
most of the old testing which was devoted to the old shuffle lowering
path and is no longer relevant really. Also remove a few other random
tests that only really exercised shuffles and only incidently or without
any interesting aspects to them.

Benchmarking that I have done shows a few small regressions with this on
LNT, zero measurable regressions on real, large applications, and for
several benchmarks where the loop vectorizer fires in the hot path it
shows 5% to 40% improvements for SSE2 and SSE3 code running on Sandy
Bridge machines. Running on AMD machines shows even more dramatic
improvements.

When using newer ISA vector extensions the gains are much more modest,
but the code is still better on the whole. There are a few regressions
being tracked (PR21137, PR21138, PR21139) but by and large this is
expected to be a win for x86 generated code performance.

It is also more correct than the code it replaces. I have fuzz tested
this extensively with ISA extensions up through AVX2 and found no
crashes or miscompiles (yet...). The old lowering had a few miscompiles
and crashers after a somewhat smaller amount of fuzz testing.

There is one significant area where the new code path lags behind and
that is in AVX-512 support. However, there was *extremely little*
support for that already and so this isn't a significant step backwards
and the new framework will probably make it easier to implement lowering
that uses the full power of AVX-512's table-based shuffle+blend (IMO).

Many thanks to Quentin, Andrea, Robert, and others for benchmarking
assistance. Thanks to Adam and others for help with AVX-512. Thanks to
Hal, Eric, and *many* others for answering my incessant questions about
how the backend actually works. =]

I will leave the old code path in the tree until the 3 PRs above are at
least resolved to folks' satisfaction. Then I will rip it (and 1000s of
lines of code) out. =] I don't expect this flag to stay around for very
long. It may not survive next week.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219046 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-04 03:52:55 +00:00
Chandler Carruth
f159de96bd [x86] Add a really preposterous number of patterns for matching all of
the various ways in which blends can be used to do vector element
insertion for lowering with the scalar math instruction forms that
effectively re-blend with the high elements after performing the
operation.

This then allows me to bail on the element insertion lowering path when
we have SSE4.1 and are going to be doing a normal blend, which in turn
restores the last of the blends lost from the new vector shuffle
lowering when I got it to prioritize insertion in other cases (for
example when we don't *have* a blend instruction).

Without the patterns, using blends here would have regressed
sse-scalar-fp-arith.ll *completely* with the new vector shuffle
lowering. For completeness, I've added RUN-lines with the new lowering
here. This is somewhat superfluous as I'm about to flip the default, but
hey, it shows that this actually significantly changed behavior.

The patterns I've added are just ridiculously repetative. Suggestions on
making them better very much welcome. In particular, handling the
commuted form of the v2f64 patterns is somewhat obnoxious.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219033 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 22:43:17 +00:00
Chandler Carruth
91ea3e41ae [x86] Adjust the patterns for lowering X86vzmovl nodes which don't
perform a load to use blendps rather than movss when it is available.

For non-loads, blendps is *much* faster. It can execute on two ports in
Sandy Bridge and Ivy Bridge, and *three* ports on Haswell. This fixes
one of the "regressions" from aggressively taking the "insertion" path
in the new vector shuffle lowering.

This does highlight one problem with blendps -- it isn't commuted as
heavily as it should be. That's future work though.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219022 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 21:38:49 +00:00
Duncan P. N. Exon Smith
83902832de Revert "Revert "DI: Fold constant arguments into a single MDString""
This reverts commit r218918, effectively reapplying r218914 after fixing
an Ocaml bindings test and an Asan crash.  The root cause of the latter
was a tightened-up check in `DILexicalBlock::Verify()`, so I'll file a
PR to investigate who requires the loose check (and why).

Original commit message follows.

--

This patch addresses the first stage of PR17891 by folding constant
arguments together into a single MDString.  Integers are stringified and
a `\0` character is used as a separator.

Part of PR17891.

Note: I've attached my testcases upgrade scripts to the PR.  If I've
just broken your out-of-tree testcases, they might help.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219010 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 20:01:09 +00:00
Adam Nemet
726942c8bb [ISel] Keep matching state consistent when folding during X86 address match
In the X86 backend, matching an address is initiated by the 'addr' complex
pattern and its friends.  During this process we may reassociate and-of-shift
into shift-of-and (FoldMaskedShiftToScaledMask) to allow folding of the
shift into the scale of the address.

However as demonstrated by the testcase, this can trigger CSE of not only the
shift and the AND which the code is prepared for but also the underlying load
node.  In the testcase this node is sitting in the RecordedNode and MatchScope
data structures of the matcher and becomes a deleted node upon CSE.  Returning
from the complex pattern function, we try to access it again hitting an assert
because the node is no longer a load even though this was checked before.

Now obviously changing the DAG this late is bending the rules but I think it
makes sense somewhat.  Outside of addresses we prefer and-of-shift because it
may lead to smaller immediates (FoldMaskAndShiftToScale is an even better
example because it create a non-canonical node).  We currently don't recognize
addresses during DAGCombiner where arguably this canonicalization should be
performed.  On the other hand, having this in the matcher allows us to cover
all the cases where an address can be used in an instruction.

I've also talked a little bit to Dan Gohman on llvm-dev who added the RAUW for
the new shift node in FoldMaskedShiftToScaledMask.  This RAUW is responsible
for initiating the recursive CSE on users
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076903.html) but it
is not strictly necessary since the shift is hooked into the visited user.  Of
course it's safer to keep the DAG consistent at all times (e.g. for accurate
number of uses, etc.).

So rather than changing the fundamentals, I've decided to continue along the
previous patches and detect the CSE.  This patch installs a very targeted
DAGUpdateListener for the duration of a complex-pattern match and updates the
matching state accordingly.  (Previous patches used HandleSDNode to detect the
CSE but that's not practical here).  The listener is only installed on X86.

I tested that there is no measurable overhead due to this while running
through the spec2k BC files with llc.  The only thing we pay for is the
creation of the listener.  The callback never ever triggers in spec2k since
this is a corner case.

Fixes rdar://problem/18206171

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219009 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 20:00:34 +00:00
Chandler Carruth
dce98e6739 [x86] Teach the new vector shuffle lowering to aggressively form MOVSS
and MOVSD nodes for single element vector inserts.

This is particularly important because a number of patterns in the
backend detect these patterns and leverage them to simplify things. It
also fixes quite a few of the insertion bad code examples. However, it
regresses a specific area: when available, blendps and blendpd are
*dramatically* faster than movss and movsd respectively. But it doesn't
really work to form the blend logic first because the blends *aren't* as
crazy efficient when the data is coming from memory anyways, and thus
will have a movss or movsd regardless. Also, doing that would block
a bunch of the patterns that this is designed to hit.

So my plan is to go into the patterns for lowering MOVSS and MOVSD and
lower them via blends when available. However that's a pretty invasive
restructuring so it will need to be a follow-up patch.

I have already gone into the patterns to lower MOVSS and MOVSD from
memory using MOVLPD, etc. Without that, several of the test cases
I already have regress.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218985 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 13:11:13 +00:00
Chandler Carruth
f6fb5f4f2e [x86] Fix the RUN-lines of this test to make sense.
I got them quite wrong when updating it and had the SSE4.1 run checked
for SSE2 and the SSE2 run checked for SSE4.1. I think everything was
actually generic SSE, but this still seems good to fix. While here,
hoist the triple into the IR and make the flag set a bit more direct in
what it is trying to test.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218978 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 11:30:02 +00:00
Chandler Carruth
01b3858e66 [x86] Significantly improve the ability of the new vector shuffle
lowering to match VZEXT_MOVL patterns.

I hadn't realized that these had sufficient pattern smarts in the
backend to lower zext-ing from the low element of a vector without it
being a scalar_to_vector node. They do, and this is how to match a bunch
of patterns for movq, movss, etc.

There is a weird propensity to end up using pshufd to place the element
afterward even though it means domain crossing (or rather, to use
xorps+movss to zext the element rather than movq) but that's an
orthogonal problem with VZEXT_MOVL that someone should probably look at.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218977 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-03 11:25:58 +00:00