- Deciding that insn->sibIndex is SIB_INDEX_NONE does not require another
check beyond the fully decoded bits being equal to 0x4.
The expression insn->sibIndex == SIB_INDEX_sib could not have been true unless
index were 0x4, because SIB_INDEX_sib is merely the range base (SIB_INDEX_EAX)
plus 4. Respectively SIB_INDEX_sib64.
- Don't use a switch statement to perform left-shift.
Differential Revision: http://reviews.llvm.org/D9762
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240598 91177308-0d34-0410-b5e6-96231b3b80d8
COFF and MachO only define symbol sizes for common symbols. Reflect that
in the class hierarchy by having a method for common symbols only in the base
and a general one in ELF.
This avoids the need of using a magic value for the size, which had a few
problems
* Most callers didn't check for it.
* The ones that did could not tell the magic value from a file actually having
that value.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240529 91177308-0d34-0410-b5e6-96231b3b80d8
We used to erroneously match:
(v4i64 shuffle (v2i64 load), <0,0,0,0>)
Whereas vbroadcasti128 is more like:
(v4i64 shuffle (v2i64 load), <0,1,0,1>)
This problem doesn't exist for vbroadcastf128, which kept matching
the intrinsic after r231182. We should perhaps re-introduce the
intrinsic here as well, but that's a separate issue still being
discussed.
While there, add some proper vbroadcastf128 tests. We don't currently
match those, like for loading vbroadcastsd/ss on AVX (the reg-reg
broadcasts where added in AVX2).
Fixes PR23886.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240488 91177308-0d34-0410-b5e6-96231b3b80d8
The summary is that it moves the mangling earlier and replaces a few
calls to .addExternalSymbol with addSym.
I originally wanted to replace all the uses of addExternalSymbol with
addSym, but noticed it was a lot of work and doesn't need to be done
all at once.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240395 91177308-0d34-0410-b5e6-96231b3b80d8
Currently ( D10321, http://reviews.llvm.org/rL239486 ), we can use the machine combiner pass
to reassociate the following sequence to reduce the critical path:
A = ? op ?
B = A op X
C = B op Y
-->
A = ? op ?
B = X op Y
C = A op B
'op' is currently limited to x86 AVX scalar FP adds (with fast-math on), but in theory, it could
be any associative math/logic op (see TODO in code comment).
This patch generalizes the pattern match to ignore the instruction that defines 'A'. So instead of
a sequence of 3 adds, we now only need to find 2 dependent adds and decide if it's worth
reassociating them.
This generalization has a compile-time cost because we can now match more instruction sequences
and we rely more heavily on the machine combiner to discard sequences where reassociation doesn't
improve the critical path.
For example, in the new test case:
A = M div N
B = A add X
C = B add Y
We'll match 2 reassociation patterns, but this transform doesn't reduce the critical path:
A = M div N
B = A add Y
C = B add X
We need the combiner to reject that pattern but select this:
A = M div N
B = X add Y
C = B add A
Differential Revision: http://reviews.llvm.org/D10460
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240361 91177308-0d34-0410-b5e6-96231b3b80d8
The _Int instructions are special, in that they operate on the full
VR128 instead of FR32. The load folding then looks at MOVSS, at the
user, and bails out when it sees a size mismatch.
What we really know is that the rm_Int instructions don't load the
higher lanes, so folding is fine.
This happens for the straightforward intrinsic code, e.g.:
_mm_add_ss(a, _mm_load_ss(p));
Fixes PR23349.
Differential Revision: http://reviews.llvm.org/D10554
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240326 91177308-0d34-0410-b5e6-96231b3b80d8
D8982 ( checked in at http://reviews.llvm.org/rL239001 ) added command-line
options to allow reciprocal estimate instructions to be used in place of
divisions and square roots.
This patch changes the default settings for x86 targets to allow that recip
codegen (except for scalar division because that breaks too much code) when
using -ffast-math or its equivalent.
This matches GCC behavior for this kind of codegen.
Differential Revision: http://reviews.llvm.org/D10396
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240310 91177308-0d34-0410-b5e6-96231b3b80d8
Before this we were producing a TargetExternalSymbol from a MCSymbol.
That meant extracting the symbol name and fetching the symbol again
down the pipeline.
This patch adds a DAG.getMCSymbol that lets the MCSymbol pass unchanged on the
DAG.
Doing so removes the need for MO_NOPREFIX and fixes the root cause of pr23900,
allowing r240130 to be committed again.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240300 91177308-0d34-0410-b5e6-96231b3b80d8
This allows more call sequences to use pushes instead of movs when optimizing for size.
In particular, calling conventions that pass some parameters in registers (e.g. thiscall) are now supported.
Differential Revision: http://reviews.llvm.org/D10500
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240257 91177308-0d34-0410-b5e6-96231b3b80d8
This patch changes getRelocationAddend to use ErrorOr and considers it an error
to try to get the addend of a REL section.
If, for example, a x86_64 file has a REL section, that file is corrupted and
we should reject it.
Using ErrorOr is not ideal since we check the section type once per relocation
instead of once per section.
Checking once per section would involve getRelocationAddend just asserting and
callers checking the section before iterating over the relocations.
In any case, this is an improvement and includes a test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240176 91177308-0d34-0410-b5e6-96231b3b80d8
The patch is generated using this command:
tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py -fix \
-checks=-*,llvm-namespace-comment -header-filter='llvm/.*|clang/.*' \
llvm/lib/
Thanks to Eugene Kosov for the original patch!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240137 91177308-0d34-0410-b5e6-96231b3b80d8
Deduplicates some code and lets us use LEA on atom when adjusting the
stack around callee-cleanup calls. This is the only intended
functionality change.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240044 91177308-0d34-0410-b5e6-96231b3b80d8
Added explicit sign extension for v4i16/v8i16 to v4i32/v8i32 before conversion to floats. Matches existing support for v4i8/v8i8.
Follow up to D10433
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239966 91177308-0d34-0410-b5e6-96231b3b80d8
There is a one-to-one relationship between X86Subtarget and
X86FrameLowering, but every frame lowering method would previously pull
the subtarget off the MachineFunction and query some subtarget
properties.
Over time, these locals began to grow in complexity and it became
important to keep their names and meaning in sync across all of the
frame lowering methods, leading to duplication. We can eliminate that
duplication by computing them once in the constructor.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239948 91177308-0d34-0410-b5e6-96231b3b80d8
The personality routine currently lives in the LandingPadInst.
This isn't desirable because:
- All LandingPadInsts in the same function must have the same
personality routine. This means that each LandingPadInst beyond the
first has an operand which produces no additional information.
- There is ongoing work to introduce EH IR constructs other than
LandingPadInst. Moving the personality routine off of any one
particular Instruction and onto the parent function seems a lot better
than have N different places a personality function can sneak onto an
exceptional function.
Differential Revision: http://reviews.llvm.org/D10429
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239940 91177308-0d34-0410-b5e6-96231b3b80d8
This patch enables support for the conversion of v2i32 to v2f64 to use the CVTDQ2PD xmm instruction and stay on the SSE unit instead of scalarizing, sign extending to i64 and using CVTSI2SDQ scalar conversions.
Differential Revision: http://reviews.llvm.org/D10433
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239855 91177308-0d34-0410-b5e6-96231b3b80d8
Old names, new names, and what they really mean:
- IsWin64 -> IsWin64CC: This is true on non-Windows x86_64 platforms
when the ms_abi calling convention is used.
- IsWinEH -> IsWin64Prologue: True when the target is Win64, regardless
of calling convention. Changes the prologue to obey the constraints of
the Win64 unwinder.
- NeedsWinEH -> NeedsWinCFI: We're using the win64 prologue *and* the we
want .xdata unwind tables. Analogous to NeedsDwarfCFI.
NFC
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239836 91177308-0d34-0410-b5e6-96231b3b80d8
When we multiply two 64-bit vectors, we extract lower and upper part and use the PMULUDQ instruction.
When one of the operands is a constant, the upper part may be zero, we know this at compile time.
Example: %a = mul <4 x i64> %b, <4 x i64> < i64 5, i64 5, i64 5, i64 5>.
I'm checking the value of the upper part and prevent redundant "multiply", "shift" and "add" operations.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239802 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
NFC: no one uses AnalyzeBranchPredicate yet.
Add TargetInstrInfo::AnalyzeBranchPredicate and implement for x86. A
later change adding support for page-fault based implicit null checks
depends on this.
Reviewers: reames, ab, atrick
Reviewed By: atrick
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D10200
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239742 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
TargetInstrInfo::getLdStBaseRegImmOfs to
TargetInstrInfo::getMemOpBaseRegImmOfs and implement for x86. The
implementation only handles a few easy cases now and will be made more
sophisticated in the future.
This is NFCI: the only user of `getLdStBaseRegImmOfs` (now
`getmemOpBaseRegImmOfs`) is `LoadClusterMotion` and `LoadClusterMotion`
is disabled for x86.
Reviewers: reames, ab, MatzeB, atrick
Reviewed By: MatzeB, atrick
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D10199
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239741 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This instruction encodes a loading operation that may fault, and a label
to branch to if the load page-faults. The locations of potentially
faulting loads and their "handler" destinations are recorded in a
FaultMap section, meant to be consumed by LLVM's clients.
Nothing generates FAULTING_LOAD_OP instructions yet, but they will be
used in a future change.
The documentation (FaultMaps.rst) needs improvement and I will update
this diff with a more expanded version shortly.
Depends on D10196
Reviewers: rnk, reames, AndyAyers, ab, atrick, pgavlin
Reviewed By: atrick, pgavlin
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D10197
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239740 91177308-0d34-0410-b5e6-96231b3b80d8
We were putting them in the filter field, which is correct for 64-bit
but wrong for 32-bit.
Also switch the order of scope table entry emission so outermost entries
are emitted first, and fix an obvious state assignment bug.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239574 91177308-0d34-0410-b5e6-96231b3b80d8
Remove the EFLAGS from the stackmap live-out mask. The EFLAGS register is not
supposed to be part of that set, because the X86 calling conventions mark the
register as NOT preserved.
Also remove the IP registers, since spilling and restoring those doesn't really
make any sense.
Related to rdar://problem/21019635.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239568 91177308-0d34-0410-b5e6-96231b3b80d8
This intrinsic is like framerecover plus a load. It recovers the EH
registration stack allocation from the parent frame and loads the
exception information field out of it, giving back a pointer to an
EXCEPTION_POINTERS struct. It's designed for clang to use in SEH filter
expressions instead of accessing the EXCEPTION_POINTERS parameter that
is available on x64.
This required a minor change to MC to allow defining a label variable to
another absolute framerecover label variable.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239567 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
For the moment, TargetMachine::getTargetTriple() still returns a StringRef.
This continues the patch series to eliminate StringRef forms of GNU triples
from the internals of LLVM that began in r239036.
Reviewers: rengolin
Reviewed By: rengolin
Subscribers: ted, llvm-commits, rengolin, jholewinski
Differential Revision: http://reviews.llvm.org/D10362
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239554 91177308-0d34-0410-b5e6-96231b3b80d8
This patch ensures that SHL/SRL/SRA shifts for i8 and i16 vectors avoid scalarization. It builds on the existing i8 SHL vectorized implementation of moving the shift bits up to the sign bit position and separating the 4, 2 & 1 bit shifts with several improvements:
1 - SSE41 targets can use (v)pblendvb directly with the sign bit instead of performing a comparison to feed into a VSELECT node.
2 - pre-SSE41 targets were masking + comparing with an 0x80 constant - we avoid this by using the fact that a set sign bit means a negative integer which can be compared against zero to then feed into VSELECT, avoiding the need for a constant mask (zero generation is much cheaper).
3 - SRA i8 needs to be unpacked to the upper byte of a i16 so that the i16 psraw instruction can be correctly used for sign extension - we have to do more work than for SHL/SRL but perf tests indicate that this is still beneficial.
The i16 implementation is similar but simpler than for i8 - we have to do 8, 4, 2 & 1 bit shifts but less shift masking is involved. SSE41 use of (v)pblendvb requires that the i16 shift amount is splatted to both bytes however.
Tested on SSE2, SSE41 and AVX machines.
Differential Revision: http://reviews.llvm.org/D9474
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239509 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r239437.
This broke clang-cl self-hosts. We'd end up calling the __imp_ symbol
directly instead of using it to do an indirect function call.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239502 91177308-0d34-0410-b5e6-96231b3b80d8
This is a reimplementation of D9780 at the machine instruction level rather than the DAG.
Use the MachineCombiner pass to reassociate scalar single-precision AVX additions (just a
starting point; see the TODO comments) to increase ILP when it's safe to do so.
The code is closely based on the existing MachineCombiner optimization that is implemented
for AArch64.
This patch should not cause the kind of spilling tragedy that led to the reversion of r236031.
Differential Revision: http://reviews.llvm.org/D10321
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239486 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This continues the patch series to eliminate StringRef forms of GNU triples
from the internals of LLVM that began in r239036.
Reviewers: rafael
Reviewed By: rafael
Subscribers: rafael, ted, jfb, llvm-commits, rengolin, jholewinski
Differential Revision: http://reviews.llvm.org/D10311
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239467 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This continues the patch series to eliminate StringRef forms of GNU triples
from the internals of LLVM that began in r239036.
Reviewers: rafael
Reviewed By: rafael
Subscribers: rafael, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D10307
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239465 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This continues the patch series to eliminate StringRef forms of GNU triples
from the internals of LLVM that began in r239036.
Reviewers: echristo, rafael
Reviewed By: rafael
Subscribers: rafael, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D10243
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239464 91177308-0d34-0410-b5e6-96231b3b80d8
We have to do this manually, the runtime only sets up ebp. Fixes a crash
when returning after catching an exception.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239451 91177308-0d34-0410-b5e6-96231b3b80d8
Use a "safeseh" string attribute to do this. You would think we chould
just accumulate the set of personalities like we do on dwarf, but this
fails to account for the LSDA-loading thunks we use for
__CxxFrameHandler3. Each of those needs to make it into .sxdata as well.
The string attribute seemed like the most straightforward approach.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239448 91177308-0d34-0410-b5e6-96231b3b80d8
This gets all the handler info through to the asm printer and we can
look at the .xdata tables now. I've convinced one small catch-all test
case to work, but other than that, it would be a stretch to say this is
functional.
The state numbering algorithm avoids doing any scope reconstruction as
we do for C++ to simplify the implementation.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239433 91177308-0d34-0410-b5e6-96231b3b80d8
that was resetting it.
Remove the uses of DisableTailCalls in subclasses of TargetLowering and use
the value of function attribute "disable-tail-calls" instead. Also,
unconditionally add pass TailCallElim to the pipeline and check the function
attribute at the start of runOnFunction to disable the pass on a per-function
basis.
This is part of the work to remove TargetMachine::resetTargetOptions, and since
DisableTailCalls was the last non-fast-math option that was being reset in that
function, we should be able to remove the function entirely after the work to
propagate IR-level fast-math flags to DAG nodes is completed.
Out-of-tree users should remove the uses of DisableTailCalls and make changes
to attach attribute "disable-tail-calls"="true" or "false" to the functions in
the IR.
rdar://problem/13752163
Differential Revision: http://reviews.llvm.org/D10099
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239427 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This was a longstanding FIXME and is a necessary precursor to cases
where foldOperandImpl may have to create more than one instruction
(e.g. to constrain a register class). This is the split out NFC changes from
D6262.
Reviewers: pete, ributzka, uweigand, mcrosier
Reviewed By: mcrosier
Subscribers: mcrosier, ted, llvm-commits
Differential Revision: http://reviews.llvm.org/D10174
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239336 91177308-0d34-0410-b5e6-96231b3b80d8
While we have some code to transform specification like {ax} into
{eax}/{rax} if the operand type isn't 16bit, we should reject cases
where there is no sane way to do this, like the i128 type in the
example.
Related to rdar://21042280
Differential Revision: http://reviews.llvm.org/D10260
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239309 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
A small bit that I missed when I updated the X86 backend to account for
the Win64 calling convention on non-Windows. Now we don't use dead
non-volatile registers when emitting a Win64 indirect tail call on
non-Windows.
Should fix PR23710.
Test Plan: Added test for the correct behavior based on the case I posted to PR23710.
Reviewers: rnk
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D10258
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239111 91177308-0d34-0410-b5e6-96231b3b80d8
Fix the FIXME and remove this old as(1) compat option. It was useful for
bringup of the integrated assembler to diff object files, but now it's
just causing more relocations than strictly necessary to be generated.
rdar://21201804
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239084 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This is the first of several patches to eliminate StringRef forms of GNU
triples from the internals of LLVM. After this is complete, GNU triples
will be replaced by a more authoratitive representation in the form of
an LLVM TargetTuple.
Reviewers: rengolin
Reviewed By: rengolin
Subscribers: ted, llvm-commits, rengolin, jholewinski
Differential Revision: http://reviews.llvm.org/D10236
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239036 91177308-0d34-0410-b5e6-96231b3b80d8
The first try (r238051) to land this was reverted due to ExecutionEngine build failure;
that was hopefully addressed by r238788.
The second try (r238842) to land this was reverted due to BUILD_SHARED_LIBS failure;
that was hopefully addressed by r238953.
This patch adds a TargetRecip class for processing many recip codegen possibilities.
The class is intended to handle both command-line options to llc as well
as options passed in from a front-end such as clang with the -mrecip option.
The x86 backend is updated to use the new functionality.
Only -mcpu=btver2 with -ffast-math should see a functional change from this patch.
All other x86 CPUs continue to *not* use reciprocal estimates by default with -ffast-math.
Differential Revision: http://reviews.llvm.org/D8982
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239001 91177308-0d34-0410-b5e6-96231b3b80d8
Intel® Memory Protection Extensions (Intel® MPX) is a new feature in Skylake.
It is a part of KNL and SKX sets. It is also a part of Skylake client.
I added definition of %bnd0 - %bnd3 registers, each register is a pair of 64-bit integers.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238916 91177308-0d34-0410-b5e6-96231b3b80d8
This patch removes the old X86ISD::FSRL op - which allowed float vectors to use the byte right shift operations (causing a domain switch....).
Since the refactoring of the shuffle lowering code this no longer has any use.
Differential Revision: http://reviews.llvm.org/D10169
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238906 91177308-0d34-0410-b5e6-96231b3b80d8
The first try (r238051) to land this was reverted due to bot failures
that were hopefully addressed by r238788.
This patch adds a TargetRecip class for processing many recip codegen possibilities.
The class is intended to handle both command-line options to llc as well
as options passed in from a front-end such as clang with the -mrecip option.
The x86 backend is updated to use the new functionality.
Only -mcpu=btver2 with -ffast-math should see a functional change from this patch.
All other x86 CPUs continue to *not* use reciprocal estimates by default with -ffast-math.
Differential Revision: http://reviews.llvm.org/D8982
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238842 91177308-0d34-0410-b5e6-96231b3b80d8
using lowerVectorShuffleWithSHUFPS() and other shuffle-helpers routines.
Added matching of VALIGN instruction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238830 91177308-0d34-0410-b5e6-96231b3b80d8
This is important because of different addressing modes
depending on the address space for GPU targets.
This only adds the argument, and does not update
any of the uses to provide the correct address space.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238723 91177308-0d34-0410-b5e6-96231b3b80d8
best approach of each.
For vNi16, we use SHL + ADD + SRL pattern that seem easily the best.
For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput!!!! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both are the exact same uops though. Hard to be confident of
anything past that.
If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHl
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with both SSE3, SSSE3, and AVX selected as
I saw unique behavior in each of these buckets with IACA all of which
should be checked against measured performance.
But this patch is still a useful improvement by dropping duplicate work
and getting the much nicer PSADBW lowering for v2i64.
I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238652 91177308-0d34-0410-b5e6-96231b3b80d8
shorter one. NFC.
In addition to being much shorter to type and requiring fewer arguments,
this change saves over 30 lines from this one file, all wasted on total
boilerplate...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238640 91177308-0d34-0410-b5e6-96231b3b80d8
around a value using its existing SDLoc.
Start using this in just one function to save omg lines of code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238638 91177308-0d34-0410-b5e6-96231b3b80d8
shifting vectors of bytes as x86 doesn't have direct support for that.
This removes a bunch of redundant masking in the generated code for SSE2
and SSE3.
In order to avoid the really significant code size growth this would
have triggered, I also factored the completely repeatative logic for
shifting and masking into two lambdas which in turn makes all of this
much easier to read IMO.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238637 91177308-0d34-0410-b5e6-96231b3b80d8
in-register LUT technique.
Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html
The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
The base implemantion of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.
I have extended Bruno's patch in the following ways:
0) I committed the new tests with baseline sequences so this shows
a diff, and regenerated the tests using the update scripts.
1) Bruno had noticed and mentioned in IRC a redundant mask that
I removed.
2) I introduced a particular optimization for the i32 vector cases where
we use PSHL + PSADBW to compute the the low i32 popcounts, and PSHUFD
+ PSADBW to compute doubled high i32 popcounts. This takes advantage
of the fact that to line up the high i32 popcounts we have to shift
them anyways, and we can shift them by one fewer bit to effectively
divide the count by two. While the PSHUFD based horizontal add is no
faster, it doesn't require registers or load traffic the way a mask
would, and provides more ILP as it happens on different ports with
high throughput.
3) I did some code cleanups throughout to simplify the implementation
logic.
4) I refactored it to continue to use the parallel bitmath lowering when
SSSE3 is not available to preserve the performance of that version on
SSE2 targets where it is still much better than scalarizing as we'll
still do a bitmath implementation of popcount even in scalar code
there.
With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.
With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.
Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.
Differential Revision: http://reviews.llvm.org/D10084
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238636 91177308-0d34-0410-b5e6-96231b3b80d8
a separate routine, generalize it to work for all the integer vector
sizes, and do general code cleanups.
This dramatically improves lowerings of byte and short element vector
popcount, but more importantly it will make the introduction of the
LUT-approach much cleaner.
The biggest cleanup I've done is to just force the legalizer to do the
bitcasting we need. We run these iteratively now and it makes the code
much simpler IMO. Other changes were minor, and mostly naming and
splitting things up in a way that makes it more clear what is going on.
The other significant change is to use a different final horizontal sum
approach. This is the same number of instructions as the old method, but
shifts left instead of right so that we can clear everything but the
final sum with a single shift right. This seems likely better than
a mask which will usually have to read the mask from memory. It is
certaily fewer u-ops. Also, this will be temporary. This and the LUT
approach share the need of horizontal adds to finish the computation,
and we have more clever approaches than this one that I'll switch over
to.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238635 91177308-0d34-0410-b5e6-96231b3b80d8
It turns out that _except_handler3 and _except_handler4 really use the
same stack allocation layout, at least today. They just make different
choices about encoding the LSDA.
This is in preparation for lowering the llvm.eh.exceptioninfo().
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238627 91177308-0d34-0410-b5e6-96231b3b80d8
The value in 'ebp' acts as an implicit argument to the outlined
handlers, and is recovered with frameaddress(1).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238619 91177308-0d34-0410-b5e6-96231b3b80d8
Small (really small!) C++ exception handling examples work on 32-bit x86
now.
This change disables the use of .seh_* directives in WinException when
CFI is not in use. It also uses absolute symbol references in the tables
instead of imagerel32 relocations.
Also fixes a cache invalidation bug in MMI personality classification.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238575 91177308-0d34-0410-b5e6-96231b3b80d8
MIOperands/ConstMIOperands are classes iterating over the MachineOperand
of a MachineInstr, however MachineInstr::mop_iterator does the same
thing.
I assume these two iterators exist to have a uniform interface to
iterate over the operands of a machine instruction bundle and a single
machine instruction. However in practice I find it more confusing to have 2
different iterator classes, so this patch transforms (nearly all) the
code to use mop_iterators.
The only exception being MIOperands::anlayzePhysReg() and
MIOperands::analyzeVirtReg() still needing an equivalent, I leave that
as an exercise for the next patch.
Differential Revision: http://reviews.llvm.org/D9932
This version is slightly modified from the proposed revision in that it
introduces MachineInstr::getOperandNo to avoid the extra counting
variable in the few loops that previously used MIOperands::getOperandNo.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238539 91177308-0d34-0410-b5e6-96231b3b80d8
This moves all the state numbering code for C++ EH to WinEHPrepare so
that we can call it from the X86 state numbering IR pass that runs
before isel.
Now we just call the same state numbering machinery and insert a bunch
of stores. It also populates MachineModuleInfo with information about
the current function.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238514 91177308-0d34-0410-b5e6-96231b3b80d8
For x86 targets, do not do sibling call optimization when materializing
the callee's address would require a GOT relocation. We can still do
tail calls to internal functions, hidden functions, and protected
functions, because they do not require this kind of relocation. It is
still possible to get GOT relocations when the user explicitly asks for
it with musttail or -tailcallopt, both of which are supposed to
guarantee TCO.
Based on a patch by Chih-hung Hsieh.
Reviewers: srhines, timmurray, danalbert, enh, void, nadav, rnk
Subscribers: joerg, davidxl, llvm-commits
Differential Revision: http://reviews.llvm.org/D9799
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238487 91177308-0d34-0410-b5e6-96231b3b80d8
With this patch the x86 backend is now shrink-wrapping capable
and this functionality can be tested by using the
-enable-shrink-wrap switch.
The next step is to make more test and enable shrink-wrapping by
default for x86.
Related to <rdar://problem/20821487>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238293 91177308-0d34-0410-b5e6-96231b3b80d8
This gets gas and llc -filetype=obj to agree on the order of prefixes.
For llvm-mc we need to fix the asm parser to know that it makes a difference
on which line the "lock" is in.
Part of pr23594.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238232 91177308-0d34-0410-b5e6-96231b3b80d8
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first several times this was committed (e.g. r229831, r233055), it caused several buildbot failures.
Apparently the reason for most failures was both clang and gcc's inability to deal with large numbers (> 10K) of bitset constructor calls in tablegen-generated initializers of instruction info tables.
This should now be fixed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238192 91177308-0d34-0410-b5e6-96231b3b80d8
Part of D9474, this patch extends AVX2 v16i16 types to 2 x 8i32 vectors and uses i32 shift variable shifts before packing back to i16.
Adds AVX2 tests for v8i16 and v16i16
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238149 91177308-0d34-0410-b5e6-96231b3b80d8
The semantics of the scalar FMA intrinsics are that the high vector elements are copied from the first source.
The existing pattern switches src1 and src2 around, to match the "213" order, which ends up tying the original src2 to the dest. Since the actual scalar fma3 instructions copy the high elements from the dest register, the wrong values are copied.
This modifies the pattern to leave src1 and src2 in their original order.
Differential Revision: http://reviews.llvm.org/D9908
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238131 91177308-0d34-0410-b5e6-96231b3b80d8