both fallthrough and a conditional branch target the same successor.
Gracefully delete the conditional branch and introduce any unconditional
branch needed to reach the actual successor. This fixes memory
corruption in 2009-06-15-RegScavengerAssert.ll and possibly other tests.
Also, while I'm here fix a latent bug I spotted by inspection. I never
applied the same fundamental fix to this fallthrough successor finding
logic that I did to the logic used when there are no conditional
branches. As a consequence it would have selected landing pads had they
be aligned in just the right way here. I don't have a test case as
I spotted this by inspection, and the previous time I found this
required have of TableGen's source code to produce it. =/ I hate backend
bugs. ;]
Thanks to Jim Grosbach for helping me reason through this and reviewing
the fix.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154867 91177308-0d34-0410-b5e6-96231b3b80d8
This is mostly to test the waters. I'd like to get results from FNT
build bots and other bots running on non-x86 platforms.
This feature has been pretty heavily tested over the last few months by
me, and it fixes several of the execution time regressions caused by the
inlining work by preventing inlining decisions from radically impacting
block layout.
I've seen very large improvements in yacr2 and ackermann benchmarks,
along with the expected noise across all of the benchmark suite whenever
code layout changes. I've analyzed all of the regressions and fixed
them, or found them to be impossible to fix. See my email to llvmdev for
more details.
I'd like for this to be in 3.1 as it complements the inliner changes,
but if any failures are showing up or anyone has concerns, it is just
a flag flip and so can be easily turned off.
I'm switching it on tonight to try and get at least one run through
various folks' performance suites in case SPEC or something else has
serious issues with it. I'll watch bots and revert if anything shows up.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154816 91177308-0d34-0410-b5e6-96231b3b80d8
once we start changing the block layout, so just nuke it. If anyone has
ideas about how to craft a code layout agnostic form of the test please
let me know.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154815 91177308-0d34-0410-b5e6-96231b3b80d8
rotation. When there is a loop backedge which is an unconditional
branch, we will end up with a branch somewhere no matter what. Try
placing this backedge in a fallthrough position above the loop header as
that will definitely remove at least one branch from the loop iteration,
where whole loop rotation may not.
I haven't seen any benchmarks where this is important but loop-blocks.ll
tests for it, and so this will be covered when I flip the default.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154812 91177308-0d34-0410-b5e6-96231b3b80d8
laid out in a form with a fallthrough into the header and a fallthrough
out of the bottom. In that case, leave the loop alone because any
rotation will introduce unnecessary branches. If either side looks like
it will require an explicit branch, then the rotation won't add any, do
it to ensure the branch occurs outside of the loop (if possible) and
maximize the benefit of the fallthrough in the bottom.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154806 91177308-0d34-0410-b5e6-96231b3b80d8
This is a complex change that resulted from a great deal of
experimentation with several different benchmarks. The one which proved
the most useful is included as a test case, but I don't know that it
captures all of the relevant changes, as I didn't have specific
regression tests for each, they were more the result of reasoning about
what the old algorithm would possibly do wrong. I'm also failing at the
moment to craft more targeted regression tests for these changes, if
anyone has ideas, it would be welcome.
The first big thing broken with the old algorithm is the idea that we
can take a basic block which has a loop-exiting successor and a looping
successor and use the looping successor as the layout top in order to
get that particular block to be the bottom of the loop after layout.
This happens to work in many cases, but not in all.
The second big thing broken was that we didn't try to select the exit
which fell into the nearest enclosing loop (to which we exit at all). As
a consequence, even if the rotation worked perfectly, it would result in
one of two bad layouts. Either the bottom of the loop would get
fallthrough, skipping across a nearer enclosing loop and thereby making
it discontiguous, or it would be forced to take an explicit jump over
the nearest enclosing loop to earch its successor. The point of the
rotation is to get fallthrough, so we need it to fallthrough to the
nearest loop it can.
The fix to the first issue is to actually layout the loop from the loop
header, and then rotate the loop such that the correct exiting edge can
be a fallthrough edge. This is actually much easier than I anticipated
because we can handle all the hard parts of finding a viable rotation
before we do the layout. We just store that, and then rotate after
layout is finished. No inner loops get split across the post-rotation
backedge because we check for them when selecting the rotation.
That fix exposed a latent problem with our exitting block selection --
we should allow the backedge to point into the middle of some inner-loop
chain as there is no real penalty to it, the whole point is that it
*won't* be a fallthrough edge. This may have blocked the rotation at all
in some cases, I have no idea and no test case as I've never seen it in
practice, it was just noticed by inspection.
Finally, all of these fixes, and studying the loops they produce,
highlighted another problem: in rotating loops like this, we sometimes
fail to align the destination of these backwards jumping edges. Fix this
by actually walking the backwards edges rather than relying on loopinfo.
This fixes regressions on heapsort if block placement is enabled as well
as lots of other cases where the previous logic would introduce an
abundance of unnecessary branches into the execution.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154783 91177308-0d34-0410-b5e6-96231b3b80d8
Original message:
Modify the code that lowers shuffles to blends from using blendvXX to vblendXX.
blendV uses a register for the selection while Vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154483 91177308-0d34-0410-b5e6-96231b3b80d8
blendv uses a register for the selection while vblend uses an immediate.
On sandybridge they still have the same latency and execute on the same execution ports.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154396 91177308-0d34-0410-b5e6-96231b3b80d8
x86 addressing modes. This allows PIE-based TLS offsets to fit directly
into an addressing mode immediate offset, which is the last remaining
code quality issue from PR12380. With this patch, that PR is completely
fixed.
To understand why this patch is correct to match these offsets into
addressing mode immediates, break it down by cases:
1) 32-bit is trivially correct, and unmodified here.
2) 64-bit non-small mode is unchanged and never matches.
3) 64-bit small PIC code which is RIP-relative is handled specially in
the match to try to fit RIP into the base register. If it fails, it
now early exits. This behavior is unchanged by the patch.
4) 64-bit small non-PIC code which is not RIP-relative continues to work
as it did before. The reason these immediates are safe is because the
ABI ensures they fit in small mode. This behavior is unchanged.
5) 64-bit small PIC code which is *not* using RIP-relative addressing.
This is the only case changed by the patch, and the primary place you
see it is in TLS, either the win64 section offset TLS or Linux
local-exec TLS model in a PIC compilation. Here the ABI again ensures
that the immediates fit because we are in small mode, and any other
operations required due to the PIC relocation model have been handled
externally to the Wrapper node (extra loads etc are made around the
wrapper node in ISelLowering).
I've tested this as much as I can comparing it with GCC's output, and
everything appears safe. I discussed this with Anton and it made sense
to him at least at face value. That said, if there are issues with PIC
code after this patch, yell and we can revert it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154304 91177308-0d34-0410-b5e6-96231b3b80d8
comprehensive testing of TLS codegen for x86. Convert all of the ones
that were still using grep to use FileCheck. Remove some redundancies
between them.
Perhaps most interestingly expand the test cases so that they actually
fully list the instruction snippet being tested. TLS operations are
*very* narrowly defined, and so these seem reasonably stable. More
importantly, the existing test cases already were crazy fine grained,
expecting specific registers to be allocated. This just clarifies that
no *other* instructions are expected, and fills in some crucial gaps
that weren't being tested at all.
This will make any subsequent changes to TLS much more clear during
review.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154303 91177308-0d34-0410-b5e6-96231b3b80d8
when -ffast-math, i.e. don't just always do it if the reciprocal can
be formed exactly. There is already an IR level transform that does
that, and it does it more carefully.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154296 91177308-0d34-0410-b5e6-96231b3b80d8
optimizations which are valid for position independent code being linked
into a single executable, but not for such code being linked into
a shared library.
I discussed the design of this with Eric Christopher, and the decision
was to support an optional bit rather than a completely separate
relocation model. Fundamentally, this is still PIC relocation, its just
that certain optimizations are only valid under a PIC relocation model
when the resulting code won't be in a shared library. The simplest path
to here is to expose a single bit option in the TargetOptions. If folks
have different/better designs, I'm all ears. =]
I've included the first optimization based upon this: changing TLS
models to the *Exec models when PIE is enabled. This is the LLVM
component of PR12380 and is all of the hard work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154294 91177308-0d34-0410-b5e6-96231b3b80d8
Previously we used three instructions to broadcast an immediate value into a
vector register.
On Sandybridge we continue to load the broadcasted value from the constant pool.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154284 91177308-0d34-0410-b5e6-96231b3b80d8
shuffle node because it could introduce new shuffle nodes that were not
supported efficiently by the target.
2. Add a more restrictive shuffle-of-shuffle optimization for cases where the
second shuffle reverses the transformation of the first shuffle.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154266 91177308-0d34-0410-b5e6-96231b3b80d8
reciprocal if converting to the reciprocal is exact. Do it even if inexact
if -ffast-math. This substantially speeds up ac.f90 from the polyhedron
benchmarks.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154265 91177308-0d34-0410-b5e6-96231b3b80d8
by default.
This is a behaviour configurable in the MCAsmInfo. I've decided to turn
it on by default in (possibly optimistic) hopes that most assemblers are
reasonably sane. If this proves a problem, switching to default seems
reasonable.
I'm not sure if this is the opportune place to test, but it seemed good
to make sure it was tested somewhere.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154235 91177308-0d34-0410-b5e6-96231b3b80d8
LSR always tries to make the ICmp in the loop latch use the incremented
induction variable. This allows the induction variable to be kept in a
single register.
When the induction variable limit is equal to the stride,
SimplifySetCC() would break LSR's hard work by transforming:
(icmp (add iv, stride), stride) --> (cmp iv, 0)
This forced us to use lea for the IC update, preventing the simpler
incl+cmp.
<rdar://problem/7643606>
<rdar://problem/11184260>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@154119 91177308-0d34-0410-b5e6-96231b3b80d8
This is just the fallback tie-breaker ordering, the main allocation
order is still descending size.
Patch by Shamil Kurmangaleev!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153904 91177308-0d34-0410-b5e6-96231b3b80d8
Do not try to optimize swizzles of shuffles if the source shuffle has more than
a single user, except when the source shuffle is also a swizzle.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153864 91177308-0d34-0410-b5e6-96231b3b80d8
This is the CodeGen equivalent of r153747. I tested that there is not noticeable
performance difference with any combination of -O0/-O2 /-g when compiling
gcc as a single compilation unit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153817 91177308-0d34-0410-b5e6-96231b3b80d8
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153635 91177308-0d34-0410-b5e6-96231b3b80d8
This is a code change to add support for changing instruction sequences of the form:
load
inc/dec of 8/16/32/64 bits
store
into the appropriate X86 inc/dec through memory instruction:
inc[qlwb] / dec[qlwb]
The checks that were in X86DAGToDAGISel::Select(SDNode *Node)>>ISD::STORE have been extracted to isLoadIncOrDecStore and reworked to use the better
named wrappers for getOperand(unsigned) (e.g. getOffset()) and replaced Chain.getNode() with LoadNode. The comments have also been expanded.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153617 91177308-0d34-0410-b5e6-96231b3b80d8
* Removed test/lib/llvm.exp - it is no longer needed
* Deleted the dg.exp reading code from test/lit.cfg. There are no dg.exp files
left in the test suite so this code is no longer required. test/lit.cfg is
now much shorter and clearer
* Removed a lot of duplicate code in lit.local.cfg files that need access to
the root configuration, by adding a "root" attribute to the TestingConfig
object. This attribute is dynamically computed to provide the same
information as was previously provided by the custom getRoot functions.
* Documented the config.root attribute in docs/CommandGuide/lit.pod
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153408 91177308-0d34-0410-b5e6-96231b3b80d8
This results in things such as
vmovups 16(%rdi), %xmm0
vinsertf128 $1, %xmm0, %ymm0, %ymm0
to be combined to
vinsertf128 $1, 16(%rdi), %ymm0, %ymm0
rdar://11076953
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153092 91177308-0d34-0410-b5e6-96231b3b80d8
i128). In that case, we may not be able to print out the MCExpr as an
expression. For instance, we could have an MCExpr like this:
0xBEEF0000BEEF0000 | (0xBEEF0000BEEF0000 << 64)
The MCExpr printer handles sizes up to 64-bits, but this expression would
require 128-bits. In this situation, try to evaluate the constant expression and
emit that as the value into 64-bit chunks.
<rdar://problem/11070338>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153081 91177308-0d34-0410-b5e6-96231b3b80d8
X86InstrCompiler.td.
It also adds –mcpu-generic to the legalize-shift-64.ll test so the test
will pass if run on an Intel Atom CPU, which would otherwise
produce an instruction schedule which differs from that which the test expects.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@153033 91177308-0d34-0410-b5e6-96231b3b80d8
This results in things such as
vmovaps -96(%rbx), %xmm1
vinsertf128 $1, %xmm1, %ymm0, %ymm0
to be combined to
vinsertf128 $1, -96(%rbx), %ymm0, %ymm0
rdar://10643481
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@152762 91177308-0d34-0410-b5e6-96231b3b80d8
Original commit message from r147481:
DAGCombine for transforming 128->256 casts into a vmovaps, rather
then a vxorps + vinsertf128 pair if the original vector came from a load.
Fix:
Unaligned loads need to generate a vmovups.
rdar://10974078
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@152366 91177308-0d34-0410-b5e6-96231b3b80d8
This was testing the handling of sub-register coalescing followed by
remat. The original problem was caused by the extra <imp-def> operands
added by sub-register coalescing. Those <imp-def> operands are not
added any longer, and the test case passes even when the original patch
is reverted.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@152040 91177308-0d34-0410-b5e6-96231b3b80d8
Some BBs can become dead after codegen preparation. If we delete them here, it
could help enable tail-call optimizations later on.
<rdar://problem/10256573>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@152002 91177308-0d34-0410-b5e6-96231b3b80d8
In this instance we are generating the tail-call during legalizeDAG. The 2nd
floor call can't be a tail call because it clobbers %xmm1, which is defined by
the first floor call. The first floor call can't be a tail-call because it's
not in the tail position. The only reasonable way I could think to fix this
in a target-independent manner was to check for glue logic on the copy reg.
rdar://10930395
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151877 91177308-0d34-0410-b5e6-96231b3b80d8
so that the test will not fail when run on an Intel Atom
processor, due to the Atom scheduler producing an instruction sequence that is
different from that which is normally expected.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151832 91177308-0d34-0410-b5e6-96231b3b80d8
To avoid problems with zero shifts when getting the bits that move between words
we use a trick: first shift the by amount-1, then do another shift by one. When
amount is 0 (and size 32) we first shift by 31, then by one, instead of by 32.
Also fix a latent bug that emitted the low and high words in the wrong order
when shifting right.
Fixes PR12113.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151637 91177308-0d34-0410-b5e6-96231b3b80d8
When the GEP index is a vector of pointers, the code that calculated the size
of the element started from the vector type, and not the contained pointer type.
As a result, instead of looking at the data element pointed by the vector, this
code used the size of the vector. This works for 32bit members (on 32bit
systems), but not for other types. Added code to peel the vector type and
added a test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151626 91177308-0d34-0410-b5e6-96231b3b80d8
[Joe Groff] Hi everyone. My previous patch applied as r151382 had a few problems:
Clang raised a warning, and X86 LowerOperation would assert out for
fptoui f64 to i32 because it improperly lowered to an illegal
BUILD_PAIR. Here's a patch that addresses these issues. Let me know if
any other changes are necessary. Thanks.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151432 91177308-0d34-0410-b5e6-96231b3b80d8
ecx = mov eax
al = mov ch
The second copy is not a nop because the sub-indices of ecx,ch is not the
same of that of eax/al.
Re-enabled machine-cp.
PR11940
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@151002 91177308-0d34-0410-b5e6-96231b3b80d8
Thanks to Anton, Duncan and Rafael for helping me track this down.
Pointy hat to Rafael for introducing the bug in the first place.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150811 91177308-0d34-0410-b5e6-96231b3b80d8
processor, due to the Atom scheduler producing an instruction sequence that is
different from that which is expected.
Patch by Michael Spencer!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150736 91177308-0d34-0410-b5e6-96231b3b80d8
that are greater than the vector element type. For example BUILD_VECTOR
of type <1 x i1> with a constant i8 operand.
This patch fixes the assertion.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150477 91177308-0d34-0410-b5e6-96231b3b80d8
If the DEC node had more than one user, it was doing this lowering but
leaving the original DEC node around and so decrementing twice.
Fixes PR11964.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150356 91177308-0d34-0410-b5e6-96231b3b80d8
v8i8 -> v8i32 on AVX machines. The codegen often scalarizes ANY_EXTEND nodes.
The DAGCombiner has two optimizations that can mitigate the problem. First,
if all of the operands of a BUILD_VECTOR node are extracted from an ZEXT/ANYEXT
nodes, then it is possible to create a new simplified BUILD_VECTOR which uses
UNDEFS/ZERO values to eliminate the scalar ZEXT/ANYEXT nodes.
Second, another dag combine optimization lowers BUILD_VECTOR into a shuffle
vector instruction.
In the case of zext v8i8->v8i32 on AVX, a value in an XMM register is to be
shuffled into a wide YMM register.
This patch modifes the second optimization and allows the creation of
shuffle vectors even when the newly generated vector and the original vector
from which we extract the values are of different types.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150340 91177308-0d34-0410-b5e6-96231b3b80d8
Creates a configurable regalloc pipeline.
Ensure specific llc options do what they say and nothing more: -reglloc=... has no effect other than selecting the allocator pass itself. This patch introduces a new umbrella flag, "-optimize-regalloc", to enable/disable the optimizing regalloc "superpass". This allows for example testing coalscing and scheduling under -O0 or vice-versa.
When a CodeGen pass requires the MachineFunction to have a particular property, we need to explicitly define that property so it can be directly queried rather than naming a specific Pass. For example, to check for SSA, use MRI->isSSA, not addRequired<PHIElimination>.
CodeGen transformation passes are never "required" as an analysis
ProcessImplicitDefs does not require LiveVariables.
We have a plan to massively simplify some of the early passes within the regalloc superpass.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@150226 91177308-0d34-0410-b5e6-96231b3b80d8
In this patch we optimize this pattern and convert the sequence into extract op of a narrow type.
This allows the BUILD_VECTOR dag optimizations to construct efficient shuffle operations in many cases.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149692 91177308-0d34-0410-b5e6-96231b3b80d8
Adds an instruction itinerary to all x86 instructions, giving each a default latency of 1, using the InstrItinClass IIC_DEFAULT.
Sets specific latencies for Atom for the instructions in files X86InstrCMovSetCC.td, X86InstrArithmetic.td, X86InstrControl.td, and X86InstrShiftRotate.td. The Atom latencies for the remainder of the x86 instructions will be set in subsequent patches.
Adds a test to verify that the scheduler is working.
Also changes the scheduling preference to "Hybrid" for i386 Atom, while leaving x86_64 as ILP.
Patch by Preston Gurd!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149558 91177308-0d34-0410-b5e6-96231b3b80d8
vectors of all one bits to be printed more cleverly in the AsmPrinter.
Unfortunately, the byte value for all one bits is the same with
-fsigned-char as the error return of '-1'. Force this to be the unsigned
byte value when returning it to avoid this problem, and update the test
case for the shiny new behavior.
Yay for building LLVM and Clang with -funsigned-char.
Chris, please review, and let me know if there is any reason to not
desire this change. It seems good on the surface, and certainly intended
based on the code written.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149299 91177308-0d34-0410-b5e6-96231b3b80d8
Unfortunately I also had to disable constant-pool-sharing.ll the code it tests has been
updated to use the IL logic.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149148 91177308-0d34-0410-b5e6-96231b3b80d8
The Win64 calling convention has xmm6-15 as callee-saved while still
clobbering all ymm registers.
Add a YMM_HI_6_15 pseudo-register that aliases the clobbered part of the
ymm registers, and mark that as call-clobbered. This allows live xmm
registers across calls.
This hack wouldn't be necessary with RegisterMask operands representing
the call clobbers, but they are not quite operational yet.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149088 91177308-0d34-0410-b5e6-96231b3b80d8
. "fptosi" and "fptoui" IR instructions are defined with round-to-zero rounding mode.
. Currently for AVX mode for <4xdouble> and <8xdouble> the "VCVTPD2DQ.128" and "VCVTPD2DQ.256" instructions are selected (for .fp_to_sint. DAG node operation ) by AVX codegen. However they use round-to-nearest-even rounding mode.
. Consequently, the conversion produces incorrect numbers.
The fix is to replace selection of VCVTPD2DQ instructions with VCVTTPD2DQ instructions. The latter use truncate (i.e. round-to-zero) rounding mode.
As .fp_to_sint. DAG node operation is used only for lowering of "fptosi" and "fptoui" IR instructions, the fix in X86InstrSSE.td definition file doesn.t have an impact on other LLVM flows.
The patch includes changes in the .td file, LIT test for the changes and a fix in a legacy LIT test (which produced asm code conflicting with LLVN IR spec).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@149056 91177308-0d34-0410-b5e6-96231b3b80d8
In CanXFormVExtractWithShuffleIntoLoad we assumed that EXTRACT_VECTOR_ELT can be later handled by the DAGCombiner.
However, in some cases on AVX, the EXTRACT_VECTOR_ELT is legalized to EXTRACT_SUBVECTOR + EXTRACT_VECTOR_ELT, which
currently is not handled by the DAGCombiner. In this patch I added a check that we only extract from the XMM part.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@148298 91177308-0d34-0410-b5e6-96231b3b80d8
We know that the blend instructions only use the MSB, so if the mask is
sign-extended then we can convert it into a SHL instruction. This is a
common pattern because the type-legalizer sign-extends the i1 type which
is used by the LLVM-IR for the condition.
Added a new optimization in SimplifyDemandedBits for SIGN_EXTEND_INREG -> SHL.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@148225 91177308-0d34-0410-b5e6-96231b3b80d8
lc: X86ISelLowering.cpp:6480: llvm::SDValue llvm::X86TargetLowering::LowerVECTOR_SHUFFLE(llvm::SDValue, llvm::SelectionDAG&) const: Assertion `V1.getOpcode() != ISD::UNDEF&& "Op 1 of shuffle should not be undef"' failed.
Added a test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@148044 91177308-0d34-0410-b5e6-96231b3b80d8
Uses the pvArbitrary slot of the TIB, which is reserved for applications. We
only support frames with a static size.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@148040 91177308-0d34-0410-b5e6-96231b3b80d8
When we load the v12i32 type, the GenWidenVectorLoads method generates two loads: v8i32 and v4i32
and attempts to use CONCAT_VECTORS to join them. In this fix I concat undef values to widen
the smaller value. The test "widen_load-2.ll" also exposes this bug on AVX.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147964 91177308-0d34-0410-b5e6-96231b3b80d8
This uses TLS slot 90, which actually belongs to JavaScriptCore. We only support
frames with static size
Patch by Brian Anderson.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147960 91177308-0d34-0410-b5e6-96231b3b80d8
hoped this would revive one of the llvm-gcc selfhost build bots, but it
didn't so it doesn't appear that my transform is the culprit.
If anyone else is seeing failures, please let me know!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147957 91177308-0d34-0410-b5e6-96231b3b80d8
strange build bot failures that look like a miscompile into an infloop.
I'll investigate this tomorrow, but I'd both like to know whether my
patch is the culprit, and get the bots back to green.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147945 91177308-0d34-0410-b5e6-96231b3b80d8
detect a pattern which can be implemented with a small 'shl' embedded in
the addressing mode scale. This happens in real code as follows:
unsigned x = my_accelerator_table[input >> 11];
Here we have some lookup table that we look into using the high bits of
'input'. Each entity in the table is 4-bytes, which means this
implicitly gets turned into (once lowered out of a GEP):
*(unsigned*)((char*)my_accelerator_table + ((input >> 11) << 2));
The shift right followed by a shift left is canonicalized to a smaller
shift right and masking off the low bits. That hides the shift right
which x86 has an addressing mode designed to support. We now detect
masks of this form, and produce the longer shift right followed by the
proper addressing mode. In addition to saving a (rather large)
instruction, this also reduces stalls in Intel chips on benchmarks I've
measured.
In order for all of this to work, one part of the DAG needs to be
canonicalized *still further* than it currently is. This involves
removing pointless 'trunc' nodes between a zextload and a zext. Without
that, we end up generating spurious masks and hiding the pattern.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147936 91177308-0d34-0410-b5e6-96231b3b80d8
Consider this code:
int h() {
int x;
try {
x = f();
g();
} catch (...) {
return x+1;
}
return x;
}
The variable x is undefined on the first edge to the landing pad, but it
has the f() return value on the second edge to the landing pad.
SplitAnalysis::getLastSplitPoint() would assume that the return value
from f() was live into the landing pad when f() throws, which is of
course impossible.
Detect these cases, and treat them as if the landing pad wasn't there.
This allows spill code to be inserted after the function call to f().
<rdar://problem/10664933>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147912 91177308-0d34-0410-b5e6-96231b3b80d8
Add a test that checks the stack alignment of a simple function for
Darwin, Linux and NetBSD for 32bit and 64bit mode.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147888 91177308-0d34-0410-b5e6-96231b3b80d8
define physical registers. It's currently very restrictive, only catching
cases where the CE is in an immediate (and only) predecessor. But it catches
a surprising large number of cases.
rdar://10660865
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147827 91177308-0d34-0410-b5e6-96231b3b80d8
opportunities that only present themselves after late optimizations
such as tail duplication .e.g.
## BB#1:
movl %eax, %ecx
movl %ecx, %eax
ret
The register allocator also leaves some of them around (due to false
dep between copies from phi-elimination, etc.)
This required some changes in codegen passes. Post-ra scheduler and the
pseudo-instruction expansion passes have been moved after branch folding
and tail merging. They were before branch folding before because it did
not always update block livein's. That's fixed now. The pass change makes
independently since we want to properly schedule instructions after
branch folding / tail duplication.
rdar://10428165
rdar://10640363
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147716 91177308-0d34-0410-b5e6-96231b3b80d8
a combined-away node and the result of the combine isn't substantially
smaller than the input, it's just canonicalized. This is the first part
of a significant (7%) performance gain for Snappy's hot decompression
loop.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147604 91177308-0d34-0410-b5e6-96231b3b80d8
Testing: passed 'make check' including LIT tests for all sequences being handled (both SSE and AVX)
Reviewers: Evan Cheng, David Blaikie, Bruno Lopes, Elena Demikhovsky, Chad Rosier, Anton Korobeynikov
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147601 91177308-0d34-0410-b5e6-96231b3b80d8
(x > y) ? x : y
=>
(x >= y) ? x : y
So for something like
(x - y) > 0 : (x - y) ? 0
It will be
(x - y) >= 0 : (x - y) ? 0
This makes is possible to test sign-bit and eliminate a comparison against
zero. e.g.
subl %esi, %edi
testl %edi, %edi
movl $0, %eax
cmovgl %edi, %eax
=>
xorl %eax, %eax
subl %esi, $edi
cmovsl %eax, %edi
rdar://10633221
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147512 91177308-0d34-0410-b5e6-96231b3b80d8
The failure seen on win32, when i64 type is illegal.
It happens on stage of conversion VECTOR_SHUFFLE to BUILD_VECTOR.
The failure message is:
llc: SelectionDAG.cpp:784: void VerifyNodeCommon(llvm::SDNode*): Assertion `(I->getValueType() == EltVT || (EltVT.isInteger() && I->getValueType().isInteger() && EltVT.bitsLE(I->getValueType()))) && "Wrong operand type!"' failed.
I added a special test that checks vector shuffle on win32.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147445 91177308-0d34-0410-b5e6-96231b3b80d8
The failure seen on win32, when i64 type is illegal.
It happens on stage of conversion VECTOR_SHUFFLE to BUILD_VECTOR.
The failure message is:
llc: SelectionDAG.cpp:784: void VerifyNodeCommon(llvm::SDNode*): Assertion `(I->getValueType() == EltVT || (EltVT.isInteger() && I->getValueType().isInteger() && EltVT.bitsLE(I->getValueType()))) && "Wrong operand type!"' failed.
I added a special test that checks vector shuffle on win32.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147399 91177308-0d34-0410-b5e6-96231b3b80d8
Promotion of the mask operand needs to be done using PromoteTargetBoolean, and not padded with garbage.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147309 91177308-0d34-0410-b5e6-96231b3b80d8
Matching MOVLP mask for AVX (265-bit vectors) was wrong.
The failure was detected by conformance tests.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147308 91177308-0d34-0410-b5e6-96231b3b80d8
LZCNT instructions are available. Force promotion to i32 to get
a smaller encoding since the fix-ups necessary are just as complex for
either promoted type
We can't do standard promotion for CTLZ when lowering through BSR
because it results in poor code surrounding the 'xor' at the end of this
instruction. Essentially, if we promote the entire CTLZ node to i32, we
end up doing the xor on a 32-bit CTLZ implementation, and then
subtracting appropriately to get back to an i8 value. Instead, our
custom logic just uses the knowledge of the incoming size to compute
a perfect xor. I'd love to know of a way to fix this, but so far I'm
drawing a blank. I suspect the legalizer could be more clever and/or it
could collude with the DAG combiner, but how... ;]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147251 91177308-0d34-0410-b5e6-96231b3b80d8
my C-brain happy. Remove the unnecessary bits of pedantic IR fluff like
nounwind. Remove stray uses comments. Name things semantically rather
than tN so that adding a new test in the middle doesn't cause pain, and
so that new tests can be grouped semantically.
This exposes how little systematic testing is going on here. I noticed
this by finding several bugs via inspection and wondering why this test
wasn't catching any of them. =[
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147248 91177308-0d34-0410-b5e6-96231b3b80d8
'bsf' instructions here.
This one is actually debatable to my eyes. It's not clear that any chip
implementing 'tzcnt' would have a slow 'bsf' for any reason, and unless
EFLAGS or a zero input matters, 'tzcnt' is just a longer encoding.
Still, this restores the old behavior with 'tzcnt' enabled for now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147246 91177308-0d34-0410-b5e6-96231b3b80d8
X86ISelLowering C++ code. Because this is lowered via an xor wrapped
around a bsr, we want the dagcombine which runs after isel lowering to
have a chance to clean things up. In particular, it is very common to
see code which looks like:
(sizeof(x)*8 - 1) ^ __builtin_clz(x)
Which is trying to compute the most significant bit of 'x'. That's
actually the value computed directly by the 'bsr' instruction, but if we
match it too late, we'll get completely redundant xor instructions.
The more naive code for the above (subtracting rather than using an xor)
still isn't handled correctly due to the dagcombine getting confused.
Also, while here fix an issue spotted by inspection: we should have been
expanding the zero-undef variants to the normal variants when there is
an 'lzcnt' instruction. Do so, and test for this. We don't want to
generate unnecessary 'bsr' instructions.
These two changes fix some regressions in encoding and decoding
benchmarks. However, there is still a *lot* to be improve on in this
type of code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@147244 91177308-0d34-0410-b5e6-96231b3b80d8
use the zero-undefined variants of CTTZ and CTLZ. These are just simple
patterns for now, there is more to be done to make real world code using
these constructs be optimized and codegen'ed properly on X86.
The existing tests are spiffed up to check that we no longer generate
unnecessary cmov instructions, and that we generate the very important
'xor' to transform bsr which counts the index of the most significant
one bit to the number of leading (most significant) zero bits. Also they
now check that when the variant with defined zero result is used, the
cmov is still produced.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@146974 91177308-0d34-0410-b5e6-96231b3b80d8
I followed three heuristics for deciding whether to set 'true' or
'false':
- Everything target independent got 'true' as that is the expected
common output of the GCC builtins.
- If the target arch only has one way of implementing this operation,
set the flag in the way that exercises the most of codegen. For most
architectures this is also the likely path from a GCC builtin, with
'true' being set. It will (eventually) require lowering away that
difference, and then lowering to the architecture's operation.
- Otherwise, set the flag differently dependending on which target
operation should be tested.
Let me know if anyone has any issue with this pattern or would like
specific tests of another form. This should allow the x86 codegen to
just iteratively improve as I teach the backend how to differentiate
between the two forms, and everything else should remain exactly the
same.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@146370 91177308-0d34-0410-b5e6-96231b3b80d8
We must not issue a bitcast operation for integer-promotion of vector types, because the
location of the values in the vector may be different.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@146150 91177308-0d34-0410-b5e6-96231b3b80d8
libgcc sets the stack limit field in TCB to 256 bytes above the actual
allocated stack limit. This means if the function's stack frame needs
less than 256 bytes, we can just compare the stack pointer with the
stack limit. This should result in lesser calls to __morestack.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145766 91177308-0d34-0410-b5e6-96231b3b80d8
Currently LLVM pads the call to __morestack with a add and sub of 8
bytes to esp. This isn't correct since __morestack expects the call
to be followed directly by a ret.
This commit also adjusts the relevant test-case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145765 91177308-0d34-0410-b5e6-96231b3b80d8
Like V_SET0, these instructions are expanded by ExpandPostRA to xorps /
vxorps so they can participate in execution domain swizzling.
This also makes the AVX variants redundant.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145440 91177308-0d34-0410-b5e6-96231b3b80d8
Conservatively returns zero when the GV does not specify an alignment nor is it
initialized. Previously it returns ABI alignment for type of the GV. However, if
the type is a "packed" type, then the under-specified alignments is attached to
the load / store instructions. In that case, the alignment of the type cannot be
trusted.
rdar://10464621
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145300 91177308-0d34-0410-b5e6-96231b3b80d8
than ABI alignment. These are loads / stores from / to "packed" data structures.
Their alignments are intentionally under-specified.
rdar://10301431
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145273 91177308-0d34-0410-b5e6-96231b3b80d8
was centered around the premise of laying out a loop in a chain, and
then rotating that chain. This is good for preserving contiguous layout,
but bad for actually making sane rotations. In order to keep it safe,
I had to essentially make it impossible to rotate deeply nested loops.
The information needed to correctly reason about a deeply nested loop is
actually available -- *before* we layout the loop. We know the inner
loops are already fused into chains, etc. We lose information the moment
we actually lay out the loop.
The solution was the other alternative for this algorithm I discussed
with Benjamin and some others: rather than rotating the loop
after-the-fact, try to pick a profitable starting block for the loop's
layout, and then use our existing layout logic. I was worried about the
complexity of this "pick" step, but it turns out such complexity is
needed to handle all the important cases I keep teasing out of benchmarks.
This is, I'm afraid, a bit of a work-in-progress. It is still
misbehaving on some likely important cases I'm investigating in Olden.
It also isn't really tested. I'm going to try to craft some interesting
nested-loop test cases, but it's likely to be extremely time consuming
and I don't want to go there until I'm sure I'm testing the correct
behavior. Sadly I can't come up with a way of getting simple, fine
grained test cases for this logic. We need complex loop structures to
even trigger much of it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145183 91177308-0d34-0410-b5e6-96231b3b80d8
heavily on AnalyzeBranch. That routine doesn't behave as we want given
that rotation occurs mid-way through re-ordering the function. Instead
merely check that there are not unanalyzable branching constructs
present, and then reason about the CFG via successor lists. This
actually simplifies my mental model for all of this as well.
The concrete result is that we now will rotate more loop chains. I've
added a test case from Olden highlighting the effect. There is still
a bit more to do here though in order to regain all of the performance
in Olden.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145179 91177308-0d34-0410-b5e6-96231b3b80d8
trampoline forms. Both of these were correct in LLVM 3.0, and we don't
need to support LLVM 2.9 and earlier in mainline.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145174 91177308-0d34-0410-b5e6-96231b3b80d8
pass. This is designed to achieve one of the important optimizations
that the old code placement pass did, but more simply.
This is a somewhat rough and *very* conservative version of the
transform. We could get a lot fancier here if there are profitable cases
to do so. In particular, this only looks for a single pattern, it
insists that the loop backedge being rotated away is the last backedge
in the chain, and it doesn't provide any means of doing better in-loop
placement due to the rotation. However, it appears that it will handle
the important loops I am finding in the LLVM test suite.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145158 91177308-0d34-0410-b5e6-96231b3b80d8
was returning incorrect values in rare cases, and incorrectly marking
exact conversions as inexact in some more common cases. Fixes PR11406, and a
missed optimization in test/CodeGen/X86/fp-stack-O0.ll.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145141 91177308-0d34-0410-b5e6-96231b3b80d8
tablegen patterns for scalar FMA4 operations and intrinsic. Also
add tests for vfmaddsd.
Patch by Jan Sjodin
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145133 91177308-0d34-0410-b5e6-96231b3b80d8
need lots of fanciness around retaining a reference to a Chain's slot in
the BlockToChain map, but that's all gone now. We can just go directly
to allocating the new chain (which will update the mapping for us) and
using it.
Somewhat gross mechanically generated test case replicates the issue
Duncan spotted when actually testing this out.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145120 91177308-0d34-0410-b5e6-96231b3b80d8
conflicts, we should only be adding the first block of the chain to the
list, lest we try to merge into the middle of that chain. Most of the
places we were doing this we already happened to be looking at the first
block, but there is no reason to assume that, and in some cases it was
clearly wrong.
I've added a couple of tests here. One already worked, but I like having
an explicit test for it. The other is reduced from a test case Duncan
reduced for me and used to crash. Now it is handled correctly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145119 91177308-0d34-0410-b5e6-96231b3b80d8
Before:
movabsq $4294967296, %rax ## encoding: [0x48,0xb8,0x00,0x00,0x00,0x00,0x01,0x00,0x00,0x00]
testq %rax, %rdi ## encoding: [0x48,0x85,0xf8]
jne LBB0_2 ## encoding: [0x75,A]
After:
btq $32, %rdi ## encoding: [0x48,0x0f,0xba,0xe7,0x20]
jb LBB0_2 ## encoding: [0x72,A]
btq is usually slower than testq because it doesn't fuse with the jump, but here we're better off
saving one register and a giant movabsq.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145103 91177308-0d34-0410-b5e6-96231b3b80d8
further. This invariant just wasn't going to work in the face of
unanalyzable branches; we need to be resillient to the phenomenon of
chains poking into a loop and poking out of a loop. In fact, we already
were, we just needed to not assert on it.
This was found during a bootstrap with block placement turned on.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145100 91177308-0d34-0410-b5e6-96231b3b80d8
VSHUFPS/VSHUFPD instructions while lowering VECTOR_SHUFFLE node. I check a commuted VSHUFP mask.
The patch was reviewed by Bruno.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145099 91177308-0d34-0410-b5e6-96231b3b80d8
successors, they just are all landing pad successors. We handle this the
same way as no successors. Comments attached for the next person to wade
through here and another lovely test case courtesy of Benjamin Kramer's
bugpoint reduction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145098 91177308-0d34-0410-b5e6-96231b3b80d8
This was a bug in keeping track of the available domains when merging
domain values.
The wrong domain mask caused ExecutionDepsFix to try to move VANDPSYrr
to the integer domain which is only available in AVX2.
Also add an assertion to catch future attempts at emitting AVX2
instructions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145096 91177308-0d34-0410-b5e6-96231b3b80d8
reversed in the function's original ordering, and we happened to
encounter it while handling an outer unnatural CFG structure.
Thanks to the test case reduced from GCC's source by Benjamin Kramer.
This may also fix a crasher in gzip that Duncan reduced for me, but
I haven't yet gotten to testing that one.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145094 91177308-0d34-0410-b5e6-96231b3b80d8
updateTerminator code didn't correctly handle EH terminators in one very
specific case. AnalyzeBranch would find no terminator instruction, and
so the fallback in updateTerminator is to assume fallthrough. This is
correct, but the destination of the fallthrough was assumed to be the
first successor.
This is *almost always* true, but in certain cases the loop
transformations will cause the landing pad to be the first successor!
Instead of this brittle logic, actually look through the successors for
a non-landing-pad accessor, and to assert if more than one is found.
This will hopefully fix some (if not all) of the self host miscompiles
with block placement. Thanks to Benjamin Kramer for reporting, Nick
Lewycky for an initial stab at a reduction, and Duncan for endless
advice on EH (which I know nothing about) as well as reviewing the
actual fix.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145062 91177308-0d34-0410-b5e6-96231b3b80d8
properly account for the *global* probability of the edge being taken.
This manifested as a very large number of unconditional branches to
blocks being merged against the CFG even though they weren't
particularly hot within the CFG.
The fix is to check whether the edge being merged is both locally hot
relative to other successors for the source block, and globally hot
compared to other (unmerged) predecessors of the destination block.
This introduces a new crasher on GCC single-source, but it's currently
behind a flag, and Ben has offered to work on the reduction. =]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145010 91177308-0d34-0410-b5e6-96231b3b80d8
is actually being tested. Also add some FileCheck goodness to much more
carefully ensure that the result is the desired result. Before this test
would only have failed through an assert failure if the underlying fix
were reverted.
Also, add some weight metadata and a comment explaining exactly what is
going on to a trick section of the test case. Originally, we were
getting very unlucky and trying to form a block chain that isn't
actually profitable. I'm working on a fix to avoid forming these
unprofitable chains, and that would also have masked any failure from
this test case. The easy solution is to add some metadata that makes it
*really* profitable to form the bad chain here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@145006 91177308-0d34-0410-b5e6-96231b3b80d8
formation phase and into the initial walk of the basic blocks. We
essentially pre-merge all blocks where unanalyzable fallthrough exists,
as we won't be able to update the terminators effectively after any
reorderings. This is quite a bit more principled as there may be CFGs
where the second half of the unanalyzable pair has some analyzable
predecessor that gets placed first. Then it may get placed next,
implicitly breaking the unanalyzable branch even though we never even
looked at the part that isn't analyzable. I've included a test case that
triggers this (thanks Benjamin yet again!), and I'm hoping to synthesize
some more general ones as I dig into related issues.
Also, to make this new scheme work we have to be able to handle branches
into the middle of a chain, so add this check. We always fallback on the
incoming ordering.
Finally, this starts to really underscore a known limitation of the
current implementation -- we don't consider broken predecessors when
merging successors. This can caused major missed opportunities, and is
something I'm planning on looking at next (modulo more bug reports).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144994 91177308-0d34-0410-b5e6-96231b3b80d8
has a reference to it. Unfortunately, that doesn't work for codegen passes
since we don't get notified of MBB's being deleted (the original BB stays).
Use that fact to our advantage and after printing a function, check if
any of the IL BBs corresponds to a symbol that was not printed. This fixes
pr11202.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144674 91177308-0d34-0410-b5e6-96231b3b80d8
These tests are actually correct, clang was miscompiling ExeDepsFix::processUses.
Evan fixed the miscompilation in r144628.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144630 91177308-0d34-0410-b5e6-96231b3b80d8
block sequence when recovering from unanalyzable control flow
constructs, *always* use the function sequence. I'm not sure why I ever
went down the path of trying to use the loop sequence, it is
fundamentally not the correct sequence to use. We're trying to preserve
the incoming layout in the cases of unreasonable control flow, and that
is only encoded at the function level. We already have a filter to
select *exactly* the sub-set of blocks within the function that we're
trying to form into a chain.
The resulting code layout is also significantly better because of this.
In several places we were ending up with completely unreasonable control
flow constructs due to the ordering chosen by the loop structure for its
internal storage. This change removes a completely wasteful vector of
basic blocks, saving memory allocation in the common case even though it
costs us CPU in the fairly rare case of unnatural loops. Finally, it
fixes the latest crasher reduced out of GCC's single source. Thanks
again to Benjamin Kramer for the reduction, my bugpoint skills failed at
it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144627 91177308-0d34-0410-b5e6-96231b3b80d8
Two new TargetInstrInfo hooks lets the target tell ExecutionDepsFix
about instructions with partial register updates causing false unwanted
dependencies.
The ExecutionDepsFix pass will break the false dependencies if the
updated register was written in the previoius N instructions.
The small loop added to sse-domains.ll runs twice as fast with
dependency-breaking instructions inserted.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144602 91177308-0d34-0410-b5e6-96231b3b80d8
instructions of the two-address operands) in order to avoid inserting copies.
This fixes the few regressions introduced when the two-address hack was
disabled (without regressing the improvements).
rdar://10422688
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144559 91177308-0d34-0410-b5e6-96231b3b80d8
Constant idx case is still done in tablegen but other cases are then expanded
Fixes <rdar://problem/10435460>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144557 91177308-0d34-0410-b5e6-96231b3b80d8
the sum of the edge weights not overflowing uint32, and crashed when
they did. This is generally safe as BranchProbabilityInfo tries to
provide this guarantee. However, the CFG can get modified during codegen
in a way that grows the *sum* of the edge weights. This doesn't seem
unreasonable (imagine just adding more blocks all with the default
weight of 16), but it is hard to come up with a case that actually
triggers 32-bit overflow. Fortuately, the single-source GCC build is
good at this. The solution isn't very pretty, but its no worse than the
previous code. We're already summing all of the edge weights on each
query, we can sum them, check for an overflow, compute a scale, and sum
them again.
I've included a *greatly* reduced test case out of the GCC source that
triggers it. It's a pretty lame test, as it clearly is just barely
triggering the overflow. I'd like to have something that is much more
definitive, but I don't understand the fundamental pattern that triggers
an explosion in the edge weight sums.
The buggy code is duplicated within this file. I'll colapse them into
a single implementation in a subsequent commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144526 91177308-0d34-0410-b5e6-96231b3b80d8
get loop info structures associated with them, and so we need some way
to make forward progress selecting and placing basic blocks. The
technique used here is pretty brutal -- it just scans the list of blocks
looking for the first unplaced candidate. It keeps placing blocks like
this until the CFG becomes tractable.
The cost is somewhat unfortunate, it requires allocating a vector of all
basic block pointers eagerly. I have some ideas about how to simplify
and optimize this, but I'm trying to get the logic correct first.
Thanks to Benjamin Kramer for the reduced test case out of GCC. Sadly
there are other bugs that GCC is tickling that I'm reducing and working
on now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144516 91177308-0d34-0410-b5e6-96231b3b80d8
second algorithm, but only loosely. It is more heavily based on the last
discussion I had with Andy. It continues to walk from the inner-most
loop outward, but there is a key difference. With this algorithm we
ensure that as we visit each loop, the entire loop is merged into
a single chain. At the end, the entire function is treated as a "loop",
and merged into a single chain. This chain forms the desired sequence of
blocks within the function. Switching to a single algorithm removes my
biggest problem with the previous approaches -- they had different
behavior depending on which system triggered the layout. Now there is
exactly one algorithm and one basis for the decision making.
The other key difference is how the chain is formed. This is based
heavily on the idea Andy mentioned of keeping a worklist of blocks that
are viable layout successors based on the CFG. Having this set allows us
to consistently select the best layout successor for each block. It is
expensive though.
The code here remains very rough. There is a lot that needs to be done
to clean up the code, and to make the runtime cost of this pass much
lower. Very much WIP, but this was a giant chunk of code and I'd rather
folks see it sooner than later. Everything remains behind a flag of
course.
I've added a couple of tests to exercise the issues that this iteration
was motivated by: loop structure preservation. I've also fixed one test
that was exhibiting the broken behavior of the previous version.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144495 91177308-0d34-0410-b5e6-96231b3b80d8
It was off by default.
The new register allocators don't have the problems that made it
necessary to reallocate registers during stack slot coloring.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144481 91177308-0d34-0410-b5e6-96231b3b80d8
This test was committed with a bugfix to RemoveCopyByCommutingDef, but
that optimization is no longer triggered by this test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144470 91177308-0d34-0410-b5e6-96231b3b80d8
This test doesn't expose the issue with RAGreedy.
I filed PR11363 to track the missing InlineSpiller feature.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144459 91177308-0d34-0410-b5e6-96231b3b80d8
The test is checking that the output doesn't contains any 'mov '
strings. It does contain movl, though.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144458 91177308-0d34-0410-b5e6-96231b3b80d8
instruction lower optimization" in the pre-RA scheduler.
The optimization, rather the hack, was done before MI use-list was available.
Now we should be able to implement it in a better way, perhaps in the
two-address pass until a MI scheduler is available.
Now that the scheduler has to backtrack to handle call sequences. Adding
artificial scheduling constraints is just not safe. Furthermore, the hack
is not taking all the other scheduling decisions into consideration so it's just
as likely to pessimize code. So I view disabling this optimization goodness
regardless of PR11314.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144267 91177308-0d34-0410-b5e6-96231b3b80d8
The TII.foldMemoryOperand hook preserves implicit operands from the
original instruction. This is not what we want when those implicit
operands refer to the register being spilled.
Implicit operands referring to other registers are preserved.
This fixes PR11347.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144247 91177308-0d34-0410-b5e6-96231b3b80d8
dragonegg self-host buildbot will recover (it is complaining about object
files differing between different build stages). Original commit message:
Add a hack to the scheduler to disable pseudo-two-address dependencies in
basic blocks containing calls. This works around a problem in which
these artificial dependencies can get tied up in calling seqeunce
scheduling in a way that makes the graph unschedulable with the current
approach of using artificial physical register dependencies for calling
sequences. This fixes PR11314.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144188 91177308-0d34-0410-b5e6-96231b3b80d8
During the initial RPO traversal of the basic blocks, remember the ones
that are incomplete because of back-edges from predecessors that haven't
been visited yet.
After the initial RPO, revisit all those loop headers so the incoming
DomainValues on the back-edges can be properly collapsed.
This will properly fix execution domains on software pipelined code,
like the included test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144151 91177308-0d34-0410-b5e6-96231b3b80d8
basic blocks containing calls. This works around a problem in which
these artificial dependencies can get tied up in calling seqeunce
scheduling in a way that makes the graph unschedulable with the current
approach of using artificial physical register dependencies for calling
sequences. This fixes PR11314.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144124 91177308-0d34-0410-b5e6-96231b3b80d8
DomainValues that are only used by "don't care" instructions are now
collapsed to the first possible execution domain after all basic blocks
have been processed. This typically means the PS domain on x86.
For example, the vsel_i64 and vsel_double functions in sse2-blend.ll are
completely collapsed to the PS domain instead of containing a mix of
execution domains created by isel.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@144037 91177308-0d34-0410-b5e6-96231b3b80d8