225 is the default value of inline-threshold. This change will make sure
we have the same inlining behavior as prior to r200886.
As Chandler points out, even though we don't have code in our testing
suite that uses the cold attribute, there are larger applications that do
use it.
r200886 + this commit are intended to keep the same behavior as prior to r200886.
We can tune the inlinecold-threshold later on.
The main purpose of r200886 is to help performance of instrumentation based
PGO before we actually hook up the inliner with analysis passes such as BPI and BFI.
For instrumentation based PGO, we try to increase inlining of hot functions and
reduce inlining of cold functions by setting inlinecold-threshold.
Another option suggested by Chandler is to use a boolean flag that controls
whether we should use OptSizeThreshold for cold functions. The default value
of the boolean flag should not change the current behavior. But it gives us
less freedom in controlling inlining of cold functions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200898 91177308-0d34-0410-b5e6-96231b3b80d8
Added a command line option, inlinecold-threshold, to set the threshold for
inlining functions with the cold attribute. Listen to the cold attribute when
it would decrease the inline threshold.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200886 91177308-0d34-0410-b5e6-96231b3b80d8
Add the missing transformation strchr(p, 0) -> p + strlen(p) to SimplifyLibCalls
and remove the ToDo comment.
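For illustration, the two forms below compute the same pointer at the source
level (a sketch of the equivalence only, not the SimplifyLibCalls code):
  #include <cstring>

  // Illustration: strchr(p, 0) returns a pointer to the terminating NUL,
  // which is the same address as p advanced by strlen(p).
  const char *end_via_strchr(const char *p) { return std::strchr(p, 0); }
  const char *end_via_strlen(const char *p) { return p + std::strlen(p); }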
Reviewer: Duncan P. N. Exon Smith
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200736 91177308-0d34-0410-b5e6-96231b3b80d8
It disturbs the layout of the parameters in memory and registers,
leading to problems in the backend.
The plan for optimizing internal inalloca functions going forward is to
essentially SROA the argument memory and demote any captured arguments
(things that aren't trivially written by a load or store) to an indirect
pointer to a static alloca.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200717 91177308-0d34-0410-b5e6-96231b3b80d8
LowerExpectIntrinsic previously only understood the idiom of an expect
intrinsic followed by a comparison with zero. For llvm.expect.i1, the
comparison would be stripped by the early-cse pass.
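For context, this is the kind of source that can give rise to the llvm.expect.i1
form (a hedged sketch; the exact lowering depends on the frontend):
  // A boolean condition under __builtin_expect can end up as llvm.expect.i1
  // applied to the comparison result itself, so no separate compare with
  // zero survives for LowerExpectIntrinsic to match once early-cse has run.
  long count_positive(const long *v, long n) {
    long hot = 0;
    for (long i = 0; i < n; ++i)
      if (__builtin_expect(v[i] > 0, 1))
        ++hot;
    return hot;
  }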
Patch by Daniel Micay.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200664 91177308-0d34-0410-b5e6-96231b3b80d8
unrolling heuristic by default
Benchmarking on x86_64 (thanks Chandler!) and ARM has shown those options speed
up some benchmarks while not causing any interesting regressions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200621 91177308-0d34-0410-b5e6-96231b3b80d8
LCSSA when we promote to SSA registers inside of LICM.
Currently, this is actually necessary. The promotion logic in LICM uses
SSAUpdater which doesn't understand how to place LCSSA PHI nodes.
Teaching it to do so would be a very significant undertaking. It may be
worthwhile and I've left a FIXME about this in the code as well as
starting a thread on llvmdev to try to figure out the right long-term
solution.
For now, the PR needs to be fixed. Short of using the promotion's
SSAUpdater to place both the LCSSA PHI nodes and the promoted PHI nodes,
I don't see a cleaner or cheaper way of achieving this. Fortunately,
LCSSA is relatively lazy and sparse -- it should only update
instructions which need it. We can also skip the recursive variant when
we don't promote to SSA values.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200612 91177308-0d34-0410-b5e6-96231b3b80d8
cost so that they don't impact the vector bonus. Fundamentally, counting
unsimplified instructions is just *wrong*; it will continue to introduce
instability as things which do not generate code bizarrely impact
inlining. For example, sufficiently nested inlined functions could turn
off the vector bonus with lifetime markers just like the debug
intrinsics do. =/
This is a short-term tactical fix. Long term, I think we need to remove
the vector bonus entirely. That's a separate patch and discussion
though.
The patch to fix this was provided by Dario Domizioli. I've added some
comments about the planned direction and used a heavily pruned form of
debug info intrinsics for the test case. While this debug info doesn't
work or "do" anything useful, it lets us easily test all manner of
interference easily, and I suspect this will not be the last time we
want to craft a pattern where debug info interferes with the inliner in
a problematic way.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200609 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r200576. It broke 32-bit self-host builds by
vectorizing two calls to @llvm.bswap.i64, which we then fail to expand.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200602 91177308-0d34-0410-b5e6-96231b3b80d8
transform accordingly. Based on similar code from Loop vectorization.
Subsequent commits will include vectorization of function calls to
vector intrinsics and forming function calls to vector library calls.
Patch by Raul Silvera! (Much delayed due to my not running dcommit)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200576 91177308-0d34-0410-b5e6-96231b3b80d8
loop vectorizer to not do so when runtime pointer checks are needed and
share code with the new (not yet enabled) load/store saturation runtime
unrolling. Also ensure that we only consider the runtime checks when the
loop hasn't already been vectorized. If it has, the runtime check cost
has already been paid.
I've fleshed out a test case to cover the scalar unrolling as well as
the vector unrolling and comment clearly why we are or aren't following
the pattern.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200530 91177308-0d34-0410-b5e6-96231b3b80d8
This doesn't set errno, so this should be OK.
Also update the documentation to explicitly state
that errno is not set.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200501 91177308-0d34-0410-b5e6-96231b3b80d8
preserve loop simplify of enclosing loops.
The problem here starts with LoopRotation which ends up cloning code out
of the latch into the new preheader it is building. This can create
a new edge from the preheader into the exit block of the loop which
breaks LoopSimplify form. The code tries to fix this by splitting the
critical edge between the latch and the exit block to get a new exit
block that only the latch dominates. This sadly isn't sufficient.
The exit block may be an exit block for multiple nested loops. When we
clone an edge from the latch of the inner loop to the new preheader
being built in the outer loop, we create an exiting edge from the outer
loop to this exit block. Despite breaking the LoopSimplify form for the
inner loop, this is fine for the outer loop. However, when we split the
edge from the inner loop to the exit block, we create a new block which
is in neither the inner nor outer loop as the new exit block. This is
a predecessor to the old exit block, and so the split itself takes the
outer loop out of LoopSimplify form. We need to split every edge
entering the exit block from inside a loop nested more deeply than the
exit block in order to preserve all of the loop simplify constraints.
Once we try to do that, a problem with splitting critical edges
surfaces. Previously, we took a very brute force approach to updating
LoopSimplify form by re-computing it for all exit blocks. We don't need to do this,
and doing this much will sometimes but not always overlap with the
LoopRotate bug fix. Instead, the code needs to specifically handle the
cases which can start to violate LoopSimplify -- they aren't that
common. We need to see if the destination of the split edge was a loop
exit block in simplified form for the loop of the source of the edge.
For this to be true, all the predecessors need to be in the exact same
loop as the source of the edge being split. If the dest block was
originally in this form, we have to split all of the edges back into
this loop to recover it. The old mechanism of doing this was
conservatively correct because at least *one* of the exiting blocks it
rewrote was the DestBB and so the DestBB's predecessors were fixed. But
this is a much more targeted way of doing it. Making it targeted is
important, because ballooning the set of edges touched prevents
LoopRotate from being able to split edges *it* needs to split to
preserve loop simplify in a coherent way -- the critical edge splitting
would sometimes find the other edges in need of splitting but not
others.
Many, *many* thanks to Nick for help reducing these test cases
mightily, and for helping lots with the analysis here, as this one was quite
tricky to track down.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200393 91177308-0d34-0410-b5e6-96231b3b80d8
because of the inside-out run of LoopSimplify in the LoopPassManager and
the fact that LoopSimplify couldn't be "preserved" across two
independent LoopPassManagers.
Anyways, in that case, IndVars wasn't correctly preserving an LCSSA PHI
node because it thought it was rewriting (via SCEV) the incoming value
to a loop invariant value. While it may well be invariant for the
current loop, it may be rewritten in terms of an enclosing loop's
values. This in and of itself is fine, as the LCSSA PHI node in the
enclosing loop for the inner loop value we're rewriting will have its
own LCSSA PHI node if used outside of the enclosing loop. With me so
far?
Well, the current loop and the enclosing loop may share an exiting
block and exit block, and when they do they also share LCSSA PHI nodes.
In this case, it's not valid to RAUW through the LCSSA PHI node.
Expected crazy test included.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200372 91177308-0d34-0410-b5e6-96231b3b80d8
When simplifycfg moves an instruction, it must drop any metadata it doesn't know
is still valid under the changed preconditions. In particular, it must drop
the range and tbaa metadata.
The patch implements this with a utility function that drops all metadata not
on a white list.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200322 91177308-0d34-0410-b5e6-96231b3b80d8
vectorizer, placing it behind an off-by-default flag.
It turns out that block frequency isn't what we want at all, here or
elsewhere. This has been I think a nagging feeling for several of us
working with it, but Arnold has given some really nice simple examples
where the results are so comprehensively wrong that they aren't useful.
I'm planning to email the dev list with a summary of why it's not really
useful and a couple of ideas about how to better structure these types
of heuristics.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200294 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
I searched Transforms/ and Analysis/ for 'ByVal' and updated those call
sites to check for inalloca if appropriate.
I added tests for any change that would allow an optimization to fire on
inalloca.
Reviewers: nlewycky
Differential Revision: http://llvm-reviews.chandlerc.com/D2449
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200281 91177308-0d34-0410-b5e6-96231b3b80d8
The vectorizer takes a loop like this and widens all instructions except for the
store. The stores are scalarized/unrolled and hidden behind an "if" block.
  for (i = 0; i < 128; ++i) {
    if (a[i] < 10)
      a[i] += val;
  }

  for (i = 0; i < 128; i += 2) {
    v = a[i:i+1];
    v0 = (extract v, 0) + 10;
    v1 = (extract v, 1) + 10;
    if (v0 < 10)
      a[i] = v0;
    if (v1 < 10)
      a[i+1] = v1;
  }
The vectorizer relies on subsequent optimizations to sink instructions into the
conditional block where they are anticipated.
The flag "vectorize-num-stores-pred" controls whether and how many stores to
handle this way. Vectorization of conditional stores is disabled per default for
now.
This patch also adds a change to the heuristic when the flag
"enable-loadstore-runtime-unroll" is enabled (off by default). It unrolls small
loops until load/store ports are saturated. This heuristic uses TTI's
getMaxUnrollFactor as a measure for load/store ports.
I also added a second flag -enable-cond-stores-vec. It will enable vectorization
of conditional stores. But there is no cost model for vectorization of
conditional stores in place yet, so this will not do much good at the moment.
rdar://15892953
Results for x86-64 -O3 -mavx +/- -mllvm -enable-loadstore-runtime-unroll
-vectorize-num-stores-pred=1 (before the BFI change):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.35% (maze3() is identical but 10% slower)
Applications/siod/siod 2.18%
Performance improvements:
mesa -4.42%
libquantum -4.15%
With a patch that slightly changes the register heuristics (by subtracting the
induction variable on both sides of the register pressure equation, as the
induction variable is probably not really unrolled):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.73%
Applications/siod/siod 1.97%
Performance Improvements:
libquantum -13.05% (we now also unroll quantum_toffoli)
mesa -4.27%
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200270 91177308-0d34-0410-b5e6-96231b3b80d8
uint32.
When folding branches to a common destination, the updated branch weights
can exceed uint32 by more than a factor of 2. We should keep halving the
weights until they fit into uint32.
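A minimal sketch of the halving idea, using a hypothetical helper rather than
the actual SimplifyCFG code:
  #include <cstdint>
  #include <vector>

  // Keep halving all the weights together until every one fits into a
  // uint32_t, preserving their relative ratios.
  void fitWeightsIntoUInt32(std::vector<uint64_t> &Weights) {
    bool TooBig = true;
    while (TooBig) {
      TooBig = false;
      for (uint64_t W : Weights)
        if (W > UINT32_MAX) { TooBig = true; break; }
      if (TooBig)
        for (uint64_t &W : Weights)
          W /= 2;
    }
  }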
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200262 91177308-0d34-0410-b5e6-96231b3b80d8
cold loops as-if they were being optimized for size.
Nothing fancy here. Simply test case included. The nice thing is that we
can now incrementally build on top of this to drive other heuristics.
All of the infrastructure work is done to get the profile information
into this layer.
The remaining work necessary to make this a fully general purpose loop
unroller for very hot loops is to make it a fully general purpose loop
unroller. Things I know of but am not going to have time to benchmark
and fix in the immediate future:
1) Don't disable the entire pass when the target is lacking vector
registers. This really doesn't make any sense any more.
2) Teach the unroller at least and the vectorizer potentially to handle
non-if-converted loops. This is trivial for the unroller but hard for
the vectorizer.
3) Compute the relative hotness of the loop and thread that down to the
various places that make cost tradeoffs (very likely only the
unroller makes sense here, and then only when dealing with loops that
are small enough for unrolling to not completely blow out the LSD).
I'm still dubious how useful hotness information will be. So far, my
experiments show that if we can get the correct logic for determining
when unrolling actually helps performance, the code size impact is
completely unimportant and we can unroll in all cases. But at least
we'll no longer burn code size on cold code.
One somewhat unrelated idea that I've had forever but not had time to
implement: mark all functions which are only reachable via the global
constructors rigging in the module as optsize. This would also decrease
the impact of any more aggressive heuristics here on code size.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200219 91177308-0d34-0410-b5e6-96231b3b80d8
to stabilize a test that really is trying to test generic behavior and
not a specific target's behavior.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200215 91177308-0d34-0410-b5e6-96231b3b80d8
powers of two. This is essentially always the correct thing given the
impact on alignment, scaling factors that can be used in addressing
modes, etc. Also, fix the management of the unroll vs. small loop cost
to more accurately model things with this world.
Enhance a test case to actually exercise more of the unroll machinery if
using synthetic constants rather than a specific target model. Before
this change, with the added flags this test will unroll 3 times instead
of either 2 or 4 (the two sensible answers).
While I don't expect this to make a huge difference, if there are lots
of loops sitting right on the edge of hitting the 'small unroll' factor,
they might change behavior. However, I've benchmarked moving the small
loop cost up and down in many various ways and by a huge factor (2x)
without seeing more than 0.2% code size growth. Small adjustments such
as the series that led up here have led to about 1% improvement on some
benchmarks, but it is very close to the noise floor so I mostly checked
that nothing regressed. Let me know if you see bad behavior on other
targets but I don't expect this to be a sufficiently dramatic change to
trigger anything.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200213 91177308-0d34-0410-b5e6-96231b3b80d8
the loops in a function, and teach LICM to work in the presence of
LCSSA.
Previously, LCSSA was a loop pass. That made passes requiring it also be
loop passes and unable to depend on function analysis passes easily. It
also caused outer loops to have a different "canonical" form from inner
loops during analysis. Instead, we go into LCSSA form and preserve it
through the loop pass manager run.
Note that this has the same problem as LoopSimplify that prevents
enabling its verification -- loop passes which run at the end of the loop
pass manager and don't preserve these are valid, but the subsequent loop
pass runs of outer loops that do preserve this pass trigger too much
verification and fail because the inner loop no longer verifies.
The other problem this exposed is that LICM was completely unable to
handle LCSSA form. It didn't preserve it and it actually would give up
on moving instructions in many cases when they were used by an LCSSA phi
node. I've taught LICM to support detecting LCSSA-form PHI nodes and to
hoist and sink around them. This may actually let LICM fire
significantly more because we put everything into LCSSA form to rotate
the loop before running LICM. =/ Now LICM should handle that fine and
preserve it correctly. The down side is that LICM has to require LCSSA
in order to preserve it. This is just a fact of life for LCSSA. It's
entirely possible we should completely remove LCSSA from the optimizer.
The test updates are essentially accommodating LCSSA phi nodes in the
output of LICM, and the fact that we now completely sink every
instruction in ashr-crash below the loop bodies prior to unrolling.
With this change, LCSSA is computed only three times in the pass
pipeline. One of them could be removed (and potentially a SCEV run and
a separate LoopPassManager entirely!) if we had a LoopPass variant of
InstCombine that ran InstCombine on the loop body but refused to combine
away LCSSA PHI nodes. Currently, this also prevents loop unrolling from
being in the same loop pass manager as rotate, LICM, and unswitch.
There is one thing that I *really* don't like -- preserving LCSSA in
LICM is quite expensive. We end up having to re-run LCSSA twice for some
loops after LICM runs because LICM can undo LCSSA both in the current
loop and the parent loop. I don't really see good solutions to this
other than to completely move away from LCSSA and using tools like
SSAUpdater instead.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200067 91177308-0d34-0410-b5e6-96231b3b80d8
Sweep the codebase for common typos. Includes some changes to visible function
names that were misspelt.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200018 91177308-0d34-0410-b5e6-96231b3b80d8
Argument promotion can replace an argument of a call with an alloca. This
requires clearing the tail marker as it is very likely that the callee is now
using an alloca in the caller.
This fixes pr14710.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199909 91177308-0d34-0410-b5e6-96231b3b80d8
function and a FunctionPass.
This has many benefits. The motivating use case was to be able to
compute function analysis passes *after* running LoopSimplify (to avoid
invalidating them) and then to run other passes which require
LoopSimplify. Specifically passes like unrolling and vectorization are
critical to wire up to BranchProbabilityInfo and BlockFrequencyInfo so
that they can be profile aware. For the LoopVectorize pass the only
things in the way are LoopSimplify and LCSSA. This fixes LoopSimplify
and LCSSA is next on my list.
There are also a bunch of other benefits of doing this:
- It is now very feasible to make more passes *preserve* LoopSimplify
because they can simply run it after changing a loop. Because
subsequent passes can assume LoopSimplify is preserved, we can reduce
the runs of this pass to the times when we actually mutate a loop
structure.
- The new pass manager should be able to more easily support loop passes
factored in this way.
- We can at long, long last observe that LoopSimplify is preserved
across SCEV. This *halves* the number of times we run LoopSimplify!!!
Now, getting here wasn't trivial. First off, the interfaces used by
LoopSimplify are all over the map regarding how analyses are updated. We
end up with weird "pass" parameters as a consequence. I'll try to clean
at least some of this up later -- I'll have to have it all clean for the
new pass manager.
Next up I discovered a really frustrating bug. LoopUnroll *claims* to
preserve LoopSimplify. That's actually a lie. But the way the
LoopPassManager ends up running the passes, it always ran LoopSimplify
on the unrolled-into loop, rectifying this oversight before any
verification could kick in and point out that in fact nothing was
preserved. So I've added code to the unroller to *actually* simplify the
surrounding loop when it succeeds at unrolling.
The only functional change in the test suite is that we now catch a case
that was previously missed because SCEV and other loop transforms see
their containing loops as simplified and thus don't miss some
opportunities. One test case has been converted to check that we catch
this case rather than checking that we miss it but at least don't get
the wrong answer.
Note that I have #if-ed out all of the verification logic in
LoopSimplify! This is a temporary workaround while extracting these bits
from the LoopPassManager. Currently, there is no way to have a pass in
the LoopPassManager which preserves LoopSimplify along with one which
does not. The LPM will try to verify on each loop in the nest that
LoopSimplify holds but the now-Function-pass cannot distinguish what
loop is being verified and so must try to verify all of them. The
innermost loop is clearly no longer simplified as there is a pass which
didn't even *attempt* to preserve it. =/ Once I get LCSSA out (and maybe
LoopVectorize and some other fixes) I'll be able to re-enable this check
and catch any places where we are still failing to preserve
LoopSimplify. If this causes problems I can back this out and try to
commit *all* of this at once, but so far this seems to work and allow
much more incremental progress.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199884 91177308-0d34-0410-b5e6-96231b3b80d8
This logic hadn't been updated to handle FastMathFlags, and it took me a while to detect it because it doesn't show up in a simple search for CreateFAdd.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199629 91177308-0d34-0410-b5e6-96231b3b80d8
intrinsics.
Reported on the list by Evan with a couple of attempts to fix, but it
took a while to dig down to the root cause. There are two overlapping
bugs here, both centering around the circumstance of discovering
a memcpy operand which is known to be completely outside the bounds of
the alloca.
First, we need to kill the *other* side of the memcpy if it was added to
this alloca. Otherwise we'll factor it into our slicing and try to
rewrite it even though we know for a fact that it is dead. This is made
more tricky because we can visit the sides in either order. So we have
to both kill the other side and skip instructions marked as dead. The
latter really should be goodness in every case, but here it is a matter of
correctness.
Second, we need to actually remove the *uses* of the alloca by the
memcpy when queuing it for later deletion. Otherwise it may still be
using the alloca when we go to promote it (if the rewrite re-uses the
existing alloca instruction). Do this by factoring out the
use-clobbering code used for nixing a Phi argument and re-using it
across the operands of a to-be-deleted instruction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199590 91177308-0d34-0410-b5e6-96231b3b80d8
a reduction.
Really. Under certain circumstances (the use list of an instruction has to be
set up right - hence the extra pass in the test case) we would not recognize
when a value in a potential reduction cycle was used multiple times by the
reduction cycle.
Fixes PR18526.
radar://15851149
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199570 91177308-0d34-0410-b5e6-96231b3b80d8
There has been an old FIXME to find the right cut-off for when it's worth
analyzing and potentially transforming a switch to a lookup table.
The switches always have two or more cases. I could not measure any speed-up
by transforming a switch with two cases. A switch with three cases gets a nice
speed-up, and I couldn't measure any compile-time regression, so I think this
is the right threshold.
In a Clang self-host, this causes 480 new switches to be transformed,
and reduces the final binary size by 8 KB.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199294 91177308-0d34-0410-b5e6-96231b3b80d8
case when the lookup table doesn't have any holes.
This means we can build a lookup table for switches like this:
  switch (x) {
  case 0: return 1;
  case 1: return 2;
  case 2: return 3;
  case 3: return 4;
  default: exit(1);
  }
The default case doesn't yield a constant result here, but that doesn't matter,
since a default result is only necessary for filling holes in the lookup table,
and this table doesn't have any holes.
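Roughly, the result looks like the following at the source level (hypothetical
names; SimplifyCFG performs this on the IR):
  #include <cstdlib>

  // Sketch: the switch becomes a bounds check plus a table load; the
  // non-constant default case is kept as the out-of-bounds path.
  static const int Table[4] = {1, 2, 3, 4};

  int lookup(unsigned x) {
    if (x < 4)
      return Table[x];
    std::exit(1);
  }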
This makes us transform 505 more switches in a clang bootstrap, and shaves 164 KB
off the resulting clang binary.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@199025 91177308-0d34-0410-b5e6-96231b3b80d8
1- Use the line_iterator class to read profile files.
2- Allow comments in profile file. Lines starting with '#'
are completely ignored while reading the profile.
3- Add parsing support for discriminators and indirect call samples.
Our external profiler can emit more profile information that we are
currently not handling. This patch does not add new functionality to
support this information, but it allows profile files to provide it.
I will add actual support later on (for at least one of these
features, I need support for DWARF discriminators in Clang).
A sample line may contain the following additional information:
Discriminator. This is used if the sampled program was compiled with
DWARF discriminator support
(http://wiki.dwarfstd.org/index.php?title=Path_Discriminators). This
is currently only emitted by GCC and we just ignore it.
Potential call targets and samples. If present, this line contains a
call instruction. This models both direct and indirect calls. Each
called target is listed together with the number of samples. For
example,
130: 7 foo:3 bar:2 baz:7
The above means that at relative line offset 130 there is a call
instruction that calls one of foo(), bar() and baz(), with baz()
being the most frequent call target.
Differential Revision: http://llvm-reviews.chandlerc.com/D2355
4- Simplify format of profile input file.
This implements earlier suggestions to simplify the format of the
sample profile file. The symbol table is not necessary and function
profiles do not need to know the number of samples in advance.
Differential Revision: http://llvm-reviews.chandlerc.com/D2419
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198973 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a propagation heuristic to convert instruction samples
into branch weights. It implements a similar heuristic to the one
implemented by Dehao Chen on GCC.
The propagation proceeds in 3 phases:
1- Assignment of block weights. All the basic blocks in the function
are initially assigned the same weight as their most frequently
executed instruction.
2- Creation of equivalence classes. Since samples may be missing from
blocks, we can fill in the gaps by setting the weights of all the
blocks in the same equivalence class to the same weight. To compute
the concept of equivalence, we use dominance and loop information.
Two blocks B1 and B2 are in the same equivalence class if B1
dominates B2, B2 post-dominates B1 and both are in the same loop.
3- Propagation of block weights into edges. This uses a simple
propagation heuristic. The following rules are applied to every
block B in the CFG:
- If B has a single predecessor/successor, then the weight
of that edge is the weight of the block.
- If all the edges are known except one, and the weight of the
block is already known, the weight of the unknown edge will
be the weight of the block minus the sum of all the known
edges. If the sum of all the known edges is larger than B's weight,
we set the unknown edge weight to zero.
- If there is a self-referential edge, and the weight of the block is
known, the weight for that edge is set to the weight of the block
minus the weight of the other incoming edges to that block (if
known).
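A sketch of the "all edges are known except one" rule above, using hypothetical
names (the pass itself works on CFG edges and its own weight maps):
  #include <cstdint>
  #include <vector>

  // Infer the weight of the single unknown edge of block B. If the known
  // edges already meet or exceed B's weight, the unknown edge gets zero.
  uint64_t inferUnknownEdgeWeight(uint64_t BlockWeight,
                                  const std::vector<uint64_t> &KnownEdges) {
    uint64_t Sum = 0;
    for (uint64_t W : KnownEdges)
      Sum += W;
    return Sum >= BlockWeight ? 0 : BlockWeight - Sum;
  }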
Since this propagation is not guaranteed to finalize for every CFG, we
only allow it to proceed for a limited number of iterations (controlled
by -sample-profile-max-propagate-iterations). It currently uses the same
GCC default of 100.
Before propagation starts, the pass builds (for each block) a list of
unique predecessors and successors. This is necessary to handle
identical edges in multiway branches. Since we visit all blocks and all
edges of the CFG, it is cleaner to build these lists once at the start
of the pass.
Finally, the patch fixes the computation of relative line locations.
The profiler emits lines relative to the function header. To discover
it, we traverse the compilation unit looking for the subprogram
corresponding to the function. The line number of that subprogram is the
line where the function begins. That becomes line zero for all the
relative locations.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198972 91177308-0d34-0410-b5e6-96231b3b80d8
  for (i = 0; i < N; ++i)
    A[i * Stride1] += B[i * Stride2];
We take loops like this and check that the symbolic strides 'Stride1/2' are one,
and drop to the scalar loop if they are not.
This is currently disabled by default and hidden behind the flag
'enable-mem-access-versioning'.
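Conceptually, the versioned loop looks something like this at the source level
(a sketch only; the runtime checks are actually emitted in IR):
  void addStrided(int *A, int *B, long N, long Stride1, long Stride2) {
    if (Stride1 == 1 && Stride2 == 1) {
      // Vectorizable path: strides proven to be one at runtime.
      for (long i = 0; i < N; ++i)
        A[i] += B[i];
    } else {
      // Fallback: the original scalar loop.
      for (long i = 0; i < N; ++i)
        A[i * Stride1] += B[i * Stride2];
    }
  }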
radar://13075509
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198950 91177308-0d34-0410-b5e6-96231b3b80d8
This doesn't seem to have actually broken anything. It was paranoia
on my part. Trying again now that bots are more stable.
This is a follow up of the r198338 commit that added truncates for
lcssa phi nodes. Sinking the truncates below the phis cleans up the
loop and simplifies subsequent analysis within the indvars pass.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198678 91177308-0d34-0410-b5e6-96231b3b80d8
This is a follow up of the r198338 commit that added truncates for
lcssa phi nodes. Sinking the truncates below the phis cleans up the
loop and simplifies subsequent analysis within the indvars pass.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198654 91177308-0d34-0410-b5e6-96231b3b80d8
Now with a fix for PR18384: ValueHandleBase::ValueIsDeleted.
We need to invalidate SCEV's loop info when we delete a block, even if no values are hoisted.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198631 91177308-0d34-0410-b5e6-96231b3b80d8
This commit was the source of crasher PR18384:
While deleting: label %for.cond127
An asserting value handle still pointed to this value!
UNREACHABLE executed at llvm/lib/IR/Value.cpp:671!
Reverting to get the builders green, feel free to re-land after fixing up.
(Renato has a handy isolated repro if you need it.)
This reverts commit r198478.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198503 91177308-0d34-0410-b5e6-96231b3b80d8
getSCEV for an ashr instruction creates an intermediate zext
expression when it truncates its operand.
The operand is initially inside the loop, so the narrow zext
expression has a non-loop-invariant loop disposition.
LoopSimplify then runs on an outer loop, hoists the ashr operand, and
properly invalidates the SCEVs that are mapped to values.
The SCEV expression for the ashr is now an AddRec with the hoisted
value as the now loop-invariant start value.
The LoopDisposition of this wide value was properly invalidated during
LoopSimplify.
However, if we later get the ashr SCEV again, we again try to create
the intermediate zext expression. We get the same SCEV that we did
earlier, and it is still cached because it was never mapped to a
Value. When we try to create a new AddRec we abort because we're using
the old non-loop-invariant LoopDisposition.
I don't have a solution for this other than to clear LoopDisposition
when LoopSimplify hoists things.
I think the long-term strategy should be to perform LoopSimplify on
all loops before computing SCEV and before running any loop opts on
individual loops. It's possible we may want to rerun LoopSimplify on
individual loops, but it should rarely do anything, so rarely require
invalidating SCEV.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198478 91177308-0d34-0410-b5e6-96231b3b80d8
The loop rerolling pass was failing with an assertion failure from a
failed cast on loops like this:
  void foo(int *A, int *B, int m, int n) {
    for (int i = m; i < n; i += 4) {
      A[i+0] = B[i+0] * 4;
      A[i+1] = B[i+1] * 4;
      A[i+2] = B[i+2] * 4;
      A[i+3] = B[i+3] * 4;
    }
  }
The code was casting the SCEV-expanded code for the new
induction variable to a phi-node. When the loop had a non-constant
lower bound, the SCEV expander would end the code expansion with an
add instead of a phi node and the cast would fail.
It looks like the cast to a phi node was only needed to get the
induction variable value coming from the backedge to compute the end
of loop condition. This patch changes the loop reroller to compare
the induction variable to the number of times the backedge is taken
instead of the iteration count of the loop. In other words, we stop
the loop when the current value of the induction variable ==
IterationCount-1. Previously, the comparison was comparing the
induction variable value from the next iteration == IterationCount.
This problem only seems to occur on 32-bit targets. For some reason,
the loop is not rerolled on 64-bit targets.
PR18290
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198425 91177308-0d34-0410-b5e6-96231b3b80d8
cycles
This allows the value equality check to work even if we don't have a dominator
tree. Also add some more comments.
I was worried about compile time impacts and did not implement reachability but
used the dominance check in the initial patch. The trade-off was that the
dominator tree was required.
The llvm utility function isPotentiallyReachable cuts off the recursive search
after 32 visits. Testing did not show any compile time regressions,
showing my worries were unjustified.
No compile time or performance regressions at O3 -flto -mavx on test-suite +
externals.
Addresses review comments from r198290.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198400 91177308-0d34-0410-b5e6-96231b3b80d8
When widening an IV to remove s/zext, we generally try to eliminate
the original narrow IV. However, LCSSA phi nodes outside the loop were
still using the original IV. Clean this up more aggressively to avoid
redundancy in generated code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198338 91177308-0d34-0410-b5e6-96231b3b80d8
When there are cycles in the value graph we have to be careful interpreting
"Value*" identity as "value" equivalence. We interpret the value of a phi node
as the value of its operands.
When we check for value equivalence now we make sure that the "Value*" dominates
all cycles (phis).
  %0 = phi [%noaliasval, %addr2]
  %l = load %ptr
  %addr1 = gep @a, 0, %l
  %addr2 = gep @a, 0, (%l + 1)
  store %ptr ...
Before this patch we would return NoAlias for (%0, %addr1) which is wrong
because the value of the load is from different iterations of the loop.
Tested on x86_64 -mavx at O3 and O3 -flto with no performance or compile time
regressions.
PR18068
radar://15653794
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198290 91177308-0d34-0410-b5e6-96231b3b80d8
widespread glibc bugs.
The glibc implementation of exp10 has a very serious precision bug in
version 2.15 (and older versions). This is still very widely used (the
current Ubuntu LTS for example uses it) and so it isn't reasonable to
make transforms that produce these functions. This fixes many
miscompiles introduced when we started transforming pow(10.0, ...) into
exp10, and it may have fixed other latent miscompiles where exp10
provided sufficient precision but exp10f did not.
This is all really horrible. The primary bug has been fixed for over
a year and glibc 2.18 works correctly for the test cases I have, but it
will be 2017 before the LTS using 2.15 is no longer supported by Ubuntu
(and thus reasonable for folks to be relying on). =[ We're either going
to need to live without these optimizations, or find a way to switch
behavior more dynamically than using simply the fact that the OS is
"Linux".
To make matters worse, there appears to be significant testing and
fixing of numerous other bugs in the exp10 family of functions right now
in glibc. While those haven't been causing problems I've seen in the
wild, it gives me concerns that we may need to wait until an even later
release of glibc before we can reliably transform code into exp10.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@198093 91177308-0d34-0410-b5e6-96231b3b80d8
Split sadd.with.overflow into add + sadd.with.overflow to allow
analysis and optimization. This should ideally be done after
InstCombine, which can perform code motion (eventually indvars should
run after all canonical instcombines). We want ISEL to recombine the
add and the check, at least on x86.
This is currently under an option for reducing live induction
variables: -liv-reduce. The next step is reducing liveness of IVs that
are live out of the overflow check paths. Once the related
optimizations are fully developed, reviewed and tested, I do expect
this to become default.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197926 91177308-0d34-0410-b5e6-96231b3b80d8
If the Scalarizer scalarized a vector PHI but could not scalarize
all uses of it, it would insert a series of insertelements to reconstruct
the vector PHI value from the scalar ones. The problem was that it would
emit these insertelements immediately after the PHI, even if there were
other PHIs after it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197909 91177308-0d34-0410-b5e6-96231b3b80d8
If we happen to eliminate every case in a switch that has branch
weights, we currently try to create metadata for the one remaining
branch, triggering an assert. Instead, we need to check that the
metadata we're trying to create is sensible.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197791 91177308-0d34-0410-b5e6-96231b3b80d8
A phi node operand or an instruction operand could be a constant expression that
can trap (division). Check that we don't vectorize such cases.
PR16729
radar://15653590
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197449 91177308-0d34-0410-b5e6-96231b3b80d8
through an invoke instruction.
The original patch for this was written by Mark Seaborn, but I've
reworked his test case into the existing returns_twice test case and
implemented the fix by the prior refactoring to actually run the cost
analysis over invoke instructions, and then here fixing our detection of
the returns_twice attribute to work for both calls and invokes. We never
noticed because we never saw an invoke. =[
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197216 91177308-0d34-0410-b5e6-96231b3b80d8
handles terminator instructions.
The inline cost analysis inherited some pretty rough handling of
terminator insts from the original cost analysis, and then made it much,
much worse by factoring all of the important analyses into a separate
instruction visitor. That instruction visitor never visited the
terminator.
This works fine for things like conditional branches, but for many other
things we simply computed The Wrong Value. The first example is
unconditional branches, which should be free but were counted as full
cost. This is most significant for conditional branches where the
condition simplifies and folds during inlining. We paid a 1 instruction
tax on every branch in a straight line specialized path. =[
Oh, we also claimed that the unreachable instruction had cost.
But it gets worse. Let's consider invoke. We never applied the call
penalty. We never accounted for the cost of the arguments. Nope. Worse
still, we didn't handle the *correctness* constraints of not inlining
recursive invokes, or exception throwing returns_twice functions. Oops.
See PR18206. Sadly, PR18206 requires yet another fix, but this
refactoring is at least a huge step in that direction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197215 91177308-0d34-0410-b5e6-96231b3b80d8
GlobalOpt's CleanupConstantGlobalUsers function uses a worklist array to manage
constant users to be visited. The pointers in this array need to be weak
handles because when we delete a constant array, we may also be holding a
pointer to one of its elements (or an element of one of its elements if we're
dealing with an array of arrays) in the worklist.
Fixes PR17347.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@197178 91177308-0d34-0410-b5e6-96231b3b80d8
This avoids creating branch weight metadata of length one when we fold
cases into the default of a switch instruction, which was triggering
an assert.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196845 91177308-0d34-0410-b5e6-96231b3b80d8
This fixes PR17872. This bug can lead to C++ destructors not being
called when they should be, when an exception is thrown.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196711 91177308-0d34-0410-b5e6-96231b3b80d8
Before this change, inlining one "invoke" into an outer "invoke" call
site can lead to the outer landingpad's catch/filter clauses being
copied multiple times into the resulting landingpad. This happens:
* when the inlined function contains multiple "resume" instructions,
because forwardResume() copies the clauses but is called multiple
times;
* when the inlined function contains a "resume" and a "call", because
HandleCallsInBlockInlinedThroughInvoke() copies the clauses but is
redundant with forwardResume().
Fix this by deduplicating the code.
This problem doesn't lead to any incorrect execution; it's only
untidy.
This change will make fixing PR17872 a little easier.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196710 91177308-0d34-0410-b5e6-96231b3b80d8
The test is platform independent, but I don't want to force vector-width, as
that could spoil the pragma test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196539 91177308-0d34-0410-b5e6-96231b3b80d8
The intended behaviour is to force vectorization on the presence
of the flag (either turn on or off), and to continue the behaviour
as expected in its absence. Tests were added to make sure that all
cases are covered in opt. No tests were added in other tools with
the assumption that they should use the PassManagerBuilder in the
same way.
This patch also removes the outdated -late-vectorize flag, which was
on by default and not helping much.
The pragma metadata is being attached to the same place as other loop
metadata, but nothing forbids one from attaching it to a function
(to enable #pragma optimize) or basic blocks (to hint the basic-block
vectorizers), etc. The logic should be the same all around.
Patches to Clang to produce the metadata will be produced after the
initial implementation is agreed upon and committed. Patches to other
vectorizers (such as SLP and BB) will be added once we're happy with
the pass manager changes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196537 91177308-0d34-0410-b5e6-96231b3b80d8
We were creating external uses for scalar values in MustGather entries that also
had a ScalarToTreeEntry (they are also present in a vectorized tuple). This
meant we would keep a value 'alive' both as a scalar and vectorized, causing havoc.
This is not necessary because when we create a MustGather vector we explicitly
create external uses entries for the insertelement instructions of the
MustGather vector elements.
Fixes PR18129.
radar://15582184
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196508 91177308-0d34-0410-b5e6-96231b3b80d8
This patch tries to avoid unrelated changes other than fixing a few
hyphen-related ambiguities and contractions in nearby lines.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196471 91177308-0d34-0410-b5e6-96231b3b80d8
clang enables vectorization at optimization levels > 1 and size level < 2. opt
should behave similarly.
Loop vectorization and SLP vectorization can be disabled with the flags
-disable-(loop/slp)-vectorization.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196294 91177308-0d34-0410-b5e6-96231b3b80d8
The profile file parser needed some tests for its parsing actions.
This adds tests for each of the error messages emitted by the parser.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@196106 91177308-0d34-0410-b5e6-96231b3b80d8
may be removed and optimized in future iterations. Instead we save a list of basic blocks that we need to CSE.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195791 91177308-0d34-0410-b5e6-96231b3b80d8
In signed arithmetic we could end up with an i64 trip count for an i32 phi.
Because it is signed arithmetic we know that this is only defined if the i32
does not wrap. It is therefore safe to truncate the i64 trip count to an i32
value.
Fixes PR18049.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195787 91177308-0d34-0410-b5e6-96231b3b80d8
we generate PHI nodes with multiple entries from the same basic block but
with different values. Enabling CSE on ExtractElement instructions makes sure
that all of the RAUWed instructions are the same.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195773 91177308-0d34-0410-b5e6-96231b3b80d8
Short description.
This issue is about the case of treating pointers as integers.
We treat pointers as different if they reference different address spaces.
At the same time, we treat pointers as equal to integers (of machine address
width). This was a source of false positives. Consider the following case on a 32-bit machine:
  void foo0(i32 addrspace(1)* %p)
  void foo1(i32 addrspace(2)* %p)
  void foo2(i32 %p)
foo0 != foo1, while
foo1 == foo2 and foo0 == foo2.
As you can see, this breaks transitivity. That means that the result depends on
the order in which functions are presented in the module. The following order
causes foo0 and foo1 to be merged: foo2, foo0, foo1.
First, foo0 will be merged with foo2 and then erased. Second, foo1 will be
merged with foo2.
Depending on the order, things we don't expect to be merged could get merged.
The fix:
Never treat a pointer as an integer, except for pointers in address space 0.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195769 91177308-0d34-0410-b5e6-96231b3b80d8
We are going to drop debug info without a version number or with a different
version number, to make sure we don't crash when we see bitcode files with
different debug info metadata format.
Make tests more robust by removing hard-coded metadata numbers in CHECK lines.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195535 91177308-0d34-0410-b5e6-96231b3b80d8
We are going to drop debug info without a version number or with a different
version number, to make sure we don't crash when we see bitcode files with
different debug info metadata format.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195504 91177308-0d34-0410-b5e6-96231b3b80d8
If the beginning of the loop was also the entry block
of the function, branches were inserted to the entry block
which isn't allowed. If this occurs, create a new dummy
function entry block that branches to the start of the loop.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195493 91177308-0d34-0410-b5e6-96231b3b80d8
Instead of permanently outputting "MVLL" as the file checksum, clang
will create gcno and gcda checksums by hashing the destination block
numbers of every arc. This allows for llvm-cov to check if the two gcov
files are synchronized.
Regenerated the test files so they contain the checksum. Also added
negative test to ensure error when the checksums don't match.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195191 91177308-0d34-0410-b5e6-96231b3b80d8
We are slicing an array of Value pointers and process those slices in a loop.
The problem is that we might invalidate a later slice by vectorizing a former
slice.
Use a WeakVH to track the pointer. If the pointer is deleted or RAUW'ed we can
tell.
The test case will only fail when running with libgmalloc.
radar://15498655
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195162 91177308-0d34-0410-b5e6-96231b3b80d8
order of slices of the alloca which have exactly the same size and other
properties. This was found by a perniciously unstable sort
implementation used to flush out buggy uses of the algorithm.
The fundamental idea is that findCommonType should return the best
common type it can find across all of the slices in the range. There
were two bugs here previously:
1) We would accept an integer type smaller than a byte-width multiple,
and if there were different bit-width integer types, we would accept
the first one. This caused an actual failure in the testcase updated
here when the sort order changed.
2) If we found a bad combination of types or a non-load, non-store use
before an integer typed load or store we would bail, but if we found
the integer typed load or store, we would use it. The correct
behavior is to always use an integer typed operation which covers the
partition if one exists.
While a clever debugging sort algorithm found problem #1 in our existing
test cases, I have no useful test case ideas for #2. I spotted it by
inspection when looking at this code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195118 91177308-0d34-0410-b5e6-96231b3b80d8
(except functions marked always_inline).
Functions with 'optnone' must also have 'noinline' so they don't get
inlined into any other function.
Based on work by Andrea Di Biagio.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195046 91177308-0d34-0410-b5e6-96231b3b80d8
In some case the loop exit count computation can overflow. Extend the type to
prevent most of those cases.
The problem is loops like:
  int main ()
  {
    int a = 1;
    char b = 0;
  lbl:
    a &= 4;
    b--;
    if (b) goto lbl;
    return a;
  }
The backedge count is 255. The induction variable type is i8. If we add one to
255 to get the exit count we overflow to zero.
To work around this issue we extend the type of the induction variable to i32 in
the case of i8 and i16.
PR17532
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@195008 91177308-0d34-0410-b5e6-96231b3b80d8
Generally speaking, control flow paths with error reporting calls are cold.
So far, error reporting calls are calls to perror and calls to fprintf,
fwrite, etc. with stderr as the stream. This can be extended in the future.
The primary motivation is to improve block placement (the cold attribute
affects the static branch prediction heuristics).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194943 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a loop rerolling pass: the opposite of (partial) loop unrolling. The
transformation aims to take loops like this:
  for (int i = 0; i < 3200; i += 5) {
    a[i] += alpha * b[i];
    a[i + 1] += alpha * b[i + 1];
    a[i + 2] += alpha * b[i + 2];
    a[i + 3] += alpha * b[i + 3];
    a[i + 4] += alpha * b[i + 4];
  }
and turn them into this:
  for (int i = 0; i < 3200; ++i) {
    a[i] += alpha * b[i];
  }
and loops like this:
  for (int i = 0; i < 500; ++i) {
    x[3*i] = foo(0);
    x[3*i+1] = foo(0);
    x[3*i+2] = foo(0);
  }
and turn them into this:
  for (int i = 0; i < 1500; ++i) {
    x[i] = foo(0);
  }
There are two motivations for this transformation:
1. Code-size reduction (especially relevant, obviously, when compiling for
code size).
2. Providing greater choice to the loop vectorizer (and generic unroller) to
choose the unrolling factor (and a better ability to vectorize). The loop
vectorizer can take vector lengths and register pressure into account when
choosing an unrolling factor, for example, and a pre-unrolled loop limits that
choice. This is especially problematic if the manual unrolling was optimized
for a machine different from the current target.
The current implementation is limited to single basic-block loops only. The
rerolling recognition should work regardless of how the loop iterations are
intermixed within the loop body (subject to dependency and side-effect
constraints), but the significant restriction is that the order of the
instructions in each iteration must be identical. This seems sufficient to
capture all current use cases.
This pass is not currently enabled by default at any optimization level.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194939 91177308-0d34-0410-b5e6-96231b3b80d8
InstCombine, in visitFPTrunc, applies the following optimization to sqrt calls:
(fptrunc (sqrt (fpext x))) -> (sqrtf x)
but does not apply the same optimization to llvm.sqrt. This is a problem
because, to enable vectorization, Clang generates llvm.sqrt instead of sqrt in
fast-math mode, and because this optimization is being applied to sqrt and not
applied to llvm.sqrt, sometimes the fast-math code is slower.
This change makes InstCombine apply this optimization to llvm.sqrt as well.
This fixes the specific problem in PR17758, although the same underlying issue
(optimizations applied to libcalls are not applied to intrinsics) exists for
other optimizations in SimplifyLibCalls.
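At the source level, the pattern in question comes from code like this
(illustrative only; the fold itself happens on the IR):
  #include <cmath>

  // Under fast-math, Clang emits llvm.sqrt for the call below; the
  // fpext/fptrunc pair around the double-precision square root of a float
  // argument can then be folded into a single float square root.
  float root(float x) {
    return (float)std::sqrt((double)x);
  }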
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194935 91177308-0d34-0410-b5e6-96231b3b80d8
When we vectorize a scalar access with no alignment specified, we have to set
the target's abi alignment of the scalar access on the vectorized access.
Using the same alignment of zero would be wrong because most targets will have a
bigger abi alignment for vector types.
This probably fixes PR17878.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194876 91177308-0d34-0410-b5e6-96231b3b80d8
We used to use std::map<IndicesVector, LoadInst*> for OriginalLoads, and when we
try to promote two arguments, they will both write to OriginalLoads, causing the
created loads for the two arguments to have the same original load. And the same
tbaa tag and alignment will be applied to the created loads for both arguments.
The fix is to use std::map<std::pair<Argument*, IndicesVector>, LoadInst*>
for OriginalLoads, so each Argument will write to different parts of the map.
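In C++ terms the keying change looks roughly like this (a sketch with stand-in
types; the real declarations live in ArgumentPromotion):
  #include <cstdint>
  #include <map>
  #include <utility>
  #include <vector>

  struct Argument;   // stand-ins for the LLVM classes
  struct LoadInst;
  using IndicesVector = std::vector<uint64_t>;

  // Before: two arguments with identical index vectors collided on the same
  // entry.
  //   std::map<IndicesVector, LoadInst *> OriginalLoads;

  // After: keying on the (argument, indices) pair keeps each argument's
  // created loads separate.
  std::map<std::pair<Argument *, IndicesVector>, LoadInst *> OriginalLoads;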
PR17906
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194846 91177308-0d34-0410-b5e6-96231b3b80d8
with and without -g.
Adding a test case to make sure that the threshold used in the memory
dependence analysis is respected. The test case also checks that debug
intrinsics are not counted towards this threshold.
Differential Revision: http://llvm-reviews.chandlerc.com/D2141
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194646 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a new scalar pass that reads a file with samples generated
by 'perf' during runtime. The samples read from the profile are
incorporated and emitted as IR metadata reflecting that profile.
The profile file is assumed to have been generated by an external
profile source. The profile information is converted into IR metadata,
which is later used by the analysis routines to estimate block
frequencies, edge weights and other related data.
External profile information files have no fixed format, each profiler
is free to define its own. This includes both the on-disk representation
of the profile and the kind of profile information stored in the file.
A common kind of profile is based on sampling (e.g., perf), which
essentially counts how many times each line of the program has been
executed during the run.
The SampleProfileLoader pass is organized as a scalar transformation.
On startup, it reads the file given in -sample-profile-file to
determine what kind of profile it contains. This file is assumed to
contain profile information for the whole application. The profile
data in the file is read and incorporated into the internal state of
the corresponding profiler.
To facilitate testing, I've organized the profilers to support two file
formats: text and native. The native format is whatever on-disk
representation the profiler wants to use; I think this will mostly be
bitcode files, but it could be anything the profiler needs. To support
this, every profiler must implement the
SampleProfile::loadNative() function.
The text format is mostly meant for debugging. Records are separated by
newlines, but each profiler is free to interpret records as it sees fit.
Profilers must implement the SampleProfile::loadText() function.
Finally, the pass will call SampleProfile::emitAnnotations() for each
function in the current translation unit. This function needs to
translate the loaded profile into IR metadata, which the analyzer will
later be able to use.
This patch implements the first steps towards the above design. I've
implemented a sample-based flat profiler. The format of the profile is
fairly simplistic. Each sampled function contains a list of relative
line locations (from the start of the function) together with a count
representing how many samples were collected at that line during
execution. I generate this profile using perf and a separate converter
tool.
Currently, I have only implemented a text format for these profiles. I
am interested in initial feedback to the whole approach before I send
the other parts of the implementation for review.
This patch implements:
- The SampleProfileLoader pass.
- The base ExternalProfile class with the core interface.
- A SampleProfile sub-class using the above interface. The profiler
generates branch weight metadata on every branch instruction that
matches the profile.
- A text loader class to assist the implementation of
SampleProfile::loadText().
- Basic unit tests for the pass.
Additionally, the patch uses profile information to compute branch
weights based on instruction samples.
This patch converts instruction samples into branch weights. It
does a fairly simplistic conversion:
Given a multi-way branch instruction, it calculates the weight of
each branch based on the maximum sample count gathered from each
target basic block.
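For illustration only (a sketch with made-up counts, using the branch weight
metadata of the time):
  br i1 %cmp, label %if.then, label %if.else, !prof !0
  ...
  !0 = metadata !{metadata !"branch_weights", i32 64, i32 4}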
Note that this assignment of branch weights is somewhat lossy and can be
misleading. If a basic block has more than one incoming branch, all the
incoming branches will get the same weight. In reality, it may be that
only one of them is the most heavily taken branch.
I will adjust this assignment in subsequent patches.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194566 91177308-0d34-0410-b5e6-96231b3b80d8
Constant merge can merge a constant that has implicit alignment with one that
has explicit alignment. Before this change it assumed that the explicit
alignment was higher than the implicit one, causing the result to be
under-aligned in some cases.
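A sketch of the failure mode (ABI alignments assumed typical):
  @a = internal constant <4 x i32> zeroinitializer           ; implicit ABI alignment, e.g. 16
  @b = internal constant <4 x i32> zeroinitializer, align 4  ; explicit alignment 4
If the merged constant blindly keeps the explicit align 4, users of @a that
relied on the larger implicit alignment end up under-aligned; the merged
global must carry the maximum of the two.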
Fixes PR17815.
Patch by Chris Smowton!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194506 91177308-0d34-0410-b5e6-96231b3b80d8
The symptom is that an assertion is triggered. The assertion was added by
me to detect the situation when a value is propagated from dead blocks.
(We could certainly get rid of the assertion; it is safe to do so, because
propagating a value from a dead block to a live join node is certainly ok.)
The root cause of this bug is that edge-splitting is conducted on the fly;
the edge being split could be a dead edge, therefore the block that splits
the critical edge needs to be flagged "dead" as well.
There are three ways to fix this bug:
1) Get rid of the assertion, as mentioned earlier.
2) When a dead edge is split, flag the inserted block "dead".
3) Proactively split the critical edges connecting dead and live blocks when
new dead blocks are revealed.
This fix goes with option 3), adding just two lines of code.
A test case was added by Rafael the other day.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194424 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Consider a GEP of:
i8* getelementptr ({ [2 x i8], i32, i8, [3 x i8] }* @main.c, i32 0, i32 0, i64 0)
If we proceeded to GEP the aforementioned object by 8, we would form a GEP of:
i8* getelementptr ({ [2 x i8], i32, i8, [3 x i8] }* @main.c, i32 0, i32 0, i64 8)
Note that we would index past the end of the first array member, causing an
out-of-bounds access. This is problematic because we might get fooled when
trying to evaluate loads through this GEP, for example when they are based
on an object with a constant initializer where the array is zero.
This fixes PR17732.
Reviewers: nicholas, chandlerc, void
Reviewed By: void
CC: llvm-commits, echristo, void, aemerson
Differential Revision: http://llvm-reviews.chandlerc.com/D2093
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194220 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Michele Scandale!
Rewrite of the functions used to compute the backedge taken count of a
loop on LT and GT comparisons.
I decided to split the handling of the LT and GT cases because the trick
"a > b == -a < -b" in some cases prevents the trip count computation, due to
the multiplication by -1 on the two operands of the comparison. This issue
comes from the conservative computation of the value range of SCEVs: taking
the negative of a SCEV expression that has a small positive range
(e.g. [0,31]), we end up with a SCEV whose value range is the full set.
In the rewritten functions I also tried to better handle the computation of
the maximum backedge taken count when MAX/MIN expressions are used to handle
the cases where no entry guard is found.
Some tests have been modified in order to check the new values correctly
(I checked them manually and, reasoning about possible overflow, the new
values seem correct).
Finally, I added a new test case related to the multiplication by -1 issue
on GT comparisons.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194116 91177308-0d34-0410-b5e6-96231b3b80d8
Due to the previously added overflow checks, we can have a retain/release
relation that is one-directional. This occurs specifically when we run into an
additive overflow, causing us to drop state in only one direction. If that
occurs, we should bail and not optimize that retain/release instead of
asserting.
Apologies for the size of the testcase; it is necessary to trigger the
additive CFG overflow.
rdar://15377890
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194083 91177308-0d34-0410-b5e6-96231b3b80d8
When the elements are extracted from a select on vectors
or a vector select, do the select on the extracted scalars
from the input if there is only one use.
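A sketch of the transform (single use assumed; names are made up):
  %sel = select i1 %c, <4 x float> %a, <4 x float> %b
  %e = extractelement <4 x float> %sel, i32 0
becomes
  %a0 = extractelement <4 x float> %a, i32 0
  %b0 = extractelement <4 x float> %b, i32 0
  %e = select i1 %c, float %a0, float %b0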
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@194013 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r193356, it caused PR17781.
A reduced test case covering this regression has been added to the test suite.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193955 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a SimplifyLibCalls case which converts the special __sinpi and
__cospi calls (float & double variants) into a __sincospi_stret call where
appropriate, to remove duplicated work.
Patch by Tim Northover
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193943 91177308-0d34-0410-b5e6-96231b3b80d8
When the loop vectorizer was part of the SCC inliner pass manager, GVN would
run after the loop vectorizer, followed by instcombine. That way redundancies
(multiple uses) were removed and instcombine could perform scalarization of
the induction variables. Having moved the loop vectorizer later, we no longer
run any form of redundancy elimination before instcombine. This caused
vectorized induction variables to survive that did not before.
On a recent iMac this helps linpack back from 6000Mflops to 7000Mflops.
This should also help lpbench and paq8p.
I ran a Release (without Asserts) build over the test-suite and did not see any
negative impact on compile time.
radar://15339680
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193891 91177308-0d34-0410-b5e6-96231b3b80d8
If we have a pointer to a single-element struct, we can still build wide loads
and stores to it (if there is no padding).
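A sketch of the kind of access this is meant to cover (types are
illustrative):
  %p0 = getelementptr inbounds { float }* %base, i64 0, i32 0
  %p1 = getelementptr inbounds { float }* %base, i64 1, i32 0
Since { float } has no padding, loads through %p0 and %p1 access consecutive
floats and can be combined into a wide load.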
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193860 91177308-0d34-0410-b5e6-96231b3b80d8
When a dependence check fails, we can still try to vectorize loops with
runtime array bounds checks.
This helps linpack to vectorize a loop in dgefa. And we are back to 2x of the
scalar performance on a corei7-avx.
radar://15339680
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193853 91177308-0d34-0410-b5e6-96231b3b80d8
Given that the backend does not handle "invoke asm" correctly ("invoke asm" will be
handled by SelectionDAGBuilder::visitInlineAsm, which does not have the right
setup for LPadToCallSiteMap) and we already made the assumption that inline asm
does not throw in InstCombiner::visitCallSite, we are going to make the same
assumption in Inliner to make sure we don't convert "call asm" to "invoke asm".
If it becomes necessary to add support for "invoke asm" later on, we will need
to modify the backend as well as remove the assumptions that inline asm does
not throw.
Fix rdar://15317907
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193808 91177308-0d34-0410-b5e6-96231b3b80d8
There are two ways one could implement hiding of linkonce_odr symbols in LTO:
* LLVM tells the linker which symbols can be hidden if not used from native
files.
* The linker tells LLVM which symbols are not used from other object files,
but will be put in the dso symbol table if present.
GOLD's API is the second option. It was implemented almost 1:1 in llvm by
passing the list down to internalize.
LLVM already had partial support for the first option. It is also very similar
to how ld64 handles hiding these symbols when *not* doing LTO.
This patch then:
* removes the APIs for the DSO list.
* marks as LTO_SYMBOL_SCOPE_DEFAULT_CAN_BE_HIDDEN all linkonce_odr unnamed_addr
global values and other linkonce_odr values whose address is not used.
* makes the gold plugin responsible for handling the API mismatch.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193800 91177308-0d34-0410-b5e6-96231b3b80d8
Updated a test case that assumed that <2 x double> would vectorize to use
<4 x float>.
radar://15338229
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193574 91177308-0d34-0410-b5e6-96231b3b80d8
By vectorizing a series of srl, or, ... instructions we have obfuscated the
intention so much that the backend does not know how to fold this code away.
radar://15336950
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193573 91177308-0d34-0410-b5e6-96231b3b80d8
Partial fix for PR17459: wrong code at -O3 on x86_64-linux-gnu
(affecting trunk and 3.3)
When SCEV expands a recurrence outside of a loop it attempts to scale
by the stride of the recurrence. Chained recurrences don't work that
way. We could compute binomial coefficients, but would have to
guarantee that the chained AddRecs are in a perfectly reduced form.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193438 91177308-0d34-0410-b5e6-96231b3b80d8
Partial fix for PR17459: wrong code at -O3 on x86_64-linux-gnu
(affecting trunk and 3.3)
ScalarEvolutionNormalization was attempting to normalize by adding and
subtracting strides. Chained recurrences don't work that way.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193437 91177308-0d34-0410-b5e6-96231b3b80d8
This patch teaches GlobalStatus to analyze a call that uses the global value as
a callee, not as an argument.
With this change internalize can handle the common use of linkonce_odr
functions. This reduces the number of linkonce_odr functions in an LTO build of
clang (checked with the emit-llvm gold plugin option) from 1730 to 60.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193436 91177308-0d34-0410-b5e6-96231b3b80d8
The loop vectorizer does not currently understand how to vectorize
extractelement instructions. The existing check, which excluded all
vector-valued instructions, did not catch extractelement instructions because
it checked only the return value. As a result, vectorization would proceed,
producing illegal instructions like this:
%58 = extractelement <2 x i32> %15, i32 0
%59 = extractelement i32 %58, i32 0
where the second extractelement is illegal because its first operand is not a vector.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193434 91177308-0d34-0410-b5e6-96231b3b80d8
Make sure we mark all loops (scalar and vector) when vectorizing,
so that we don't try to vectorize them anymore. Also, set unroll
to 1, since this is what we check for on early exit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193349 91177308-0d34-0410-b5e6-96231b3b80d8
Major steps include:
1) Introduce a not-addr-taken bit-field in GlobalVariable.
2) The GlobalOpt pass sets "not-address-taken" if it proves a global variable
doesn't have its address taken.
3) AA uses this info for disambiguation.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193251 91177308-0d34-0410-b5e6-96231b3b80d8
When a linkonce_odr value that is on the dso list is not unnamed_addr, we can
still look to see if anything is actually using its address. If
not, it is safe to hide it.
This patch implements that by moving GlobalStatus to Transforms/Utils
and using it in Internalize.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193090 91177308-0d34-0410-b5e6-96231b3b80d8
A landing pad can be jumped to only by the unwind edge of an invoke
instruction. If we eliminate a partially redundant load in a landing pad, it
will create a basic block that violates this constraint. It then leads to other
problems down the line if it tries to merge that basic block with the landing
pad. Avoid this by not eliminating the load in a landing pad.
PR17621
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193064 91177308-0d34-0410-b5e6-96231b3b80d8
One optimization simplify-cfg performs is converting switches to lookup
tables when the switch has > 4 cases. This is done by:
1. Finding the max/min case value and calculating the switch case range.
2. Creating a lookup table basic block.
3. Performing a check in the switch's BB to see if the input value is in
the switch's case range. If the input value satisfies said predicate,
branch to the lookup table BB; otherwise branch to the switch's default
destination BB, using the default value as the result.
The conditional check consists of subtracting the min case value of the
table from any input iN value and then ensuring that said value is
unsigned less than the size of the lookup table represented as an iN
value.
If the lookup table is a covered lookup table, the size of the table will be
2^N, which is 0 as an iN value. Thus the comparison will be an `icmp ult` of an
iN value against 0, which is always false, yielding the incorrect result.
This patch fixes the problem by recognizing when we have a covered lookup table
and, if we do, unconditionally jumping to the lookup table BB, since the
covering property of the lookup table implies there is no input value it
cannot handle.
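A hedged illustration with an i2 input (names are made up): a covered switch
has all four i2 cases, so the table size is 4, which truncates to 0 as an i2,
and the emitted guard
  %inrange = icmp ult i2 %idx, 0
is always false, so the lookup table BB is never reached and the default
result is returned instead. With this patch, the branch to the lookup table
BB is made unconditional.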
rdar://15268442
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193045 91177308-0d34-0410-b5e6-96231b3b80d8
If the predecessor's being spliced into a landing pad, then we need the PHIs to
come first and the rest of the predecessor's code to come *after* the landing
pad instruction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@193035 91177308-0d34-0410-b5e6-96231b3b80d8
Before this patch we relied on the order of phi nodes when we looked for phi
nodes of the same type. This could prevent vectorization of cases where a phi
node of a second type appeared in between phi nodes of the first type.
This is important for vectorization of an internal graphics kernel. On the test
suite + external on x86_64 (and on a run on armv7s) it showed no impact on
either performance or compile time.
radar://15024459
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192537 91177308-0d34-0410-b5e6-96231b3b80d8
If the function seen at compile time is not necessarily the one linked into
the binary being built, it is illegal to change the actual arguments
passed to it.
e.g.
--------------------------
void foo(int lol) {
  // foo() has linkage satisfying isWeakForLinker().
  // "lol" is not used at all.
}
void bar(int lol2) {
  // Transforming this call to foo(undef) is illegal, as the compiler does
  // not know which instance of foo() will be linked into the binary being
  // built.
  foo(lol2);
}
-----------------------------
Such functions can be captured by isWeakForLinker(). NOTE that
mayBeOverridden() is insufficient for this purpose, as it doesn't include
linkage types like AvailableExternallyLinkage and LinkOnceODRLinkage.
Take linkonce_odr as an example: it indicates a set of *EQUIVALENT* globals
that can be merged at link time. However, the semantics of *EQUIVALENT*
functions include their parameters; changing the parameters breaks that
assumption.
Thanks to John McCall for help, especially for the explanation of the subtle
differences between linkage types.
rdar://11546243
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192302 91177308-0d34-0410-b5e6-96231b3b80d8
Otherwise, we don't perform operations that would have been performed on
the scalar version.
Fixes PR17498.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192133 91177308-0d34-0410-b5e6-96231b3b80d8
Bitcasting everything to i8* won't work. Autoupgrade the old
intrinsic declarations to use the new mangling.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192117 91177308-0d34-0410-b5e6-96231b3b80d8
is updated to use DITypeRef.
Move isUnsignedDIType and getOriginalTypeSize from DebugInfo.h to be static
helper functions in DwarfCompileUnit. We already have a static helper function
"isTypeSigned" in DwarfCompileUnit, and a pointer to DwarfDebug is added to
resolve the derived-from field. All three functions need to go across link
for derived-from fields, so we need to get hold of a type identifier map.
A pointer to DwarfDebug is also added to DbgVariable in order to resolve the
derived-from field.
Debug info verifier is updated to check a derived-from field is a TypeRef.
Verifier will not go across link for derived-from fields, in debug info finder,
we go across the link to add derived-from fields to types.
Function getDICompositeType is only used by dragonegg and since dragonegg does
not generate identifier for types, we use an empty map to resolve the
derived-from field.
When printing a derived-from field, we use DITypeRef::getName to either return
the type identifier or getName of the DIType.
A paired commit at clang is required due to changes to DIBuilder.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192018 91177308-0d34-0410-b5e6-96231b3b80d8
UpdatePHINodes has an optimization to reuse an existing PHI node, where it
first deletes all of its entries and then replaces them. Unfortunately, in the
case where we had duplicate predecessors (which are allowed so long as the
associated PHI entries have the same value), the loop removing the existing PHI
entries from the to-be-reused PHI would assert (if that PHI was not the one
which had the duplicates).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192001 91177308-0d34-0410-b5e6-96231b3b80d8
Sort the operands of the other entries in the current vectorization root
according to the opcodes of the first entry's operands.
%conv0 = uitofp ...
%load0 = load float ...
= fmul %conv0, %load0
= fmul %load0, %conv1
= fmul %load0, %conv2
Make sure that we recursively vectorize <%conv0, %conv1, %conv2> and <%load0,
%load0, %load0>.
This makes it more likely to obtain vectorizable trees. We have to be careful
when sorting that we don't destroy the 'good' existing ordering implied by
source order.
radar://15080067
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@191977 91177308-0d34-0410-b5e6-96231b3b80d8
on platforms with relocations in debug info and also temporarily
revert r191800 due to conflicts with the revert of r191792.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@191967 91177308-0d34-0410-b5e6-96231b3b80d8
Generalize the API so we can distinguish symbols that are needed just for a DSO
symbol table from those that are used from some native .o.
The symbols that are only wanted for the dso symbol table can be dropped if
llvm can prove every other dso has a copy (linkonce_odr) and the address is not
important (unnamed_addr).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@191922 91177308-0d34-0410-b5e6-96231b3b80d8