llvm-6502/test/Transforms
Chandler Carruth 450b39e971 [SROA] Teach SROA how to much more intelligently handle split loads and
stores.

When there are accesses to an entire alloca with an integer
load or store as well as accesses to small pieces of the alloca, SROA
splits up the large integer accesses. In order to do that, it uses bit
math to merge the small accesses into large integers. While this is
effective, it produces insane IR that can cause significant problems in
the rest of the optimizer:

- It can cause load and store mismatches with GVN on the non-alloca side
  where we end up loading an i64 (or some such) rather than loading
  specific elements that are stored.
- We can't always get rid of the integer bit math, which is why we can't
  always fix the loads and stores to work well with GVN.
- This is especially bad when we have operations that mix poorly with
  integer bit math such as floating point operations.
- It will block things like the vectorizer which might be able to handle
  the scalar stores that underlie the aggregate.
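
To make the "bit math" concrete, here is a minimal C++ sketch of what
merging two float-sized pieces into one 64-bit integer looks like at the
source level. It is an illustration only, not SROA's code or its output:
the helper names (floatBits, bitsToFloat, mergeParts, extractReal,
extractImag) are hypothetical, and placing the real part in the low 32
bits is an assumption made for the example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers: view a float's bits as a 32-bit integer and back.
static std::uint32_t floatBits(float f) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}

static float bitsToFloat(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// The merged form: two float-sized pieces combined into one 64-bit integer
// with a zero-extend, a shift, and an or. Assumption: real part in the low
// 32 bits, imaginary part in the high 32 bits.
std::uint64_t mergeParts(float re, float im) {
  return static_cast<std::uint64_t>(floatBits(re)) |
         (static_cast<std::uint64_t>(floatBits(im)) << 32);
}

// Recovering either float means truncating or shifting the wide integer
// first, so each value passes through integer operations before it can be
// used as a float again.
float extractReal(std::uint64_t wide) {
  return bitsToFloat(static_cast<std::uint32_t>(wide));
}

float extractImag(std::uint64_t wide) {
  return bitsToFloat(static_cast<std::uint32_t>(wide >> 32));
}
```

When the values are only ever used as floats, every shift, or, and
truncate above is pure overhead, which is the situation the list
describes.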

At the same time, we can't just directly split up these loads and stores
in all cases. If there is actual integer arithmetic involved on the
values, then using integer bit math is actually the perfect lowering
because we can often combine it heavily with the surrounding math.

The solution this patch provides is to find places where SROA is
partitioning aggregates into small elements, and look for splittable
loads and stores that it can split all the way to some other adjacent
load and store. In every case I have seen where failing to split the
loads and stores hurts the optimizer, this is the pattern involved, and
I've looked extensively at the code produced both from more and less
aggressive approaches to this problem.

However, it is quite tricky to actually do this in SROA. We may have
loads and stores to the same alloca, or other complex patterns that are
hard to handle. This complexity leads to the somewhat subtle algorithm
implemented here. We have to do this entire process as a separate pass
over the partitioning of the alloca, and split up all of the loads prior
to splitting the stores so that we can safely handle the cases of
overlapping, including partially overlapping, loads and stores to the
same alloca. We also have to reconstitute the post-split slice
configuration so we can avoid iterating again over all the alloca uses
(the slow part of SROA). But we also have to ensure that when we split
up loads and stores to *other* allocas, we *do* re-iterate over them in
SROA to adapt to the more refined partitioning now required.
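
As a rough illustration of the splitting step alone, the sketch below
models an access as a byte range within an alloca and cuts it at the
partition boundaries that fall inside it. This is a toy model under
stated assumptions, not the algorithm in this patch: ByteRange and
splitAtBoundaries are hypothetical names, and it ignores the ordering of
loads before stores, slice reconstitution, and re-iteration over other
allocas.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model: an access covers the byte range [begin, end) of an
// alloca.
using ByteRange = std::pair<std::uint64_t, std::uint64_t>;

// Split one wide access at every partition boundary that falls strictly
// inside it. Assumption: `boundaries` is sorted in ascending order.
std::vector<ByteRange>
splitAtBoundaries(ByteRange access,
                  const std::vector<std::uint64_t> &boundaries) {
  std::vector<ByteRange> pieces;
  std::uint64_t start = access.first;
  for (std::uint64_t b : boundaries) {
    if (b <= start)
      continue; // boundary at or before the current start: nothing to cut
    if (b >= access.second)
      break; // boundary at or past the end of the access: done
    pieces.emplace_back(start, b);
    start = b;
  }
  if (start < access.second)
    pieces.emplace_back(start, access.second); // trailing piece
  return pieces;
}
```

For example, an 8-byte integer load over an alloca partitioned at byte 4
(two floats) would come out as the two ranges [0, 4) and [4, 8), each of
which can then be paired with the corresponding float-sized access.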

With this, I actually think we can fix a long-standing TODO in SROA
where I avoided splitting as many loads and stores as probably should be
splittable. This limitation historically mitigated the fallout of all
the bad things mentioned above. Now that we have more intelligent
handling, I plan to remove the FIXME and more aggressively mark integer
loads and stores as splittable. I'll do that in a follow-up patch to
help with bisecting any fallout.

The net result of this change should be more fine-grained and accurate
scalars being formed out of aggregates. At the very least, Clang now
generates perfect code for this high-level test case using
std::complex<float>:

  #include <complex>

  void g1(std::complex<float> &x, float a, float b) {
    x += std::complex<float>(a, b);
  }
  void g2(std::complex<float> &x, float a, float b) {
    x -= std::complex<float>(a, b);
  }

  void foo(const std::complex<float> &x, float a, float b,
           std::complex<float> &x1, std::complex<float> &x2) {
    std::complex<float> l1 = x;
    g1(l1, a, b);
    std::complex<float> l2 = x;
    g2(l2, a, b);
    x1 = l1;
    x2 = l2;
  }

This code isn't just hypothetical either. It was reduced out of the hot
inner loops of essentially every part of the Eigen math library when
using std::complex<float>. Those loops would consistently and
pervasively hop between the floating point unit and the integer unit due
to bit math extraction and insertion of floating point values that were
"stored" in a 64-bit integer register around the loop backedge.

So far, this change has passed a bootstrap and some other testing with
no issues. That doesn't mean there won't be any though, so I'll be
prepared to help with any fallout. If you see performance swings in
particular, please let me know. I'm very curious what the full impact of
this change will be. Stay tuned for the follow-up to also split more
integer loads and stores.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225061 91177308-0d34-0410-b5e6-96231b3b80d8
2015-01-01 11:54:38 +00:00
ADCE
AddDiscriminators IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
AlignmentFromAssumptions
ArgumentPromotion IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
AtomicExpand/ARM
BBVectorize IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
BranchFolding
CodeExtractor
CodeGenPrepare
ConstantHoisting
ConstantMerge
ConstProp
CorrelatedValuePropagation LazyValueInfo: Actually re-visit partially solved block-values in solveBlockValue() 2014-11-25 17:23:05 +00:00
DeadArgElim IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
DeadStoreElimination IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
EarlyCSE
FunctionAttrs
GCOVProfiling IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
GlobalDCE
GlobalOpt IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
GVN IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
IndVarSimplify Teach ScalarEvolution to exploit min and max expressions when proving 2014-12-15 22:50:15 +00:00
Inline IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
InstCombine InstCombine: fsub nsz 0, X ==> fsub nsz -0.0, X 2014-12-31 22:14:05 +00:00
InstMerge Added 5 more tests related to sink store revision 224247 2014-12-17 08:12:59 +00:00
InstSimplify InstSimplify: Optimize away pointless comparisons 2014-12-20 03:04:38 +00:00
Internalize
IPConstantProp
JumpThreading IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LCSSA [LCSSA] Handle PHI insertion in disjoint loops 2014-12-22 22:35:46 +00:00
LICM Refine the notion of MayThrow in LICM to include a header specific version 2014-12-29 23:00:57 +00:00
LoadCombine
LoopDeletion
LoopIdiom IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LoopReroll
LoopRotate IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LoopSimplify
LoopStrengthReduce IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LoopUnroll IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LoopUnswitch
LoopVectorize Masked Load and Store Intrinsics in loop vectorizer. 2014-12-16 11:50:42 +00:00
LowerAtomic
LowerExpectIntrinsic IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
LowerInvoke
LowerSwitch
Mem2Reg IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
MemCpyOpt
MergeFunc IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
MetaRenamer
ObjCARC IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
PartiallyInlineLibCalls
PhaseOrdering
PruneEH
Reassociate Revert "[Reassociate] As the expression tree is rewritten make sure the operands are" 2014-11-19 23:21:20 +00:00
Reg2Mem
SampleProfile IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
Scalarizer IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
ScalarRepl IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
SCCP
SeparateConstOffsetFromGEP/NVPTX
SimplifyCFG [SimplifyCFG] Revise common code sinking 2014-12-23 08:26:55 +00:00
Sink
SLPVectorizer Revert 224119 "This patch recognizes (+ (+ v0, v1) (+ v2, v3)), reorders them for bundling into vector of loads, 2014-12-17 10:34:27 +00:00
SROA [SROA] Teach SROA how to much more intelligently handle split loads and 2015-01-01 11:54:38 +00:00
StripSymbols IR: Make metadata typeless in assembly 2014-12-15 19:07:53 +00:00
StructurizeCFG StructurizeCFG: Use LoopInfo analysis for better loop detection 2014-12-03 04:28:32 +00:00
TailCallElim Fix tail recursion elimination 2014-11-19 13:32:51 +00:00
TailDup
Util [SwitchLowering] Handle destinations on multiple phi instructions 2014-12-02 18:31:53 +00:00