manually and pass all (now) 4 arguments to the mul libcall. Add a new
ExpandLibCall for just this (copied gratuitously from type legalization).
Fixes rdar://9292577
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129842 91177308-0d34-0410-b5e6-96231b3b80d8
- There is a minor semantic change here (evidenced by the test change) for
Darwin triples that have no version component. I debated changing the default
behavior of isOSVersionLT, but decided it made more sense for triples to be
explicit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129802 91177308-0d34-0410-b5e6-96231b3b80d8
Making use of VFP / NEON floating point multiply-accumulate / subtraction is
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first (4? on Cortex-a8) can cause
additional pipeline stall. So it's frequently better to single codegen
vmul + vadd.
2. A vmla folowed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed vmla is a special case. Obvious issuing back to back RAW
vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoid codegen'ing fp vmla / vmls. This works well enough
but it isn't the optimial solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute a fmul and a fmla.
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Enable these fp vmlx codegen changes for Cortex-A9.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129775 91177308-0d34-0410-b5e6-96231b3b80d8
Add a avoidWriteAfterWrite() target hook to identify register classes that
suffer from write-after-write hazards. For those register classes, try to avoid
writing the same register in two consecutive instructions.
This is currently disabled by default. We should not spill to avoid hazards!
The command line flag -avoid-waw-hazard can be used to enable waw avoidance.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129772 91177308-0d34-0410-b5e6-96231b3b80d8
Ideally, we would match an S-register to its containing D-register, but that
requires arithmetic (divide by 2).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129756 91177308-0d34-0410-b5e6-96231b3b80d8
registers for fast allocation a different way. This has us updating
used registers only when we're using that exact register.
Fixes rdar://9207598
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129711 91177308-0d34-0410-b5e6-96231b3b80d8
the max itself, so it is not easy to write a test case for this, but I added a
test case that would fail if the code in AsmPrinter were removed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129432 91177308-0d34-0410-b5e6-96231b3b80d8
alignment for its type, use the minimum of the specified alignment and the ABI
alignment. This fixes <rdar://problem/9275290>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129428 91177308-0d34-0410-b5e6-96231b3b80d8
Additional fixes:
Do something reasonable for subtargets with generic
itineraries by handle node latency the same as for an empty
itinerary. Now nodes default to unit latency unless an itinerary
explicitly specifies a zero cycle stage or it is a TokenFactor chain.
Original fixes:
UnitsSharePred was a source of randomness in the scheduler: node
priority depended on the queue data structure. I rewrote the recent
VRegCycle heuristics to completely replace the old heuristic without
any randomness. To make the ndoe latency adjustments work, I also
needed to do something a little more reasonable with TokenFactor. I
gave it zero latency to its consumers and always schedule it as low as
possible.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129421 91177308-0d34-0410-b5e6-96231b3b80d8
UnitsSharePred was a source of randomness in the scheduler: node
priority depended on the queue data structure. I rewrote the recent
VRegCycle heuristics to completely replace the old heuristic without
any randomness. To make these heuristic adjustments to node latency work,
I also needed to do something a little more reasonable with TokenFactor. I
gave it zero latency to its consumers and always schedule it as low as
possible.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129383 91177308-0d34-0410-b5e6-96231b3b80d8
induction variable. The preRA scheduler is unaware of induction vars,
so we look for potential "virtual register cycles" instead.
Fixes <rdar://problem/8946719> Bad scheduling prevents coalescing
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@129100 91177308-0d34-0410-b5e6-96231b3b80d8
registers that arise from argument shuffling with the soft float ABI. These
instructions are particularly slow on Cortex A8. This fixes one half of
<rdar://problem/8674845>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128759 91177308-0d34-0410-b5e6-96231b3b80d8
The rematerialized instruction may require a more constrained register class
than the register being spilled. In the test case, the spilled register has been
inflated to the DPR register class, but we are rematerializing a load of the
ssub_0 sub-register which only exists for DPR_VFP2 registers.
The register class is reinflated after spilling, so the conservative choice is
only temporary.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128610 91177308-0d34-0410-b5e6-96231b3b80d8
was lowering them to sext / uxt + mul instructions. Unfortunately the
optimization passes may hoist the extensions out of the loop and separate them.
When that happens, the long multiplication instructions can be broken into
several scalar instructions, causing significant performance issue.
Note the vmla and vmls intrinsics are not added back. Frontend will codegen them
as intrinsics vmull* + add / sub. Also note the isel optimizations for catching
mul + sext / zext are not changed either.
First part of rdar://8832507, rdar://9203134
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128502 91177308-0d34-0410-b5e6-96231b3b80d8
isel lowering to fold the zero-extend's and take advantage of no-stall
back to back vmul + vmla:
vmull q0, d4, d6
vmlal q0, d5, d6
is faster than
vaddl q0, d4, d5
vmovl q1, d6
vmul q0, q0, q1
This allows us to vmull + vmlal for:
f = vmull_u8( vget_high_u8(s), c);
f = vmlal_u8(f, vget_low_u8(s), c);
rdar://9197392
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128444 91177308-0d34-0410-b5e6-96231b3b80d8