registers that arise from argument shuffling with the soft float ABI. These
instructions are particularly slow on Cortex A8. This fixes one half of
<rdar://problem/8674845>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128759 91177308-0d34-0410-b5e6-96231b3b80d8
The rematerialized instruction may require a more constrained register class
than the register being spilled. In the test case, the spilled register has been
inflated to the DPR register class, but we are rematerializing a load of the
ssub_0 sub-register which only exists for DPR_VFP2 registers.
The register class is reinflated after spilling, so the conservative choice is
only temporary.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128610 91177308-0d34-0410-b5e6-96231b3b80d8
was lowering them to sext / uxt + mul instructions. Unfortunately the
optimization passes may hoist the extensions out of the loop and separate them.
When that happens, the long multiplication instructions can be broken into
several scalar instructions, causing significant performance issue.
Note the vmla and vmls intrinsics are not added back. Frontend will codegen them
as intrinsics vmull* + add / sub. Also note the isel optimizations for catching
mul + sext / zext are not changed either.
First part of rdar://8832507, rdar://9203134
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128502 91177308-0d34-0410-b5e6-96231b3b80d8
isel lowering to fold the zero-extend's and take advantage of no-stall
back to back vmul + vmla:
vmull q0, d4, d6
vmlal q0, d5, d6
is faster than
vaddl q0, d4, d5
vmovl q1, d6
vmul q0, q0, q1
This allows us to vmull + vmlal for:
f = vmull_u8( vget_high_u8(s), c);
f = vmlal_u8(f, vget_low_u8(s), c);
rdar://9197392
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128444 91177308-0d34-0410-b5e6-96231b3b80d8
int tries = INT_MAX;
while (tries > 0) {
tries--;
}
The check should be:
subs r4, #1
cmp r4, #0
bgt LBB0_1
The subs can set the overflow V bit when r4 is INT_MAX+1 (which loop
canonicalization apparently does in this case). cmp #0 would have cleared
it while not changing the N and Z bits. Since BGT is dependent on the V
bit, i.e. (N == V) && !Z, it is not safe to eliminate the cmp #0.
rdar://9172742
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@128179 91177308-0d34-0410-b5e6-96231b3b80d8
gun as does. This makes it a lot easier to compare the output of both
as the addresses are now a lot closer.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127972 91177308-0d34-0410-b5e6-96231b3b80d8
plus the test where it used to break.", which broke Clang self-host of a
Debug+Asserts compiler, on OS X.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127763 91177308-0d34-0410-b5e6-96231b3b80d8
v2 = bitcast v1
...
v3 = bitcast v2
...
= v3
=>
v2 = bitcast v1
...
= v1
if v1 and v3 are of in the same register class.
bitcast between i32 and fp (and others) are often not nops since they
are in different register classes. These bitcast instructions are often
left because they are in different basic blocks and cannot be
eliminated by dag combine.
rdar://9104514
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127668 91177308-0d34-0410-b5e6-96231b3b80d8
Also more cleanly separate the ARM vs. Thumb functionality. Previously, the
encoding would be incorrect for some Thumb instructions (the indirect calls).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127637 91177308-0d34-0410-b5e6-96231b3b80d8
Go ahead and add them on when we might want to use them and let
later passes remove them.
Fixes rdar://9118569
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127518 91177308-0d34-0410-b5e6-96231b3b80d8
Optimize trivial branches in CodeGenPrepare, which often get created from the
lowering of objectsize intrinsics. Unfortunately, a number of tests were relying
on llc not optimizing trivial branches, so I had to add an option to allow them
to continue to test what they originally tested.
This fixes <rdar://problem/8785296> and <rdar://problem/9112893>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127498 91177308-0d34-0410-b5e6-96231b3b80d8
lowering of objectsize intrinsics. Unfortunately, a number of tests were relying
on llc not optimizing trivial branches, so I had to add an option to allow them
to continue to test what they originally tested.
This fixes <rdar://problem/8785296> and <rdar://problem/9112893>.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@127459 91177308-0d34-0410-b5e6-96231b3b80d8
The previous codegen for the slow path (when values are in VFP / NEON
registers) was incorrect if the source is NaN.
The new codegen uses NEON vbsl instruction to copy the sign bit. e.g.
vmov.i32 d1, #0x80000000
vbsl d1, d2, d0
If NEON is not available, it uses integer instructions to copy the sign bit.
rdar://9034702
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@126295 91177308-0d34-0410-b5e6-96231b3b80d8
The i64_buildvector test in this file relies on the alignment of i64 and
f64 types being the same, which is true for Darwin but not AAPCS.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@125525 91177308-0d34-0410-b5e6-96231b3b80d8
This
define float @foo(float %x, float %y) nounwind readnone {
entry:
%0 = tail call float @copysignf(float %x, float %y) nounwind readnone
ret float %0
}
Was compiled to:
vmov s0, r1
bic r0, r0, #-2147483648
vmov s1, r0
vcmpe.f32 s0, #0
vmrs apsr_nzcv, fpscr
it lt
vneglt.f32 s1, s1
vmov r0, s1
bx lr
This fails to copy the sign of -0.0f because it's lost during the float to int
conversion. Also, it's sub-optimal when the inputs are in GPR registers.
Now it uses integer and + or operations when it's profitable. And it's correct!
lsrs r1, r1, #31
bfi r0, r1, #31, #1
bx lr
rdar://8984306
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@125357 91177308-0d34-0410-b5e6-96231b3b80d8
The vld1-lane, vld1-dup and vst1-lane instructions do not yet support using
post-increment versions, but all the rest of the NEON load/store instructions
should be handled now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@125014 91177308-0d34-0410-b5e6-96231b3b80d8
Unified EmitTextAttribute for both Asm and Obj emission (.cpu only)
Added necessary cortex-A8 related attrs for codegen compat tests.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@124995 91177308-0d34-0410-b5e6-96231b3b80d8
the load, then it may be legal to transform the load and store to integer
load and store of the same width.
This is done if the target specified the transformation as profitable. e.g.
On arm, this can transform:
vldr.32 s0, []
vstr.32 s0, []
to
ldr r12, []
str r12, []
rdar://8944252
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@124708 91177308-0d34-0410-b5e6-96231b3b80d8
1. Fixed ARM pc adjustment.
2. Fixed dynamic-no-pic codegen
3. CSE of pc-relative load of global addresses.
It's now enabled by default for Darwin.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123991 91177308-0d34-0410-b5e6-96231b3b80d8
DAG. Disable using "-disable-sched-cycles".
For ARM, this enables a framework for modeling the cpu pipeline and
counting stalls. It also activates several heuristics to drive
scheduling based on the model. Scheduling is inherently imprecise at
this stage, and until spilling is improved it may defeat attempts to
schedule. However, this framework provides greater control over
tuning codegen.
Although the flag is not target-specific, it should have very little
affect on the default scheduler used by x86. The only two changes that
affect x86 are:
- scheduling a high-latency operation bumps the current cycle so independent
operations can have their latency covered. i.e. two independent 4
cycle operations can produce results in 4 cycles, not 8 cycles.
- Two operations with equal register pressure impact and no
latency-based stalls on their uses will be prioritized by depth before height
(height is irrelevant if no stalls occur in the schedule below this point).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123971 91177308-0d34-0410-b5e6-96231b3b80d8
flags. They are still not enable in this revision.
Added TargetInstrInfo::isZeroCost() to fix a fundamental problem with
the scheduler's model of operand latency in the selection DAG.
Generalized unit tests to work with sched-cycles.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123969 91177308-0d34-0410-b5e6-96231b3b80d8
value, the "add pc" must be CSE'ed at the same time. We could follow the same
approach as T2 by adding pseudo instructions that combine the ldr + "add pc".
But the better approach is to use movw + movt (which I will enable soon), so
I'll leave this as a TODO.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123949 91177308-0d34-0410-b5e6-96231b3b80d8
TargetInstrInfo:
Change produceSameValue() to take MachineRegisterInfo as an optional argument.
When in SSA form, targets can use it to make more aggressive equality analysis.
Machine LICM:
1. Eliminate isLoadFromConstantMemory, use MI.isInvariantLoad instead.
2. Fix a bug which prevent CSE of instructions which are not re-materializable.
3. Use improved form of produceSameValue.
ARM:
1. Teach ARM produceSameValue to look pass some PIC labels.
2. Look for operands from different loads of different constant pool entries
which have same values.
3. Re-implement PIC GA materialization using movw + movt. Combine the pair with
a "add pc" or "ldr [pc]" to form pseudo instructions. This makes it possible
to re-materialize the instruction, allow machine LICM to hoist the set of
instructions out of the loop and make it possible to CSE them. It's a bit
hacky, but it significantly improve code quality.
4. Some minor bug fixes as well.
With the fixes, using movw + movt to materialize GAs significantly outperform the
load from constantpool method. 186.crafty and 255.vortex improved > 20%, 254.gap
and 176.gcc ~10%.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123905 91177308-0d34-0410-b5e6-96231b3b80d8
with an invalid type then split the result and perform the overflow check
normally.
Fixes the 32-bit parts of rdar://8622122 and rdar://8774702.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123864 91177308-0d34-0410-b5e6-96231b3b80d8
movw r0, :lower16:(L_foo$non_lazy_ptr-(LPC0_0+4))
movt r0, :upper16:(L_foo$non_lazy_ptr-(LPC0_0+4))
LPC0_0:
add r0, pc, r0
It's not yet enabled by default as some tests are failing. I suspect bugs in
down stream tools.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@123619 91177308-0d34-0410-b5e6-96231b3b80d8
Also fix an off-by-one in SelectionDAGBuilder that was preventing shuffle
vectors from being translated to EXTRACT_SUBVECTOR.
Patch by Tim Northover.
The test changes are needed to keep those spill-q tests from testing aligned
spills and restores. If the only aligned stack objects are spill slots, we
no longer realign the stack frame. Prior to this patch, an EXTRACT_SUBVECTOR
was legalized by loading from the stack, which created an aligned frame index.
Now, however, there is nothing except the spill slot in the stack frame, so
I added an aligned alloca.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@122995 91177308-0d34-0410-b5e6-96231b3b80d8
If the basic block containing the BCCi64 (or BCCZi64) instruction ends with
an unconditional branch, that branch needs to be deleted before appending
the expansion of the BCCi64 to the end of the block.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@122521 91177308-0d34-0410-b5e6-96231b3b80d8
Type legalization splits up i64 values into pairs of i32 values, which leads
to poor quality code when inserting or extracting i64 vector elements.
If the vector element is loaded or stored, it can be treated as an f64 value
and loaded or stored directly from a VPR register. Use the pre-legalization
DAG combiner to cast those vector elements to f64 types so that the type
legalizer won't mess them up. Radar 8755338.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@122319 91177308-0d34-0410-b5e6-96231b3b80d8
may be called. If the entry block is empty, the insertion point iterator will be
the "end()" value. Calling ->getParent() on it (among others) causes problems.
Modify materializeFrameBaseRegister to take the machine basic block and insert
the frame base register at the beginning of that block. (It's very similar to
what the code does all ready. The only difference is that it will always insert
at the beginning of the entry block instead of after a previous materialization
of the frame base register. I doubt that that matters here.)
<rdar://problem/8782198>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@122104 91177308-0d34-0410-b5e6-96231b3b80d8
BUILD_VECTOR operands where the element type is not legal. I had previously
changed this code to insert TRUNCATE operations, but that was just wrong.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@122102 91177308-0d34-0410-b5e6-96231b3b80d8
Clang is now providing intrinsics for these and so we need to support them
in the backend. Radar 8068427.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121902 91177308-0d34-0410-b5e6-96231b3b80d8
Alignments smaller than the total size of the memory being loaded or stored,
unless the alignment is 8 bytes, are not allowed. Add tests for this, too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121506 91177308-0d34-0410-b5e6-96231b3b80d8
Otherwise, a plain str/ldr should be used instead. Make sure we account for
that in prologue/epilogue code generation.
rdar://8745460
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121391 91177308-0d34-0410-b5e6-96231b3b80d8
Added test to check bl __aeabi_read_tp gets emitted properly for ELF/ASM
as well as ELF/OBJ (including fixup)
Also added support for ELF::R_ARM_TLS_IE32
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121312 91177308-0d34-0410-b5e6-96231b3b80d8
vpush instructions to save / restore VFP / NEON registers like this:
vpush {d8,d10,d11}
vpop {d8,d10,d11}
vpush and vpop do not allow gaps in the register list.
rdar://8728956
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121197 91177308-0d34-0410-b5e6-96231b3b80d8
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first (4? on Cortex-a8) can cause
additional pipeline stall. So it's frequently better to single codegen
vmul + vadd.
2. A vmla folowed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed vmla is a special case. Obvious issuing back to back RAW
vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoid codegen'ing fp vmla / vmls. This works well enough
but it isn't the optimial solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute a fmul and a fmla.
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Work in progress, only A+B are enabled.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120960 91177308-0d34-0410-b5e6-96231b3b80d8
Lifted adjustFixupValue() from Darwin for sharing w ELF.
Test added
TODO:
refactor ELFObjectWriter::RecordRelocation more.
Possibly share more code with Darwin?
Lots more relocations...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120534 91177308-0d34-0410-b5e6-96231b3b80d8
legalization time. Since at legalization time there is no mapping from
SDNode back to the corresponding LLVM instruction and the return
SDNode is target specific, this requires a target hook to check for
eligibility. Only x86 and ARM support this form of sibcall optimization
right now.
rdar://8707777
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120501 91177308-0d34-0410-b5e6-96231b3b80d8
We need to check if the individual vector elements are sign/zero-extended
values. For now this only handles constants values. Radar 8687140.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120034 91177308-0d34-0410-b5e6-96231b3b80d8
state. Previously Thumb2 would restore sp from fp like this:
mov sp, r7
sub, sp, #4
If an interrupt is taken after the 'mov' but before the 'sub', callee-saved
registers might be clobbered by the interrupt handler. Instead, try
restoring directly from sp:
add sp, #4
Or, if necessary (with VLA, etc.) use a scratch register to compute sp and
then restore it:
sub.w r4, r7, #8
mov sp, r7
rdar://8465407
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119977 91177308-0d34-0410-b5e6-96231b3b80d8
Remove movePastCSLoadStoreOps and associated code for simple pointer
increments. Update routines that depended upon other opcodes for save/restore.
Adjust all testcases accordingly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119725 91177308-0d34-0410-b5e6-96231b3b80d8
appear to differ on Linux. Try to make them pass on Linux.
Would be good for a Linux person to review this.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119572 91177308-0d34-0410-b5e6-96231b3b80d8
It is generally not sufficient to check if the starting offset is in range
of the maximum offset that can be efficiently used for the target.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119565 91177308-0d34-0410-b5e6-96231b3b80d8
This makes it more clear that the symbol is an internal, compiler-generated
name and gives a little more description about its contents.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119564 91177308-0d34-0410-b5e6-96231b3b80d8
It was mistakenly looking at the pointer type when checking for the size of
global variables. This is a partial fix for Radar 8673120.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119563 91177308-0d34-0410-b5e6-96231b3b80d8
and xor. The 32-bit move immediates can be hoisted out of loops by machine
LICM but the isel hacks were preventing them.
Instead, let peephole optimization pass recognize registers that are defined by
immediates and the ARM target hook will fold the immediates in.
Other changes include 1) do not fold and / xor into cmp to isel TST / TEQ
instructions if there are multiple uses. This happens when the 'and' is live
out, machine sink would have sinked the computation and that ends up pessimizing
code. The peephole pass would recognize situations where the 'and' can be
toggled to define CPSR and eliminate the comparison anyway.
2) Move peephole pass to after machine LICM, sink, and CSE to avoid blocking
important optimizations.
rdar://8663787, rdar://8241368
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119548 91177308-0d34-0410-b5e6-96231b3b80d8
The live range of a register defined by an early clobber starts at the use slot,
not the def slot.
Except when it is an early clobber tied to a use operand. Then it starts at the
def slot like a standard def.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119305 91177308-0d34-0410-b5e6-96231b3b80d8
live ranges for the spill register are also defined at the use slot instead of
the normal def slot.
This fixes PR8612 for the inline spiller. A use was being allocated to the same
register as a spilled early clobber def.
This problem exists in all the spillers. A fix for the standard spiller is
forthcoming.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@119182 91177308-0d34-0410-b5e6-96231b3b80d8
We could be more aggressive about making this work for a larger range of constants,
but this seems like a good start.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@118201 91177308-0d34-0410-b5e6-96231b3b80d8
1. Fix pre-ra scheduler so it doesn't try to push instructions above calls to
"optimize for latency". Call instructions don't have the right latency and
this is more likely to use introduce spills.
2. Fix if-converter cost function. For ARM, it should use instruction latencies,
not # of micro-ops since multi-latency instructions is completely executed
even when the predicate is false. Also, some instruction will be "slower"
when they are predicated due to the register def becoming implicit input.
rdar://8598427
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@118135 91177308-0d34-0410-b5e6-96231b3b80d8
at more than those which define CPSR. You can have this situation:
(1) subs ...
(2) sub r6, r5, r4
(3) movge ...
(4) cmp r6, 0
(5) movge ...
We cannot convert (2) to "subs" because (3) is using the CPSR set by
(1). There's an analogous situation here:
(1) sub r1, r2, r3
(2) sub r4, r5, r6
(3) cmp r4, ...
(5) movge ...
(6) cmp r1, ...
(7) movge ...
We cannot convert (1) to "subs" because of the intervening use of CPSR.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@117950 91177308-0d34-0410-b5e6-96231b3b80d8