regB = move RCX
regA = op regB, regC
RAX = move regA
where both regB and regC are killed. If regB is constrainted to non-compatible
physical registers but regC is not constrainted at all, then it's better to
commute the instruction.
movl %edi, %eax
shlq $32, %rcx
leaq (%rcx,%rax), %rax
=>
movl %edi, %eax
shlq $32, %rcx
orq %rcx, %rax
rdar://8762995
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121793 91177308-0d34-0410-b5e6-96231b3b80d8
Use the same COPY_TO_REGCLASS approach as for the 2-register *_sfp instructions.
This change made a big difference in the code generated for the
CodeGen/Thumb2/cross-rc-coalescing-2.ll test: The coalescer is still doing
a fine job, but some instructions that were previously moved outside the loop
are not moved now. It's using fewer VFP registers now, which is generally
a good thing, so I think the estimates for register pressure changed and that
affected the LICM behavior. Since that isn't obviously wrong, I've just
changed the test file. This completes the work for Radar 8711675.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121730 91177308-0d34-0410-b5e6-96231b3b80d8
when the wider type is legal. This allows us to compile:
define zeroext i16 @test1(i16 zeroext %x) nounwind {
entry:
%div = udiv i16 %x, 33
ret i16 %div
}
into:
test1: # @test1
movzwl 4(%esp), %eax
imull $63551, %eax, %eax # imm = 0xF83F
shrl $21, %eax
ret
instead of:
test1: # @test1
movw $-1985, %ax # imm = 0xFFFFFFFFFFFFF83F
mulw 4(%esp)
andl $65504, %edx # imm = 0xFFE0
movl %edx, %eax
shrl $5, %eax
ret
Implementing rdar://8760399 and example #4 from:
http://blog.regehr.org/archives/320
We should implement the same thing for [su]mul_hilo, but I don't
have immediate plans to do this.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121696 91177308-0d34-0410-b5e6-96231b3b80d8
Alignments smaller than the total size of the memory being loaded or stored,
unless the alignment is 8 bytes, are not allowed. Add tests for this, too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121506 91177308-0d34-0410-b5e6-96231b3b80d8
Otherwise, a plain str/ldr should be used instead. Make sure we account for
that in prologue/epilogue code generation.
rdar://8745460
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121391 91177308-0d34-0410-b5e6-96231b3b80d8
the output to the correct register. Fixes a hidden problem uncovered
by the last patch where we'd try to DAG combine our MVT::Other node
oddly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121358 91177308-0d34-0410-b5e6-96231b3b80d8
Added test to check bl __aeabi_read_tp gets emitted properly for ELF/ASM
as well as ELF/OBJ (including fixup)
Also added support for ELF::R_ARM_TLS_IE32
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121312 91177308-0d34-0410-b5e6-96231b3b80d8
vpush instructions to save / restore VFP / NEON registers like this:
vpush {d8,d10,d11}
vpop {d8,d10,d11}
vpush and vpop do not allow gaps in the register list.
rdar://8728956
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@121197 91177308-0d34-0410-b5e6-96231b3b80d8
difficult on current ARM implementations for a few reasons.
1. Even though a single vmla has latency that is one cycle shorter than a pair
of vmul + vadd, a RAW hazard during the first (4? on Cortex-a8) can cause
additional pipeline stall. So it's frequently better to single codegen
vmul + vadd.
2. A vmla folowed by a vmul, vmadd, or vsub causes the second fp instruction to
stall for 4 cycles. We need to schedule them apart.
3. A vmla followed vmla is a special case. Obvious issuing back to back RAW
vmla + vmla is very bad. But this isn't ideal either:
vmul
vadd
vmla
Instead, we want to expand the second vmla:
vmla
vmul
vadd
Even with the 4 cycle vmul stall, the second sequence is still 2 cycles
faster.
Up to now, isel simply avoid codegen'ing fp vmla / vmls. This works well enough
but it isn't the optimial solution. This patch attempts to make it possible to
use vmla / vmls in cases where it is profitable.
A. Add missing isel predicates which cause vmla to be codegen'ed.
B. Make sure the fmul in (fadd (fmul)) has a single use. We don't want to
compute a fmul and a fmla.
C. Add additional isel checks for vmla, avoid cases where vmla is feeding into
fp instructions (except for the #3 exceptional case).
D. Add ARM hazard recognizer to model the vmla / vmls hazards.
E. Add a special pre-regalloc case to expand vmla / vmls when it's likely the
vmla / vmls will trigger one of the special hazards.
Work in progress, only A+B are enabled.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120960 91177308-0d34-0410-b5e6-96231b3b80d8
result. This allows us to compile:
void *test12(long count) {
return new int[count];
}
into:
test12:
movl $4, %ecx
movq %rdi, %rax
mulq %rcx
movq $-1, %rdi
cmovnoq %rax, %rdi
jmp __Znam ## TAILCALL
instead of:
test12:
movl $4, %ecx
movq %rdi, %rax
mulq %rcx
seto %cl
testb %cl, %cl
movq $-1, %rdi
cmoveq %rax, %rdi
jmp __Znam
Of course it would be even better if the regalloc inverted the cmov to 'cmovoq',
which would eliminate the need for the 'movq %rdi, %rax'.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120936 91177308-0d34-0410-b5e6-96231b3b80d8
backend that they were all implemented except umul. This one fell back
to the default implementation that did a hi/lo multiply and compared the
top. Fix this to check the overflow flag that the 'mul' instruction
sets, so we can avoid an explicit test. Now we compile:
void *func(long count) {
return new int[count];
}
into:
__Z4funcl: ## @_Z4funcl
movl $4, %ecx ## encoding: [0xb9,0x04,0x00,0x00,0x00]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
mulq %rcx ## encoding: [0x48,0xf7,0xe1]
seto %cl ## encoding: [0x0f,0x90,0xc1]
testb %cl, %cl ## encoding: [0x84,0xc9]
movq $-1, %rdi ## encoding: [0x48,0xc7,0xc7,0xff,0xff,0xff,0xff]
cmoveq %rax, %rdi ## encoding: [0x48,0x0f,0x44,0xf8]
jmp __Znam ## TAILCALL
instead of:
__Z4funcl: ## @_Z4funcl
movl $4, %ecx ## encoding: [0xb9,0x04,0x00,0x00,0x00]
movq %rdi, %rax ## encoding: [0x48,0x89,0xf8]
mulq %rcx ## encoding: [0x48,0xf7,0xe1]
testq %rdx, %rdx ## encoding: [0x48,0x85,0xd2]
movq $-1, %rdi ## encoding: [0x48,0xc7,0xc7,0xff,0xff,0xff,0xff]
cmoveq %rax, %rdi ## encoding: [0x48,0x0f,0x44,0xf8]
jmp __Znam ## TAILCALL
Other than the silly seto+test, this is using the o bit directly, so it's going in the right
direction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120935 91177308-0d34-0410-b5e6-96231b3b80d8
- Also adds a new POPCNT subtarget feature that is currently enabled if the target
supports SSE4.2 (nehalem) or SSE4A (barcelona).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120917 91177308-0d34-0410-b5e6-96231b3b80d8
Lifted adjustFixupValue() from Darwin for sharing w ELF.
Test added
TODO:
refactor ELFObjectWriter::RecordRelocation more.
Possibly share more code with Darwin?
Lots more relocations...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120534 91177308-0d34-0410-b5e6-96231b3b80d8
legalization time. Since at legalization time there is no mapping from
SDNode back to the corresponding LLVM instruction and the return
SDNode is target specific, this requires a target hook to check for
eligibility. Only x86 and ARM support this form of sibcall optimization
right now.
rdar://8707777
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@120501 91177308-0d34-0410-b5e6-96231b3b80d8