for all the processors where I have tried it, and even when it might not help
performance, the cost is quite low. The opportunities for duplicating
indirect branches are limited by other factors so code size does not change
much due to tail duplicating indirect branches aggressively.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@90144 91177308-0d34-0410-b5e6-96231b3b80d8
Make tail duplication of indirect branches much more aggressive (for targets
that indicate that it is profitable), based on further experience with
this transformation. I compiled 3 large applications with and without
this more aggressive tail duplication and measured minimal changes in code
size. ("size" on Darwin seems to round the text size up to the nearest
page boundary, so I can only say that any code size increase was less than
one 4k page.) Radar 7421267.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@89814 91177308-0d34-0410-b5e6-96231b3b80d8
than doing the same via constpool:
1. Load from constpool costs 3 cycles on A9, movt/movw pair - just 2.
2. Load from constpool might stall up to 300 cycles due to cache miss.
3. Movt/movw does not use load/store unit.
4. Less constpool entries => better compiler performance.
This is only enabled on ELF systems, since darwin does not have needed
relocations (yet).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@89720 91177308-0d34-0410-b5e6-96231b3b80d8
contents of the block to be duplicated. Use this for ARM Cortex A8/9 to
be more aggressive tail duplicating indirect branches, since it makes it
much more likely that they will be predicted in the branch target buffer.
Testcase coming soon.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@89187 91177308-0d34-0410-b5e6-96231b3b80d8
- If destination is a physical register and it has a subreg index, use the
sub-register instead.
This fixes PR5423.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@88745 91177308-0d34-0410-b5e6-96231b3b80d8
load of a GV from constantpool and then add pc. It allows the code sequence to
be rematerializable so it would be hoisted by machine licm.
- Add a late pass to break these pseudo instructions into a number of real
instructions. Also move the code in Thumb2 IT pass that breaks up t2MOVi32imm
to this pass. This is done before post regalloc scheduling to allow the
scheduler to proper schedule these instructions. It also allow them to be
if-converted and shrunk by later passes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@86304 91177308-0d34-0410-b5e6-96231b3b80d8
instruction. This makes it re-materializable.
Thumb2 will split it back out into two instructions so IT pass will generate the
right mask. Also, this expose opportunies to optimize the movw to a 16-bit move.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@82982 91177308-0d34-0410-b5e6-96231b3b80d8
the only real caller (GetFunctionSizeInBytes) uses it.
The custom ARM implementation of this is basically reimplementing
an assembler poorly for negligible gain. It should be removed
IMNSHO, but I'll leave that to ARMish folks to decide.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@77877 91177308-0d34-0410-b5e6-96231b3b80d8
- This change also makes it possible to switch between ARM / Thumb on a
per-function basis.
- Fixed thumb2 routine which expand reg + arbitrary immediate. It was using
using ARM so_imm logic.
- Use movw and movt to do reg + imm when profitable.
- Other code clean ups and minor optimizations.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@77300 91177308-0d34-0410-b5e6-96231b3b80d8
This also fixes potential problems in ARMBaseInstrInfo routines not recognizing thumb1 instructions when 32-bit and 16-bit instructions mix.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@77218 91177308-0d34-0410-b5e6-96231b3b80d8
Before:
adr r12, #LJTI3_0_0
ldr pc, [r12, +r0, lsl #2]
LJTI3_0_0:
.long LBB3_24
.long LBB3_30
.long LBB3_31
.long LBB3_32
After:
adr r12, #LJTI3_0_0
add pc, r12, +r0, lsl #2
LJTI3_0_0:
b.w LBB3_24
b.w LBB3_30
b.w LBB3_31
b.w LBB3_32
This has several advantages.
1. This will make it easier to optimize this to a TBB / TBH instruction +
(smaller) table.
2. This eliminate the need for ugly asm printer hack to force the address
into thumb addresses (bit 0 is one).
3. Same codegen for pic and non-pic.
4. This eliminate the need to align the table so constantpool island pass
won't have to over-estimate the size.
Based on my calculation, the later is probably slightly faster as well since
ldr pc with shifter address is very slow. That is, it should be a win as long
as the HW implementation can do a reasonable job of branch predict the second
branch.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@77024 91177308-0d34-0410-b5e6-96231b3b80d8