- Revise previous patches of the same purpose by fixing
*) grep <PA> | not grep <PB> semantically is not the same as
CHECK: <PA>{{^<PB>.*$}} as the former will check all occurrences of <PA>
while the later only check the first match. As the result, CHECK needs
putting in all place where <PA> occurs.
*) grep <PA> | count <N> needs a final CHECK-NOT of the same pattern.
(As 'CHECK-<N>' is proposed for discussion, converting 'grep | count <N>'
where N > 1 is postponed.)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180742 91177308-0d34-0410-b5e6-96231b3b80d8
We need to intialize this to something and since clang does not set
the shader type attribute and clang is used only for compute shaders,
initializing it to COMPUTE seems like the best choice.
Reviewed-by: Christian König <christian.koenig@amd.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180620 91177308-0d34-0410-b5e6-96231b3b80d8
This already helps SSE2 x86 a lot because it lacks an efficient way to
represent a vector select. The long term goal is to enable the backend to match
a canonicalized pattern into a single instruction (e.g. vabs or pabs).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180597 91177308-0d34-0410-b5e6-96231b3b80d8
latency for certain models of the Intel Atom family, by converting
instructions into their equivalent LEA instructions, when it is both
useful and possible to do so.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180573 91177308-0d34-0410-b5e6-96231b3b80d8
For now, we just reschedule instructions that use the copied vregs and
let regalloc elliminate it. I would really like to eliminate the
copies on-the-fly during scheduling, but we need a complete
implementation of repairIntervalsInRange() first.
The general strategy is for the register coalescer to eliminate as
many global copies as possible and shrink live ranges to be
extended-basic-block local. The coalescer should not have to worry
about resolving local copies (e.g. it shouldn't attemp to reorder
instructions). The scheduler is a much better place to deal with local
interference. The coalescer side of this equation needs work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180193 91177308-0d34-0410-b5e6-96231b3b80d8
-- C.4 and C.5 statements, when NSAA is not equal to SP.
-- C.1.cp statement for VA functions. Note: There are no VFP CPRCs in a
variadic procedure.
Before this patch "NSAA != 0" means "don't use GPRs anymore ". But there are
some exceptions in AAPCS.
1. For non VA function: allocate all VFP regs for CPRC. When all VFPs are allocated
CPRCs would be sent to stack, while non CPRCs may be still allocated in GRPs.
2. Check that for VA functions all params uses GPRs and then stack.
No exceptions, no CPRCs here.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@180011 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r179840 with a fix to test/DebugInfo/two-cus-from-same-file.ll
I'm not sure why that test only failed on ARM & MIPS and not X86 Linux, even
though the debug info was clearly invalid on all of them, but this ought to fix
it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179996 91177308-0d34-0410-b5e6-96231b3b80d8
Rather than just splitting the input type and hoping for the best, apply
a bit more cleverness. Just splitting the types until the source is
legal often leads to an illegal result time, which is then widened and a
scalarization step is introduced which leads to truly horrible code
generation. With the loop vectorizer, these sorts of operations are much
more common, and so it's worth extra effort to do them well.
Add a legalization hook for the operands of a TRUNCATE node, which will
be encountered after the result type has been legalized, but if the
operand type is still illegal. If simple splitting of both types
ends up with the result type of each half still being legal, just
do that (v16i16 -> v16i8 on ARM, for example). If, however, that would
result in an illegal result type (v8i32 -> v8i8 on ARM, for example),
we can get more clever with power-two vectors. Specifically,
split the input type, but also widen the result element size, then
concatenate the halves and truncate again. For example on ARM,
To perform a "%res = v8i8 trunc v8i32 %in" we transform to:
%inlo = v4i32 extract_subvector %in, 0
%inhi = v4i32 extract_subvector %in, 4
%lo16 = v4i16 trunc v4i32 %inlo
%hi16 = v4i16 trunc v4i32 %inhi
%in16 = v8i16 concat_vectors v4i16 %lo16, v4i16 %hi16
%res = v8i8 trunc v8i16 %in16
This allows instruction selection to generate three VMOVN instructions
instead of a sequences of moves, stores and loads.
Update the ARMTargetTransformInfo to take this improved legalization
into account.
Consider the simplified IR:
define <16 x i8> @test1(<16 x i32>* %ap) {
%a = load <16 x i32>* %ap
%tmp = trunc <16 x i32> %a to <16 x i8>
ret <16 x i8> %tmp
}
define <8 x i8> @test2(<8 x i32>* %ap) {
%a = load <8 x i32>* %ap
%tmp = trunc <8 x i32> %a to <8 x i8>
ret <8 x i8> %tmp
}
Previously, we would generate the truly hideous:
.syntax unified
.section __TEXT,__text,regular,pure_instructions
.globl _test1
.align 2
_test1: @ @test1
@ BB#0:
push {r7}
mov r7, sp
sub sp, sp, #20
bic sp, sp, #7
add r1, r0, #48
add r2, r0, #32
vld1.64 {d24, d25}, [r0:128]
vld1.64 {d16, d17}, [r1:128]
vld1.64 {d18, d19}, [r2:128]
add r1, r0, #16
vmovn.i32 d22, q8
vld1.64 {d16, d17}, [r1:128]
vmovn.i32 d20, q9
vmovn.i32 d18, q12
vmov.u16 r0, d22[3]
strb r0, [sp, #15]
vmov.u16 r0, d22[2]
strb r0, [sp, #14]
vmov.u16 r0, d22[1]
strb r0, [sp, #13]
vmov.u16 r0, d22[0]
vmovn.i32 d16, q8
strb r0, [sp, #12]
vmov.u16 r0, d20[3]
strb r0, [sp, #11]
vmov.u16 r0, d20[2]
strb r0, [sp, #10]
vmov.u16 r0, d20[1]
strb r0, [sp, #9]
vmov.u16 r0, d20[0]
strb r0, [sp, #8]
vmov.u16 r0, d18[3]
strb r0, [sp, #3]
vmov.u16 r0, d18[2]
strb r0, [sp, #2]
vmov.u16 r0, d18[1]
strb r0, [sp, #1]
vmov.u16 r0, d18[0]
strb r0, [sp]
vmov.u16 r0, d16[3]
strb r0, [sp, #7]
vmov.u16 r0, d16[2]
strb r0, [sp, #6]
vmov.u16 r0, d16[1]
strb r0, [sp, #5]
vmov.u16 r0, d16[0]
strb r0, [sp, #4]
vldmia sp, {d16, d17}
vmov r0, r1, d16
vmov r2, r3, d17
mov sp, r7
pop {r7}
bx lr
.globl _test2
.align 2
_test2: @ @test2
@ BB#0:
push {r7}
mov r7, sp
sub sp, sp, #12
bic sp, sp, #7
vld1.64 {d16, d17}, [r0:128]
add r0, r0, #16
vld1.64 {d20, d21}, [r0:128]
vmovn.i32 d18, q8
vmov.u16 r0, d18[3]
vmovn.i32 d16, q10
strb r0, [sp, #3]
vmov.u16 r0, d18[2]
strb r0, [sp, #2]
vmov.u16 r0, d18[1]
strb r0, [sp, #1]
vmov.u16 r0, d18[0]
strb r0, [sp]
vmov.u16 r0, d16[3]
strb r0, [sp, #7]
vmov.u16 r0, d16[2]
strb r0, [sp, #6]
vmov.u16 r0, d16[1]
strb r0, [sp, #5]
vmov.u16 r0, d16[0]
strb r0, [sp, #4]
ldm sp, {r0, r1}
mov sp, r7
pop {r7}
bx lr
Now, however, we generate the much more straightforward:
.syntax unified
.section __TEXT,__text,regular,pure_instructions
.globl _test1
.align 2
_test1: @ @test1
@ BB#0:
add r1, r0, #48
add r2, r0, #32
vld1.64 {d20, d21}, [r0:128]
vld1.64 {d16, d17}, [r1:128]
add r1, r0, #16
vld1.64 {d18, d19}, [r2:128]
vld1.64 {d22, d23}, [r1:128]
vmovn.i32 d17, q8
vmovn.i32 d16, q9
vmovn.i32 d18, q10
vmovn.i32 d19, q11
vmovn.i16 d17, q8
vmovn.i16 d16, q9
vmov r0, r1, d16
vmov r2, r3, d17
bx lr
.globl _test2
.align 2
_test2: @ @test2
@ BB#0:
vld1.64 {d16, d17}, [r0:128]
add r0, r0, #16
vld1.64 {d18, d19}, [r0:128]
vmovn.i32 d16, q8
vmovn.i32 d17, q9
vmovn.i16 d16, q8
vmov r0, r1, d16
bx lr
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179989 91177308-0d34-0410-b5e6-96231b3b80d8
With a little help from the frontend, it looks like the standard va_*
intrinsics can do the job.
Also clean up an old bitcast hack in LowerVAARG that dealt with
unaligned double loads. Load SDNodes can specify an alignment now.
Still missing: Calling varargs functions with float arguments.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179961 91177308-0d34-0410-b5e6-96231b3b80d8
Previously, when spilling 64-bit paired registers, an LDMIA with both
a FrameIndex and an offset was produced. This kind of instruction
shouldn't exist, and the extra operand was being confused with the
predicate, causing aborts later on.
This removes the invalid 0-offset from the instruction being
produced.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179956 91177308-0d34-0410-b5e6-96231b3b80d8
When matching a compare with a subtract where the arguments of the compare are
swapped w.r.t. the arguments of the subtract, we need to negate the predicates
(or CR bit indices) of the users. This, however, is not the same as inverting
the predicate (negating LT -> GT, but inverting LT -> GE, for example). The ARM
backend seems to do this correctly, but when I adapted the code for the PPC
backend, I introduced an error in this logic.
Comparison optimization is now enabled again by default.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179899 91177308-0d34-0410-b5e6-96231b3b80d8
Adding another CU-wide list, in this case of imported_modules (since they
should be relatively rare, it seemed better to add a list where each element
had a "context" value, rather than add a (usually empty) list to every scope).
This takes care of DW_TAG_imported_module, but to fully address PR14606 we'll
need to expand this to cover DW_TAG_imported_declaration too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179836 91177308-0d34-0410-b5e6-96231b3b80d8
This seems to cause a stage-2 LLVM compile failure (by crashing TableGen); do
I'm disabling this for now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179807 91177308-0d34-0410-b5e6-96231b3b80d8