The save area is twice as big and there is no struct return slot. The
stack pointer is always 16-byte aligned (after adding the bias).
Also eliminate the stack adjustment instructions around calls when the
function has a reserved stack frame.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179083 91177308-0d34-0410-b5e6-96231b3b80d8
The costs are overfitted so that I can still use the legalization factor.
For example the following kernel has about half the throughput vectorized than
unvectorized when compiled with SSE2. Before this patch we would vectorize it.
unsigned short A[1024];
double B[1024];
void f() {
int i;
for (i = 0; i < 1024; ++i) {
B[i] = (double) A[i];
}
}
radar://13599001
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179033 91177308-0d34-0410-b5e6-96231b3b80d8
PowerPC has a conditional branch to the link register (return) instruction: BCLR.
This should be used any time when we'd otherwise have a conditional branch to a
return. This adds a small pass, PPCEarlyReturn, which runs just prior to the
branch selection pass (and, importantly, after block placement) to generate
these conditional returns when possible. It will also eliminate unconditional
branches to returns (these happen rarely; most of the time these have already
been tail duplicated by the time PPCEarlyReturn is invoked). This is a nice
optimization for small functions that do not maintain a stack frame.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179026 91177308-0d34-0410-b5e6-96231b3b80d8
I've managed to convince myself that AArch64's acquire/release
instructions are sufficient to guarantee C++11's required semantics,
even in the sequentially-consistent case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179005 91177308-0d34-0410-b5e6-96231b3b80d8
First, we should not cheat: fsel-based lowering of select_cc is a
finite-math-only optimization (the ISA manual, section F.3 of v2.06, makes
this clear, as does a note in our own README).
This also adds fsel-based lowering of EQ and NE condition codes. As it turned
out, fsel generation was covered by a grand total of zero regression test
cases. I've added some test cases to cover the existing behavior (which is now
finite-math only), as well as the new EQ cases.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@179000 91177308-0d34-0410-b5e6-96231b3b80d8
The code in getTypeConversion attempts to promote the element vector type
before it trys to split or widen the vector.
After it failed finding a legal vector type by promoting it would continue using
the promoted vector element type. Thereby missing legal splitted vector types.
For example the type v32i32 that has a legal split of 4 x v3i32 on x86/sse2
would be transformed to: v32i256 and from there on successively split to:
v16i256, v8i256, v1i256 and then finally ends up as an i64 type.
By resetting the vector element type to the original vector element type that
existed before the promotion the code will attempt to split the vector type to
smaller vector widths of the same type.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178999 91177308-0d34-0410-b5e6-96231b3b80d8
The fix for PR14972 in r177055 introduced a real think-o in the *store*
side, likely because I was much more focused on the load side. While we
can arbitrarily widen (or narrow) a loaded value, we can't arbitrarily
widen a value to be stored, as that changes the width of memory access!
Lock down the code path in the store rewriting which would do this to
only handle the intended circumstance.
All of the existing tests continue to pass, and I've added a test from
the PR.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178974 91177308-0d34-0410-b5e6-96231b3b80d8
a relocation across sections. Do this for DW_AT_stmt list in the
skeleton CU and check the relocations in the debug_info section.
Add a FIXME for multiple CUs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178969 91177308-0d34-0410-b5e6-96231b3b80d8
Integer return values are sign or zero extended by the callee, and
structs up to 32 bytes in size can be returned in registers.
The CC_Sparc64 CallingConv definition is shared between
LowerFormalArguments_64 and LowerReturn_64. Function arguments and
return values are passed in the same registers.
The inreg flag is also used for return values. This is required to handle
C functions returning structs containing floats and ints:
struct ifp {
int i;
float f;
};
struct ifp f(void);
LLVM IR:
define inreg { i32, float } @f() {
...
ret { i32, float } %retval
}
The ABI requires that %retval.i is returned in the high bits of %i0
while %retval.f goes in %f1.
Without the inreg return value attribute, %retval.i would go in %i0 and
%retval.f would go in %f3 which is a more efficient way of returning
%multiple values, but it is not ABI compliant for returning C structs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178966 91177308-0d34-0410-b5e6-96231b3b80d8
64-bit SPARC v9 processes use biased stack and frame pointers, so the
current function's stack frame is located at %sp+BIAS .. %fp+BIAS where
BIAS = 2047.
This makes more local variables directly accessible via [%fp+simm13]
addressing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178965 91177308-0d34-0410-b5e6-96231b3b80d8
There are certain PPC instructions into which we can fold a zero immediate
operand. We can detect such cases by looking at the register class required
by the using operand (so long as it is not otherwise constrained).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178961 91177308-0d34-0410-b5e6-96231b3b80d8
All arguments are formally assigned to stack positions and then promoted
to floating point and integer registers. Since there are more floating
point registers than integer registers, this can cause situations where
floating point arguments are assigned to registers after integer
arguments that where assigned to the stack.
Use the inreg flag to indicate 32-bit fragments of structs containing
both float and int members.
The three-way shadowing between stack, integer, and floating point
registers requires custom argument lowering. The good news is that
return values are passed in the exact same way, and we can share the
code.
Still missing:
- Update LowerReturn to handle structs returned in registers.
- LowerCall.
- Variadic functions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178958 91177308-0d34-0410-b5e6-96231b3b80d8
SITargetLowering::analyzeImmediate() was converting the 64-bit values
to 32-bit and then checking if they were an inline immediate. Some
of these conversions caused this check to succeed and produced
S_MOV instructions with 64-bit immediates, which are illegal.
v2:
- Clean up logic
Reviewed-by: Christian König <christian.koenig@amd.com>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178927 91177308-0d34-0410-b5e6-96231b3b80d8
On cores for which we know the misprediction penalty, and we have
the isel instruction, we can profitably perform early if conversion.
This enables us to replace some small branch sequences with selects
and avoid the potential stalls from mispredicting the branches.
Enabling this feature required implementing canInsertSelect and
insertSelect in PPCInstrInfo; isel code in PPCISelLowering was
refactored to use these functions as well.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178926 91177308-0d34-0410-b5e6-96231b3b80d8
This is the counterpart to commit r160637, except it performs the action
in the bottomup portion of the data flow analysis.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178922 91177308-0d34-0410-b5e6-96231b3b80d8
The normal dataflow sequence in the ARC optimizer consists of the following
states:
Retain -> CanRelease -> Use -> Release
The optimizer before this patch stored the uses that determine the lifetime of
the retainable object pointer when it bottom up hits a retain or when top down
it hits a release. This is correct for an imprecise lifetime scenario since what
we are trying to do is remove retains/releases while making sure that no
``CanRelease'' (which is usually a call) deallocates the given pointer before we
get to the ``Use'' (since that would cause a segfault).
If we are considering the precise lifetime scenario though, this is not
correct. In such a situation, we *DO* care about the previous sequence, but
additionally, we wish to track the uses resulting from the following incomplete
sequences:
Retain -> CanRelease -> Release (TopDown)
Retain <- Use <- Release (BottomUp)
*NOTE* This patch looks large but the most of it consists of updating
test cases. Additionally this fix exposed an additional bug. I removed
the test case that expressed said bug and will recommit it with the fix
in a little bit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178921 91177308-0d34-0410-b5e6-96231b3b80d8
This optimization is unstable at this moment; it
1) block us on a very important application
2) PR15200
3) test6 and test7 in test/Transforms/ScalarRepl/dynamic-vector-gep.ll
(the CHECK command compare the output against wrong result)
I personally believe this optimization should not have any impact on the
autovectorized code, as auto-vectorizer is supposed to put gather/scatter
in a "right" way. Although in theory downstream optimizaters might reveal
some gather/scatter optimization opportunities, the chance is quite slim.
For the hand-crafted vectorizing code, in term of redundancy elimination,
load-CSE, copy-propagation and DSE can collectively achieve the same result,
but in much simpler way. On the other hand, these optimizers are able to
improve the code in a incremental way; in contrast, SROA is sort of all-or-none
approach. However, SROA might slighly win in stack size, as it tries to figure
out a stretch of memory tightenly cover the area accessed by the dynamic index.
rdar://13174884
PR15200
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178912 91177308-0d34-0410-b5e6-96231b3b80d8
llvm-mips-linux green.
llvm-mips-linux runs on a big endian machine. This test passes if I change 'e'
to 'E' in the target data layout string.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178910 91177308-0d34-0410-b5e6-96231b3b80d8
memory operands.
Essentially, this layers an infix calculator on top of the parsing state
machine. The scale on the index register is still expected to be an immediate
__asm mov eax, [eax + ebx*4]
and will not work with more complex expressions. For example,
__asm mov eax, [eax + ebx*(2*2)]
The plus and minus binary operators assume the numeric value of a register is
zero so as to not change the displacement. Register operands should never
be an operand for a multiply or divide operation; the scale*indexreg
expression is always replaced with a zero on the operand stack to prevent
such a case.
rdar://13521380
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178881 91177308-0d34-0410-b5e6-96231b3b80d8
InMemoryStruct is extremely dangerous as it returns data from an internal
buffer when the endiannes doesn't match. This should fix the tests on big
endian hosts.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178875 91177308-0d34-0410-b5e6-96231b3b80d8
When the RuntimeDyldELF::processRelocationRef routine finds the target
symbol of a relocation in the local or global symbol table, it performs
a section-relative relocation:
Value.SectionID = lsi->second.first;
Value.Addend = lsi->second.second;
At this point, however, any Addend that might have been specified in
the original relocation record is lost. This is somewhat difficult to
trigger for relocations within the code section since they usually
do not contain non-zero Addends (when built with the default JIT code
model, in any case). However, the problem can be reliably triggered
by a relocation within the data section caused by code like:
int test[2] = { -1, 0 };
int *p = &test[1];
The initializer of "p" will need a relocation to "test + 4". On
platforms using RelA relocations this means an Addend of 4 is required.
Current code ignores this addend when processing the relocation,
resulting in incorrect execution.
Fixed by taking the Addend into account when processing relocations
to symbols found in the local or global symbol table.
Tested on x86_64-linux and powerpc64-linux.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178869 91177308-0d34-0410-b5e6-96231b3b80d8
Pass down the fact that an operand is going to be a vector of constants.
This should bring the performance of MultiSource/Benchmarks/PAQ8p/paq8p on x86
back. It had degraded to scalar performance due to my pervious shift cost change
that made all shifts expensive on x86.
radar://13576547
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178809 91177308-0d34-0410-b5e6-96231b3b80d8
SSE2 has efficient support for shifts by a scalar. My previous change of making
shifts expensive did not take this into account marking all shifts as expensive.
This would prevent vectorization from happening where it is actually beneficial.
With this change we differentiate between shifts of constants and other shifts.
radar://13576547
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178808 91177308-0d34-0410-b5e6-96231b3b80d8
The DAGCombine logic that recognized a/sqrt(b) and transformed it into
a multiplication by the reciprocal sqrt did not handle cases where the
sqrt and the division were separated by an fpext or fptrunc.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178801 91177308-0d34-0410-b5e6-96231b3b80d8
It had been dropped during the switch to yaml::IO. Also add a test going
from yaml2obj to llvm-readobj. It can be extended as we add more
fields/formats to yaml2obj.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@178786 91177308-0d34-0410-b5e6-96231b3b80d8