PR2621: Improvements to the SCEV AddRec binomial expansion.

This version uses a new algorithm for evaluating the binomial coefficients 
which is significantly more efficient for AddRecs of more than 2 terms 
(see the comments in the code for details on how the algorithm works).  
It also fixes some bugs: it removes the arbitrary length restriction for 
AddRecs, it fixes the silent generation of incorrect code for AddRecs 
which require a wide calculation width, and it fixes an issue where we 
were incorrectly truncating the iteration count too far when evaluating 
an AddRec expression narrower than the induction variable.

There are still a few related issues I know of: I think there's 
still an issue with the SCEVExpander expansion of AddRec in terms of
the width of the induction variable used.  The hack to avoid generating 
too-wide integers shouldn't be necessary; instead, the callers should be 
considering the cost of the expansion before expanding it (in addition 
to not expanding too-wide integers, we might not want to expand 
expressions that are really expensive, especially when optimizing for 
size; calculating a length-17 32-bit AddRec currently generates about 250 
instructions of straight-line code on X86).  Also, for long 32-bit 
AddRecs on X86, CodeGen really sucks at scheduling the code.  I'm planning on 
filing follow-up PRs for these issues.
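
To make the binomial-coefficient arithmetic concrete, here is a minimal standalone sketch of the same idea on plain 64-bit integers rather than SCEV expressions. The names inverse_odd and binomial_mod_pow2 are hypothetical and not part of the patch. The inverse is computed with a Newton iteration rather than APInt::multiplicativeInverse, and the calculation width is fixed at 64 bits, so the sketch assumes W + T <= 64.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: inverse of an odd number modulo 2^64 (Newton iteration).
// For odd a, a*a == 1 (mod 8), so x = a is already correct to 3 bits;
// each step doubles the number of correct bits (3, 6, 12, 24, 48, 96).
static uint64_t inverse_odd(uint64_t a) {
  assert((a & 1) && "only odd numbers are invertible modulo a power of two");
  uint64_t x = a;
  for (int i = 0; i < 5; ++i)
    x *= 2 - a * x;
  return x;
}

// Hypothetical sketch: BC(it, K) modulo 2^W, assuming K >= 1 and W + T <= 64,
// where T is the number of factors of 2 in K!.
uint64_t binomial_mod_pow2(uint64_t it, unsigned K, unsigned W) {
  assert(K >= 1 && W >= 1 && W <= 64);
  // Split K! into 2^T times an odd part. The odd part may wrap modulo 2^64,
  // which is harmless because only its low W bits matter.
  uint64_t oddFactorial = 1;
  unsigned T = 0;
  for (unsigned i = 2; i <= K; ++i) {
    unsigned m = i;
    while ((m & 1) == 0) {
      m >>= 1;
      ++T;
    }
    oddFactorial *= m;
  }
  assert(W + T <= 64 && "calculation width would exceed 64 bits");
  // Multiply It * (It - 1) * ... * (It - K + 1). Arithmetic is modulo 2^64,
  // so at least the low W + T bits of the product are exact.
  uint64_t product = 1;
  for (unsigned i = 0; i < K; ++i)
    product *= it - i;
  // Exact division by 2^T is a right shift; the low W bits stay accurate.
  uint64_t shifted = product >> T;
  // Exact division by the odd part is multiplication by its inverse.
  uint64_t result = shifted * inverse_odd(oddFactorial);
  return W == 64 ? result : (result & ((uint64_t(1) << W) - 1));
}
```

For example, binomial_mod_pow2(20, 10, 16) yields 53684, which is BC(20, 10) = 184756 reduced modulo 2^16. Only multiplies and a shift are used; no division instruction is needed.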



git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@54332 91177308-0d34-0410-b5e6-96231b3b80d8
Eli Friedman 2008-08-04 23:49:06 +00:00
parent 6f498b0a8e
commit b42a626122
3 changed files with 192 additions and 84 deletions


@ -507,91 +507,125 @@ SCEVHandle ScalarEvolution::getMinusSCEV(const SCEVHandle &LHS,
 }
-/// BinomialCoefficient - Compute BC(It, K). The result is of the same type as
-/// It. Assume, K > 0.
+/// BinomialCoefficient - Compute BC(It, K). The result has width W.
+// Assume, K > 0.
 static SCEVHandle BinomialCoefficient(SCEVHandle It, unsigned K,
-                                      ScalarEvolution &SE) {
+                                      ScalarEvolution &SE,
+                                      const IntegerType* ResultTy) {
+  // Handle the simplest case efficiently.
+  if (K == 1)
+    return SE.getTruncateOrZeroExtend(It, ResultTy);
   // We are using the following formula for BC(It, K):
   //
   // BC(It, K) = (It * (It - 1) * ... * (It - K + 1)) / K!
   //
-  // Suppose, W is the bitwidth of It (and of the return value as well). We
-  // must be prepared for overflow. Hence, we must assure that the result of
-  // our computation is equal to the accurate one modulo 2^W. Unfortunately,
-  // division isn't safe in modular arithmetic. This means we must perform the
-  // whole computation accurately and then truncate the result to W bits.
+  // Suppose, W is the bitwidth of the return value. We must be prepared for
+  // overflow. Hence, we must assure that the result of our computation is
+  // equal to the accurate one modulo 2^W. Unfortunately, division isn't
+  // safe in modular arithmetic.
   //
-  // The dividend of the formula is a multiplication of K integers of bitwidth
-  // W. K*W bits suffice to compute it accurately.
+  // However, this code doesn't use exactly that formula; the formula it uses
+  // is something like the following, where T is the number of factors of 2 in
+  // K! (i.e. trailing zeros in the binary representation of K!), and ^ is
+  // exponentiation:
   //
-  // FIXME: We assume the divisor can be accurately computed using 16-bit
-  // unsigned integer type. It is true up to K = 8 (AddRecs of length 9). In
-  // future we may use APInt to use the minimum number of bits necessary to
-  // compute it accurately.
+  // BC(It, K) = (It * (It - 1) * ... * (It - K + 1)) / 2^T / (K! / 2^T)
   //
-  // It is safe to use unsigned division here: the dividend is nonnegative and
-  // the divisor is positive.
-  // Handle the simplest case efficiently.
-  if (K == 1)
-    return It;
-  assert(K < 9 && "We cannot handle such long AddRecs yet.");
-  // FIXME: A temporary hack to remove in future. Arbitrary precision integers
-  // aren't supported by the code generator yet. For the dividend, the bitwidth
-  // we use is the smallest power of 2 greater or equal to K*W and less or equal
-  // to 64. Note that setting the upper bound for bitwidth may still lead to
-  // miscompilation in some cases.
-  unsigned DividendBits = 1U << Log2_32_Ceil(K * It->getBitWidth());
-  if (DividendBits > 64)
-    DividendBits = 64;
-#if 0 // Waiting for the APInt support in the code generator...
-  unsigned DividendBits = K * It->getBitWidth();
-#endif
-  const IntegerType *DividendTy = IntegerType::get(DividendBits);
-  const SCEVHandle ExIt = SE.getTruncateOrZeroExtend(It, DividendTy);
-  // The final number of bits we need to perform the division is the maximum of
-  // dividend and divisor bitwidths.
-  const IntegerType *DivisionTy =
-    IntegerType::get(std::max(DividendBits, 16U));
-  // Compute K! We know K >= 2 here.
-  unsigned F = 2;
-  for (unsigned i = 3; i <= K; ++i)
-    F *= i;
-  APInt Divisor(DivisionTy->getBitWidth(), F);
-  // Handle this case efficiently, it is common to have constant iteration
-  // counts while computing loop exit values.
-  if (SCEVConstant *SC = dyn_cast<SCEVConstant>(ExIt)) {
-    const APInt& N = SC->getValue()->getValue();
-    APInt Dividend(N.getBitWidth(), 1);
-    for (; K; --K)
-      Dividend *= N-(K-1);
-    if (DividendTy != DivisionTy)
-      Dividend = Dividend.zext(DivisionTy->getBitWidth());
-    APInt Result = Dividend.udiv(Divisor);
-    if (Result.getBitWidth() != It->getBitWidth())
-      Result = Result.trunc(It->getBitWidth());
-    return SE.getConstant(Result);
+  // This formula is trivially equivalent to the previous formula. However,
+  // this formula can be implemented much more efficiently. The trick is that
+  // K! / 2^T is odd, and exact division by an odd number *is* safe in modular
+  // arithmetic. To do exact division in modular arithmetic, all we have
+  // to do is multiply by the inverse. Therefore, this step can be done at
+  // width W.
+  //
+  // The next issue is how to safely do the division by 2^T. The way this
+  // is done is by doing the multiplication step at a width of at least W + T
+  // bits. This way, the bottom W+T bits of the product are accurate. Then,
+  // when we perform the division by 2^T (which is equivalent to a right shift
+  // by T), the bottom W bits are accurate. Extra bits are okay; they'll get
+  // truncated out after the division by 2^T.
+  //
+  // In comparison to just directly using the first formula, this technique
+  // is much more efficient; using the first formula requires W * K bits,
+  // but this formula requires less than W + K bits. Also, the first formula requires
+  // a division step, whereas this formula only requires multiplies and shifts.
+  //
+  // It doesn't matter whether the subtraction step is done in the calculation
+  // width or the input iteration count's width; if the subtraction overflows,
+  // the result must be zero anyway. We prefer here to do it in the width of
+  // the induction variable because it helps a lot for certain cases; CodeGen
+  // isn't smart enough to ignore the overflow, which leads to much less
+  // efficient code if the width of the subtraction is wider than the native
+  // register width.
+  //
+  // (It's possible to not widen at all by pulling out factors of 2 before
+  // the multiplication; for example, K=2 can be calculated as
+  // It/2*(It+(It*INT_MIN/INT_MIN)+-1). However, it requires
+  // extra arithmetic, so it's not an obvious win, and it gets
+  // much more complicated for K > 3.)
+  // Protection from insane SCEVs; this bound is conservative,
+  // but it probably doesn't matter.
+  if (K > 1000)
+    return new SCEVCouldNotCompute();
+  unsigned W = ResultTy->getBitWidth();
+  // Calculate K! / 2^T and T; we divide out the factors of two before
+  // multiplying for calculating K! / 2^T to avoid overflow.
+  // Other overflow doesn't matter because we only care about the bottom
+  // W bits of the result.
+  APInt OddFactorial(W, 1);
+  unsigned T = 1;
+  for (unsigned i = 3; i <= K; ++i) {
+    APInt Mult(W, i);
+    unsigned TwoFactors = Mult.countTrailingZeros();
+    T += TwoFactors;
+    Mult = Mult.lshr(TwoFactors);
+    OddFactorial *= Mult;
   }
-  SCEVHandle Dividend = ExIt;
-  for (unsigned i = 1; i != K; ++i)
-    Dividend =
-      SE.getMulExpr(Dividend,
-                    SE.getMinusSCEV(ExIt, SE.getIntegerSCEV(i, DividendTy)));
-  return SE.getTruncateOrZeroExtend(
-    SE.getUDivExpr(
-      SE.getTruncateOrZeroExtend(Dividend, DivisionTy),
-      SE.getConstant(Divisor)
-    ), It->getType());
+  // We need at least W + T bits for the multiplication step
+  // FIXME: A temporary hack; we round up the bitwidths
+  // to the nearest power of 2 to be nice to the code generator.
+  unsigned CalculationBits = 1U << Log2_32_Ceil(W + T);
+  // FIXME: Temporary hack to avoid generating integers that are too wide.
+  // Although, it's not completely clear how to determine how much
+  // widening is safe; for example, on X86, we can't really widen
+  // beyond 64 because we need to be able to do multiplication
+  // that's CalculationBits wide, but on X86-64, we can safely widen up to
+  // 128 bits.
+  if (CalculationBits > 64)
+    return new SCEVCouldNotCompute();
+  // Calculate 2^T, at width T+W.
+  APInt DivFactor = APInt(CalculationBits, 1).shl(T);
+  // Calculate the multiplicative inverse of K! / 2^T;
+  // this multiplication factor will perform the exact division by
+  // K! / 2^T.
+  APInt Mod = APInt::getSignedMinValue(W+1);
+  APInt MultiplyFactor = OddFactorial.zext(W+1);
+  MultiplyFactor = MultiplyFactor.multiplicativeInverse(Mod);
+  MultiplyFactor = MultiplyFactor.trunc(W);
+  // Calculate the product, at width T+W
+  const IntegerType *CalculationTy = IntegerType::get(CalculationBits);
+  SCEVHandle Dividend = SE.getTruncateOrZeroExtend(It, CalculationTy);
+  for (unsigned i = 1; i != K; ++i) {
+    SCEVHandle S = SE.getMinusSCEV(It, SE.getIntegerSCEV(i, It->getType()));
+    Dividend = SE.getMulExpr(Dividend,
+                             SE.getTruncateOrZeroExtend(S, CalculationTy));
+  }
+  // Divide by 2^T
+  SCEVHandle DivResult = SE.getUDivExpr(Dividend, SE.getConstant(DivFactor));
+  // Truncate the result, and divide by K! / 2^T.
+  return SE.getMulExpr(SE.getConstant(MultiplyFactor),
+                       SE.getTruncateOrZeroExtend(DivResult, ResultTy));
 }
 /// evaluateAtIteration - Return the value of this chain of recurrences at
@ -610,8 +644,10 @@ SCEVHandle SCEVAddRecExpr::evaluateAtIteration(SCEVHandle It,
     // The computation is correct in the face of overflow provided that the
     // multiplication is performed _after_ the evaluation of the binomial
     // coefficient.
-    SCEVHandle Val = SE.getMulExpr(getOperand(i),
-                                   BinomialCoefficient(It, i, SE));
+    SCEVHandle Val =
+      SE.getMulExpr(getOperand(i),
+                    BinomialCoefficient(It, i, SE,
+                                        cast<IntegerType>(getType())));
     Result = SE.getAddExpr(Result, Val);
   }
   return Result;
@ -2441,17 +2477,8 @@ SCEVHandle ScalarEvolutionsImpl::getSCEVAtScope(SCEV *V, const Loop *L) {
       // loop iterates. Compute this now.
       SCEVHandle IterationCount = getIterationCount(AddRec->getLoop());
       if (IterationCount == UnknownValue) return UnknownValue;
-      IterationCount = SE.getTruncateOrZeroExtend(IterationCount,
-                                                  AddRec->getType());
-      // If the value is affine, simplify the expression evaluation to just
-      // Start + Step*IterationCount.
-      if (AddRec->isAffine())
-        return SE.getAddExpr(AddRec->getStart(),
-                             SE.getMulExpr(IterationCount,
-                                           AddRec->getOperand(1)));
-      // Otherwise, evaluate it the hard way.
+      // Then, evaluate the AddRec.
       return AddRec->evaluateAtIteration(IterationCount, SE);
     }
     return UnknownValue;
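
The truncation of IterationCount removed in the last hunk matters because BC(It, K) modulo 2^W depends on more than the low W bits of It: the division by 2^T consumes the low W + T bits. The following is a small standalone check of that claim for K = 2 and W = 16 with a 32-bit iteration count. It uses plain integer arithmetic rather than SCEV, and the function names are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: BC(n, 2) = n * (n - 1) / 2 reduced modulo 2^16.
// n is a 32-bit value widened to 64 bits, so the product is exact for n >= 1;
// for n == 0 the wrapped factor is multiplied by zero and the result is 0.
uint16_t choose2_mod_2_16(uint32_t n) {
  uint64_t wide = n;
  return static_cast<uint16_t>(wide * (wide - 1) / 2);
}

// Reference: accumulate 0 + 1 + ... + (n - 1) in a wrapping 16-bit register,
// the way a 16-bit accumulator driven by a 32-bit counter would behave.
uint16_t sum_below_16bit(uint32_t n) {
  uint16_t acc = 0;
  for (uint32_t i = 0; i < n; ++i)
    acc = static_cast<uint16_t>(acc + static_cast<uint16_t>(i));
  return acc;
}
```

With n = 888888, both functions give 48644. Truncating n to 16 bits first (888888 & 0xFFFF = 36920) gives 15876 instead; the two results differ by exactly 2^15, which is the bit lost by truncating the iteration count too early.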


@ -0,0 +1,25 @@
; RUN: llvm-as < %s | opt -analyze -scalar-evolution -disable-output \
; RUN: -scalar-evolution-max-iterations=0 | grep -F "Exits: 20028"
; PR2621
define i32 @a() nounwind {
entry:
br label %bb1
bb:
trunc i32 %i.0 to i16
add i16 %0, %x16.0
add i32 %i.0, 1
br label %bb1
bb1:
%i.0 = phi i32 [ 0, %entry ], [ %2, %bb ]
%x16.0 = phi i16 [ 0, %entry ], [ %1, %bb ]
icmp ult i32 %i.0, 888888
br i1 %3, label %bb, label %bb2
bb2:
zext i16 %x16.0 to i32
ret i32 %4
}


@ -0,0 +1,56 @@
; RUN: llvm-as < %s | opt -analyze -scalar-evolution -disable-output \
; RUN: -scalar-evolution-max-iterations=0 | grep -F "Exits: -19168"
; PR2621
define i32 @a() nounwind {
entry:
br label %bb1
bb: ; preds = %bb1
add i16 %x17.0, 1 ; <i16>:0 [#uses=2]
add i16 %0, %x16.0 ; <i16>:1 [#uses=2]
add i16 %1, %x15.0 ; <i16>:2 [#uses=2]
add i16 %2, %x14.0 ; <i16>:3 [#uses=2]
add i16 %3, %x13.0 ; <i16>:4 [#uses=2]
add i16 %4, %x12.0 ; <i16>:5 [#uses=2]
add i16 %5, %x11.0 ; <i16>:6 [#uses=2]
add i16 %6, %x10.0 ; <i16>:7 [#uses=2]
add i16 %7, %x9.0 ; <i16>:8 [#uses=2]
add i16 %8, %x8.0 ; <i16>:9 [#uses=2]
add i16 %9, %x7.0 ; <i16>:10 [#uses=2]
add i16 %10, %x6.0 ; <i16>:11 [#uses=2]
add i16 %11, %x5.0 ; <i16>:12 [#uses=2]
add i16 %12, %x4.0 ; <i16>:13 [#uses=2]
add i16 %13, %x3.0 ; <i16>:14 [#uses=2]
add i16 %14, %x2.0 ; <i16>:15 [#uses=2]
add i16 %15, %x1.0 ; <i16>:16 [#uses=1]
add i32 %i.0, 1 ; <i32>:17 [#uses=1]
br label %bb1
bb1: ; preds = %bb, %entry
%x2.0 = phi i16 [ 0, %entry ], [ %15, %bb ] ; <i16> [#uses=1]
%x3.0 = phi i16 [ 0, %entry ], [ %14, %bb ] ; <i16> [#uses=1]
%x4.0 = phi i16 [ 0, %entry ], [ %13, %bb ] ; <i16> [#uses=1]
%x5.0 = phi i16 [ 0, %entry ], [ %12, %bb ] ; <i16> [#uses=1]
%x6.0 = phi i16 [ 0, %entry ], [ %11, %bb ] ; <i16> [#uses=1]
%x7.0 = phi i16 [ 0, %entry ], [ %10, %bb ] ; <i16> [#uses=1]
%x8.0 = phi i16 [ 0, %entry ], [ %9, %bb ] ; <i16> [#uses=1]
%x9.0 = phi i16 [ 0, %entry ], [ %8, %bb ] ; <i16> [#uses=1]
%x10.0 = phi i16 [ 0, %entry ], [ %7, %bb ] ; <i16> [#uses=1]
%x11.0 = phi i16 [ 0, %entry ], [ %6, %bb ] ; <i16> [#uses=1]
%x12.0 = phi i16 [ 0, %entry ], [ %5, %bb ] ; <i16> [#uses=1]
%x13.0 = phi i16 [ 0, %entry ], [ %4, %bb ] ; <i16> [#uses=1]
%x14.0 = phi i16 [ 0, %entry ], [ %3, %bb ] ; <i16> [#uses=1]
%x15.0 = phi i16 [ 0, %entry ], [ %2, %bb ] ; <i16> [#uses=1]
%x16.0 = phi i16 [ 0, %entry ], [ %1, %bb ] ; <i16> [#uses=1]
%x17.0 = phi i16 [ 0, %entry ], [ %0, %bb ] ; <i16> [#uses=1]
%i.0 = phi i32 [ 0, %entry ], [ %17, %bb ] ; <i32> [#uses=2]
%x1.0 = phi i16 [ 0, %entry ], [ %16, %bb ] ; <i16> [#uses=2]
icmp ult i32 %i.0, 8888 ; <i1>:18 [#uses=1]
br i1 %18, label %bb, label %bb2
bb2: ; preds = %bb1
zext i16 %x1.0 to i32 ; <i32>:19 [#uses=1]
ret i32 %19
}