[x86] Fix PR20355 (for real). There are many layers to this bug.

The tale starts with r212808 which attempted to fix inversion of the low and high bits when lowering MUL_LOHI. Sadly, that commit did not include any positive test cases, and just removed some operations from a test case where the actual logic being changed isn't fully visible from the test. What this commit did was two things. First, it reversed the low and high results in the formation of the MERGE_VALUES node for the multiple results. This is entirely correct. Second it changed the shuffles for extracting the low and high components from the i64 results of the multiplies to extract them assuming a big-endian-style encoding of the multiply results. This second change is wrong. There is no big-endian encoding in x86, the results of the multiplies are normal v2i64s: when cast to v4i32, the low i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3. However, the first change wasn't enough to actually fix the bug, which is (I assume) why the second change was also made. There was another bug in the MERGE_VALUES formation: we weren't using a VTList, and so were getting a single result node! When grabbing the *second* result from the node, we got... well.. colud be anything. I think this *appeared* to invert things, but had to be causing other problems as well. Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should have been fine, right? NOOOPE! Because the core bug was never addressed, the test in vector-idiv failed when I fixed the MERGE_VALUES node. Because there are essentially no docs for this node, I had to guess at how to fix it and tried swapping the operands, restoring the order of the original code before r212808. While this "fixed" the test case (in that we produced the write instructions) we were still extracting the wrong elements of the i64s, and thus PR20355 was still broken. This commit essentially reverts the big-endian-style extraction part of r212808 and goes back to the original masks which were correct. Now that the MERGE_VALUES node formation is also correct, everything works. I've also included a more detailed test from PR20355 to make sure this stays fixed. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@214011 91177308-0d34-0410-b5e6-96231b3b80d8
2025-04-03 18:32:50 +00:00 · 2014-07-26 03:46:57 +00:00 · 2014-07-26 03:46:57 +00:00 · 5bce4d8edf
commit 5bce4d8edf
parent 4efdce6eed
2 changed files with 46 additions and 20 deletions
--- a/lib/Target/X86/X86ISelLowering.cpp
+++ b/lib/Target/X86/X86ISelLowering.cpp
@ -15444,29 +15444,16 @@ static SDValue LowerMUL_LOHI(SDValue Op, const X86Subtarget *Subtarget,
                             DAG.getNode(Opcode, dl, MulVT, Odd0, Odd1));

  // Shuffle it back into the right order.
-  // The internal representation is big endian.
-  // In other words, a i64 bitcasted to 2 x i32 has its high part at index 0
-  // and its low part at index 1.
-  // Moreover, we have: Mul1 = <ae|cg> ; Mul2 = <bf|dh>
-  // Vector index                0 1   ;          2 3
-  // We want      <ae|bf|cg|dh>
-  // Vector index   0  2  1  3
-  // Since each element is seen as 2 x i32, we get:
-  // high_mask[i] = 2 x vector_index[i]
-  // low_mask[i] = 2 x vector_index[i] + 1
-  // where vector_index = {0, Size/2, 1, Size/2 + 1, ...,
-  //                       Size/2 - 1, Size/2 + Size/2 - 1}
-  // where Size is the number of element of the final vector.
  SDValue Highs, Lows;
  if (VT == MVT::v8i32) {
-    const int HighMask[] = {0, 8, 2, 10, 4, 12, 6, 14};
+    const int HighMask[] = {1, 9, 3, 11, 5, 13, 7, 15};
    Highs = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, HighMask);
-    const int LowMask[] = {1, 9, 3, 11, 5, 13, 7, 15};
+    const int LowMask[] = {0, 8, 2, 10, 4, 12, 6, 14};
    Lows = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, LowMask);
  } else {
-    const int HighMask[] = {0, 4, 2, 6};
+    const int HighMask[] = {1, 5, 3, 7};
    Highs = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, HighMask);
-    const int LowMask[] = {1, 5, 3, 7};
+    const int LowMask[] = {1, 4, 2, 6};
    Lows = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, LowMask);
  }

@ -15484,9 +15471,9 @@ static SDValue LowerMUL_LOHI(SDValue Op, const X86Subtarget *Subtarget,
    Highs = DAG.getNode(ISD::SUB, dl, VT, Highs, Fixup);
  }

-  // THe first result of MUL_LOHI is actually the high value, followed by the
-  // low value.
-  SDValue Ops[] = {Highs, Lows};
+  // The first result of MUL_LOHI is actually the low value, followed by the
+  // high value.
+  SDValue Ops[] = {Lows, Highs};
  return DAG.getMergeValues(Ops, dl);
 }

--- a/test/CodeGen/X86/vector-idiv.ll
+++ b/test/CodeGen/X86/vector-idiv.ll
@ -220,3 +220,42 @@ define <2 x i16> @test12() {
 ; AVX-LABEL: test12:
 ; AVX: xorps
 }
+
+define <4 x i32> @PR20355(<4 x i32> %a) {
+; SSE-LABEL: PR20355:
+; SSE:         movdqa {{.*}}, %[[X1:xmm[0-9]+]]
+; SSE-NEXT:    movdqa %[[X1]], %[[X2:xmm[0-9]+]]
+; SSE-NEXT:    psrad  $31, %[[X2]]
+; SSE-NEXT:    pand   %xmm0, %[[X2]]
+; SSE-NEXT:    movdqa %xmm0, %[[X3:xmm[0-9]+]]
+; SSE-NEXT:    psrad  $31, %[[X3]]
+; SSE-NEXT:    pand   %[[X1]], %[[X3]]
+; SSE-NEXT:    paddd  %[[X2]], %[[X3]]
+; SSE-NEXT:    pshufd {{.*}} # [[X4:xmm[0-9]+]] = xmm0[1,0,3,0]
+; SSE-NEXT:    pmuludq %[[X1]], %xmm0
+; SSE-NEXT:    pshufd {{.*}} # [[X1]] = [[X1]][1,0,3,0]
+; SSE-NEXT:    pmuludq %[[X4]], %[[X1]]
+; SSE-NEXT:    shufps {{.*}} # xmm0 = xmm0[1,3],[[X1]][1,3]
+; SSE-NEXT:    pshufd {{.*}} # [[X5:xmm[0-9]+]] = xmm0[0,2,1,3]
+; SSE-NEXT:    psubd  %[[X3]], %[[X5]]
+; SSE-NEXT:    movdqa %[[X5]], %xmm0
+; SSE-NEXT:    psrld  $31, %xmm0
+; SSE-NEXT:    paddd  %[[X5]], %xmm0
+; SSE-NEXT:    retq
+;
+; SSE41-LABEL: PR20355:
+; SSE41:         movdqa {{.*}}, %[[X1:xmm[0-9]+]]
+; SSE41-NEXT:    pshufd {{.*}} # [[X2:xmm[0-9]+]] = xmm0[1,0,3,0]
+; SSE41-NEXT:    pmuldq %[[X1]], %xmm0
+; SSE41-NEXT:    pshufd {{.*}} # [[X1]] = [[X1]][1,0,3,0]
+; SSE41-NEXT:    pmuldq %[[X2]], %[[X1]]
+; SSE41-NEXT:    shufps {{.*}} # xmm0 = xmm0[1,3],[[X1]][1,3]
+; SSE41-NEXT:    pshufd {{.*}} # [[X3:xmm[0-9]+]] = xmm0[0,2,1,3]
+; SSE41-NEXT:    movdqa %[[X3]], %xmm0
+; SSE41-NEXT:    psrld  $31, %xmm0
+; SSE41-NEXT:    paddd  %[[X3]], %xmm0
+; SSE41-NEXT:    retq
+entry:
+  %sdiv = sdiv <4 x i32> %a, <i32 3, i32 3, i32 3, i32 3>
+  ret <4 x i32> %sdiv
+}