LLVM backend for 6502
Go to file
Chandler Carruth 828f5b807c [x86] Implement a faster vector population count based on the PSHUFB
in-register LUT technique.

Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html

The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.

On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.

The base implemantion of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.

I have extended Bruno's patch in the following ways:

0) I committed the new tests with baseline sequences so this shows
   a diff, and regenerated the tests using the update scripts.

1) Bruno had noticed and mentioned in IRC a redundant mask that
   I removed.

2) I introduced a particular optimization for the i32 vector cases where
   we use PSHL + PSADBW to compute the the low i32 popcounts, and PSHUFD
   + PSADBW to compute doubled high i32 popcounts. This takes advantage
   of the fact that to line up the high i32 popcounts we have to shift
   them anyways, and we can shift them by one fewer bit to effectively
   divide the count by two. While the PSHUFD based horizontal add is no
   faster, it doesn't require registers or load traffic the way a mask
   would, and provides more ILP as it happens on different ports with
   high throughput.

3) I did some code cleanups throughout to simplify the implementation
   logic.

4) I refactored it to continue to use the parallel bitmath lowering when
   SSSE3 is not available to preserve the performance of that version on
   SSE2 targets where it is still much better than scalarizing as we'll
   still do a bitmath implementation of popcount even in scalar code
   there.

With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.

With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.

Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.

Differential Revision: http://reviews.llvm.org/D10084

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@238636 91177308-0d34-0410-b5e6-96231b3b80d8
2015-05-30 03:20:59 +00:00
autoconf [omp] Actually provide a default OpenMP runtime -- libgomp for now. 2015-05-28 02:17:15 +00:00
bindings IR: Give 'DI' prefix to debug info metadata 2015-04-29 16:38:44 +00:00
cmake [CMake] Bug 23468 - LLVM_OPTIMIZED_TABLEGEN does not work with Visual Studio 2015-05-29 18:34:41 +00:00
docs [docs] fix the declarations of the llvm.nvvm.ptr.gen.to.* intrinsics 2015-05-29 22:18:03 +00:00
examples BrainF.cpp: Update CreateCall() according to r237624. 2015-05-19 06:50:19 +00:00
include MC: Clean up MCExpr naming. NFC. 2015-05-30 01:25:56 +00:00
lib [x86] Implement a faster vector population count based on the PSHUFB 2015-05-30 03:20:59 +00:00
projects build: make libunwind a proper project 2015-04-25 01:47:39 +00:00
test [x86] Implement a faster vector population count based on the PSHUFB 2015-05-30 03:20:59 +00:00
tools Replace push_back(Constructor(foo)) with emplace_back(foo) for non-trivial types 2015-05-29 19:43:39 +00:00
unittests YAML traits need to be in the llvm::yaml namespace. 2015-05-29 18:14:55 +00:00
utils Replace push_back(Constructor(foo)) with emplace_back(foo) for non-trivial types 2015-05-29 19:43:39 +00:00
.arcconfig Updated phabricator server. 2014-04-07 03:57:04 +00:00
.clang-format Test commit. 2014-03-02 13:08:46 +00:00
.clang-tidy Enable display of compiler diagnostics in clang-tidy by default. 2014-10-29 17:29:38 +00:00
.gitignore Ignore compile_commands.json only at the root of the tree. 2015-03-26 18:55:42 +00:00
CMakeLists.txt Enable solid lzma compression for cpack, decreases setup size by ~30% 2015-05-14 17:07:41 +00:00
CODE_OWNERS.TXT Added Andrey Churbanov as the owner of the OpenMP runtime library code 2015-05-05 20:17:53 +00:00
configure [omp] Actually provide a default OpenMP runtime -- libgomp for now. 2015-05-28 02:17:15 +00:00
CREDITS.TXT Rise from the dead and update personal info 2014-08-25 17:51:04 +00:00
LICENSE.TXT Update for a new year. 2015-03-12 01:25:29 +00:00
llvm.spec.in
LLVMBuild.txt Remove the very substantial, largely unmaintained legacy PGO 2013-10-02 15:42:23 +00:00
Makefile [configure/make] Propagate names of build host tools when making BuildTools 2014-03-25 21:45:41 +00:00
Makefile.common
Makefile.config.in Deprecate in-source autotools builds 2015-05-04 02:04:54 +00:00
Makefile.rules Add support for SunOS function/data sections and associated 2015-03-03 20:54:29 +00:00
README.txt Revert test commit at revision 233535. 2015-03-30 12:39:03 +00:00

Low Level Virtual Machine (LLVM)
================================

This directory and its subdirectories contain source code for LLVM,
a toolkit for the construction of highly optimized compilers,
optimizers, and runtime environments.

LLVM is open source software. You may freely distribute it under the terms of
the license agreement found in LICENSE.txt.

Please see the documentation provided in docs/ for further
assistance with LLVM, and in particular docs/GettingStarted.rst for getting
started with LLVM and docs/README.txt for an overview of LLVM's
documentation setup.

If you're writing a package for LLVM, see docs/Packaging.rst for our
suggestions.