llvm-6502/lib/Target/X86/README.txt

//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node).  Add this to
X86, and make the dag combiner produce it when needed.  This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul.  We should add a similar node for
DIVREM.

Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.
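
As a rough illustration of why a combined Hi/Lo node helps, the 64-bit product
can be built from 32-bit pieces in C.  The helper mul_full_u32 is a hypothetical
stand-in for the proposed MUL2U node, and mul64 is only a sketch of the
arithmetic, not the legalizer's actual expansion.  The low 64 bits of the
product are the same for signed and unsigned operands, so unsigned types are
used throughout.

```c
#include <stdint.h>

/* Hypothetical stand-in for MUL2U: one 32x32 multiply that yields both
   halves of the 64-bit product (Lo and Hi). */
static void mul_full_u32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi) {
  uint64_t p = (uint64_t)a * b;
  *lo = (uint32_t)p;
  *hi = (uint32_t)(p >> 32);
}

/* 64x64 -> 64 multiply from 32-bit pieces: one full multiply for the low
   halves plus two low-half-only multiplies for the cross terms. */
static uint64_t mul64(uint64_t x, uint64_t y) {
  uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
  uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
  uint32_t lo, hi;
  mul_full_u32(xl, yl, &lo, &hi);
  hi += xl * yh + xh * yl; /* cross terms only affect the high word */
  return ((uint64_t)hi << 32) | lo;
}
```

The Hi half of xl*yl comes for free from the same multiply that produces the
Lo half, which is the saving the note above is after.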

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer.  What about overflow 
though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
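
On the overflow question: DIV faults when the quotient does not fit in 32 bits,
which happens exactly when the high half of the dividend is not smaller than
the divisor.  One possible way around it, sketched below under the assumption
that Y is nonzero, is to reduce the high half first.  The function names are
hypothetical, and the 64-bit division in the last line only models what a
single DIV would compute.

```c
#include <stdint.h>

/* Nonzero when a single 64/32 DIV cannot overflow. */
static int fits_single_div(uint64_t x, uint32_t y) {
  return y != 0 && (uint32_t)(x >> 32) < y;
}

/* Low 32 bits of x / y, matching the truncation in test() above.
   Assumes y != 0. */
static uint32_t div64by32_trunc(uint64_t x, uint32_t y) {
  uint32_t hi = (uint32_t)(x >> 32);
  uint32_t lo = (uint32_t)x;
  if (hi >= y)
    hi %= y; /* extra 32-bit divide, needed only on the overflow path */
  /* Now hi < y, so the quotient fits in 32 bits (models one DIV). */
  return (uint32_t)((((uint64_t)hi << 32) | lo) / y);
}
```

Reducing the high half modulo Y drops only multiples of 2^32 from the
quotient, so the truncated result is unchanged.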

//===---------------------------------------------------------------------===//

Some targets (e.g. athlons) prefer freep to fstp ST(0):
http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html

//===---------------------------------------------------------------------===//

This should use fiadd on chips where it is profitable:
double foo(double P, int *I) { return P+*I; }

//===---------------------------------------------------------------------===//

The FP stackifier needs to be global.  Also, it should handle simple
permutations to reduce the number of shuffle instructions, e.g. turning:

fld P	->		fld Q
fld Q			fld P
fxch

or:

fxch	->		fucomi
fucomi			jl X
jg X

Ideas:
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html


//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
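
For readers unfamiliar with the transformation, here is a minimal sketch of
what a multiply by a constant looks like once rewritten as shifts and
adds/subtracts.  The constants 10 and 7 and the function names are chosen only
for illustration and are not taken from the linked patch.

```c
#include <stdint.h>

/* x * 10 == (x << 3) + (x << 1), since 10 = 8 + 2 */
static uint32_t mul10(uint32_t x) { return (x << 3) + (x << 1); }

/* x * 7 == (x << 3) - x, since 7 = 8 - 1 */
static uint32_t mul7(uint32_t x) { return (x << 3) - x; }
```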

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):
long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be  ~0ULL >> X and ~0ULL << X.
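
A sketch of the shape the improved code could take for the 1LL << x case: only
one 32-bit shift is needed, and bit 5 of the shift amount selects which half
receives the result.  The function name is hypothetical, x is assumed to be in
[0, 63], and unsigned types are used to keep the C well defined.

```c
#include <stdint.h>

/* Build 1 << x (0 <= x < 64) from two 32-bit halves with one 32-bit shift. */
static uint64_t one_shl(unsigned x) {
  uint32_t bit = (uint32_t)1 << (x & 31);
  uint32_t lo = (x & 32) ? 0 : bit; /* x < 32: the bit lands in the low word */
  uint32_t hi = (x & 32) ? bit : 0; /* x >= 32: the bit lands in the high word */
  return ((uint64_t)hi << 32) | lo;
}
```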

//===---------------------------------------------------------------------===//

Should support emission of the bswap instruction, probably by adding a new
DAG node for byte swapping.  Also useful on PPC which has byte-swapping loads.
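
For reference, this is the kind of source-level idiom such a node would have
to capture.  The function name is hypothetical; the body is the usual
shift-and-mask byte swap that a single bswap could replace.

```c
#include <stdint.h>

/* Reverse the four bytes of v. */
static uint32_t bswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}
```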

//===---------------------------------------------------------------------===//

Compile this:
_Bool f(_Bool a) { return a!=1; }

into:
        movzbl  %dil, %eax
        xorl    $1, %eax
        ret
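
The target code relies on a _Bool holding only 0 or 1, so a != 1 is the same
as a ^ 1.  A small sketch of that equivalence (the function names are
hypothetical):

```c
#include <stdbool.h>

static bool f_ne(bool a) { return a != 1; }  /* as written in the source */
static bool f_xor(bool a) { return a ^ 1; }  /* what the xorl computes */
```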

//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and a hint to the register
allocator. Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//

Add a target specific hook to DAG combiner to handle SINT_TO_FP and
FP_TO_SINT when the source operand is already in memory.

//===---------------------------------------------------------------------===//

Check if load folding would add a cycle in the dag.

//===---------------------------------------------------------------------===//

Model X86 EFLAGS as a real register to avoid redundant cmp / test. e.g.

	cmpl $1, %eax
	setg %al
	testb %al, %al  # unnecessary
	jne .BB7

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3  -fomit-frame-pointer -masm=intel
clz:
        bsr     %eax, DWORD PTR [%esp+4]
        xor     %eax, 31
        ret
ctz:
        bsf     %eax, DWORD PTR [%esp+4]
        ret

However, check that these are defined for 0 and 32.  Our intrinsics are, GCC's
aren't.
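
The catch is that bsr and bsf leave the destination undefined when the source
is zero, so a lowering that must be defined at zero needs an explicit check.
The sketch below assumes the defined result for a zero input is 32; both that
convention and the function names are assumptions, and portable loops stand in
for the bsr/bsf instructions.

```c
#include <stdint.h>

/* Count leading zeros; returns 32 for x == 0 (assumed convention). */
static unsigned clz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32; /* the case bsr alone does not define */
  while (!(x & 0x80000000u)) { x <<= 1; n++; }
  return n;
}

/* Count trailing zeros; returns 32 for x == 0 (assumed convention). */
static unsigned ctz32(uint32_t x) {
  unsigned n = 0;
  if (x == 0)
    return 32; /* the case bsf alone does not define */
  while (!(x & 1u)) { x >>= 1; n++; }
  return n;
}
```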

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off 
ESP (certain code size win, perf win on some [which?] processors).

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor.  They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

Open code rint,floor,ceil,trunc:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

//===---------------------------------------------------------------------===//

For all targets, not just X86:
When llvm.memcpy, llvm.memset, or llvm.memmove are lowered, they should be 
optimized to a few store instructions if the source is constant and the length
is smallish (< 8). This will greatly help some tests like Shootout/strcat.c.
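
A minimal sketch of the intended effect, using a 4-byte constant source chosen
purely for illustration (the function names are hypothetical):

```c
#include <string.h>

/* What the source writes: a copy of a small constant string. */
static void copy_call(char *dst) { memcpy(dst, "abc", 4); }

/* What lowering could emit instead: a few direct stores.  They are shown
   bytewise here; a backend could merge them into wider stores. */
static void copy_stores(char *dst) {
  dst[0] = 'a';
  dst[1] = 'b';
  dst[2] = 'c';
  dst[3] = '\0';
}
```

Both functions leave the same four bytes in dst; the second form avoids the
call and the load of the source bytes.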

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
	movl Y, %eax
	shll $3, %eax
	orl X, %eax
	movl %eax, X
	ret

The problem is that the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that X and Y
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.