I don't know about MIPS. I tried ARM, and the llvm code was 10-20% slower than current gcc. The tests were zlib based: a decompression by itself, and a compress followed by a decompress. I used both clang and llvm-gcc, and preferred clang because -m32 actually works on a 64-bit host. For the test in question, I found that NOT using -O2 (or -O3) when compiling the C to bytecode produced the fastest code. Instead, I linked the bytecode modules into one large module and ran a single opt step with the standard optimizations. llc was at -O2 by default, and that did help performance.
EDIT:
An interesting test comparing gcc and llvm/clang for MIPS:
    void dummy ( unsigned int );
    void dowait ( void )
    {
        unsigned int ra;
        for(ra=0x80000;ra;ra--) dummy(ra);
    }
gcc:
    9d006034 <dowait>:
    9d006034:   27bdffe8    addiu   sp,sp,-24
    9d006038:   afb00010    sw      s0,16(sp)
    9d00603c:   afbf0014    sw      ra,20(sp)
    9d006040:   3c100008    lui     s0,0x8
    9d006044:   02002021    move    a0,s0
    9d006048:   0f40180a    jal     9d006028 <dummy>
    9d00604c:   2610ffff    addiu   s0,s0,-1
    9d006050:   1600fffd    bnez    s0,9d006048 <dowait+0x14>
    9d006054:   02002021    move    a0,s0
    9d006058:   8fbf0014    lw      ra,20(sp)
    9d00605c:   8fb00010    lw      s0,16(sp)
    9d006060:   03e00008    jr      ra
    9d006064:   27bd0018    addiu   sp,sp,24
And llvm, after assembling:
    9d006034 <dowait>:
    9d006034:   27bdffe8    addiu   sp,sp,-24
    9d006038:   afbf0014    sw      ra,20(sp)
    9d00603c:   afb00010    sw      s0,16(sp)
    9d006040:   3c020008    lui     v0,0x8
    9d006044:   34440000    ori     a0,v0,0x0
    9d006048:   2490ffff    addiu   s0,a0,-1
    9d00604c:   0f40180a    jal     9d006028 <dummy>
    9d006050:   00000000    nop
    9d006054:   00102021    addu    a0,zero,s0
    9d006058:   1600fffb    bnez    s0,9d006048 <dowait+0x14>
    9d00605c:   00000000    nop
    9d006060:   8fb00010    lw      s0,16(sp)
    9d006064:   8fbf0014    lw      ra,20(sp)
    9d006068:   27bd0018    addiu   sp,sp,24
    9d00606c:   03e00008    jr      ra
    9d006070:   00000000    nop
I say "after assembling" because I have seen GNU as take something like this:
    .globl PUT32
    PUT32:
        sw $a1,0($a0)
        jr $ra
        nop
and rearrange the assembly for me:
    9d00601c <PUT32>:
    9d00601c:   03e00008    jr      ra
    9d006020:   ac850000    sw      a1,0(a0)
    9d006024:   00000000    nop
The difference between the llvm and gcc generated code is the instruction placed in the branch delay slot. I used clang and llc to produce the assembly output, then used binutils (GNU as) to create the binary. What is curious is that for my hand-written assembly:
    ori $sp,$sp,0x2000
    jal notmain
    nop
the assembler optimized it for me:
    9d006004:   0f401820    jal     9d006080 <notmain>
    9d006008:   37bd2000    ori     sp,sp,0x2000
    9d00600c:   00000000    nop
but for the llc-generated code:
    addiu $16, $4, -1
    jal dummy
    nop
it did not:
    9d006048:   2490ffff    addiu   s0,a0,-1
    9d00604c:   0f40180a    jal     9d006028 <dummy>
    9d006050:   00000000    nop