Manual performance mismatch and manual assembly

I played using assembler in Go, and I wrote the Hamming Weight function as an exercise.

I based the native version of Go on this SO answer , and the build version is based on this document from AMD (p. 180) . After comparing the two functions, I found that the native version of Go is about 1.5 times - 2x faster than the build version, despite the fact that the handwritten version of the assembly is almost identical to the output from go tool 6g -S popcount.go .

output from go test -bench=.

 PASS BenchmarkPopCount 100000000 19.4 ns/op BenchmarkPopCount_g 200000000 8.97 ns/op ok popcount 4.777s 

popcount.go

 package popcount func popCount(i uint32) uint32 // Defined in popcount_amd64.s func popCount_g(i uint32) uint32 { i = i - ((i >> 1) & 0x55555555) i = (i & 0x33333333) + ((i >> 2) & 0x33333333) return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24 } 

popcount_test.go

 package popcount import "testing" func TestPopcount(t *testing.T) { for i := uint32(0); i < uint32(100); i++ { if popCount(i) != popCount_g(i) { t.Fatalf("failed on input = %v", i) } } } func BenchmarkPopCount(b *testing.B) { for i := 0; i < bN; i++ { popCount(uint32(i)) } } func BenchmarkPopCount_g(b *testing.B) { for i := 0; i < bN; i++ { popCount_g(uint32(i)) } } 

popcount_amd64.s

 // func popCount(i uint32) uint32 TEXT ·popCount(SB),$0 MOVL i+0(FP), BP // i MOVL BP, BX // i SHRL $1, BX // i >> 1 ANDL $0x055555555, BX // (i >> 1) & 0x55555555 SUBL BX, BP // w = i - ((i >> 1) & 0x55555555) MOVL BP, AX // w SHRL $2, BP // w >> 2 ANDL $0x033333333, AX // w & 0x33333333 ANDL $0x033333333, BP // (w >> 2) & 0x33333333 ADDL BP, AX // x = (w & 0x33333333) + ((w >> 2) & 0x33333333) MOVL AX, BX // x SHRL $4, BX // x >> 4 ADDL AX, BX // x + (x >> 4) ANDL $0x00F0F0F0F, BX // y = (x + (x >> 4) & 0x0F0F0F0F) IMULL $0x001010101, BX // y * 0x01010101 SHRL $24, BX // population count = (y * 0x01010101) >> 24 MOVL BX, toReturn+8(FP) // Store result. RET // return 

output from go tool 6g -S popcount.go

 "".popCount_g t=1 size=64 value=0 args=0x10 locals=0 000000 00000 (popcount.go:5) TEXT "".popCount_g+0(SB),4,$0-16 000000 00000 (popcount.go:5) NOP , 000000 00000 (popcount.go:5) NOP , 000000 00000 (popcount.go:5) MOVL "".i+8(FP),BP 0x0004 00004 (popcount.go:5) FUNCDATA $2,gclocals┬À9308e7ef08d2cc2f72ae1228688dacf9+0(SB) 0x0004 00004 (popcount.go:5) FUNCDATA $3,gclocals┬À3280bececceccd33cb74587feedb1f9f+0(SB) 0x0004 00004 (popcount.go:6) MOVL BP,BX 0x0006 00006 (popcount.go:6) SHRL $1,BX 0x0008 00008 (popcount.go:6) ANDL $1431655765,BX 0x000e 00014 (popcount.go:6) SUBL BX,BP 0x0010 00016 (popcount.go:7) MOVL BP,AX 0x0012 00018 (popcount.go:7) ANDL $858993459,AX 0x0017 00023 (popcount.go:7) SHRL $2,BP 0x001a 00026 (popcount.go:7) ANDL $858993459,BP 0x0020 00032 (popcount.go:7) ADDL BP,AX 0x0022 00034 (popcount.go:8) MOVL AX,BX 0x0024 00036 (popcount.go:8) SHRL $4,BX 0x0027 00039 (popcount.go:8) ADDL AX,BX 0x0029 00041 (popcount.go:8) ANDL $252645135,BX 0x002f 00047 (popcount.go:8) IMULL $16843009,BX 0x0035 00053 (popcount.go:8) SHRL $24,BX 0x0038 00056 (popcount.go:8) MOVL BX,"".~r1+16(FP) 0x003c 00060 (popcount.go:8) RET , 

I know from here that the FUNCDATA lines contain information for the garbage collector, but apart from this, I do not see any glaring differences.

What could be causing this big speed difference between the two functions?

+7
assembly go hamming distance
source share
2 answers

If you look at the pseudo assembler for funnction calls, you will see that the .s (assembler) version is being called, with overhead and the .go version.

 func S() { pc := popCount(uint32(0)) _ = pc } "".S t=1 size=48 value=0 args=0x0 locals=0x10 0x0000 00000 (popcount.go:11) TEXT "".S+0(SB),$16-0 0x0000 00000 (popcount.go:11) MOVQ (TLS),CX 0x0009 00009 (popcount.go:11) CMPQ SP,(CX) 0x000c 00012 (popcount.go:11) JHI ,21 0x000e 00014 (popcount.go:11) CALL ,runtime.morestack00_noctxt(SB) 0x0013 00019 (popcount.go:11) JMP ,0 0x0015 00021 (popcount.go:11) SUBQ $16,SP 0x0019 00025 (popcount.go:11) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB) 0x0019 00025 (popcount.go:11) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB) 0x0019 00025 (popcount.go:12) MOVL $0,(SP) 0x0020 00032 (popcount.go:12) PCDATA $1,$0 0x0020 00032 (popcount.go:12) CALL ,"".popCount(SB) 0x0025 00037 (popcount.go:12) MOVL 8(SP),BX 0x0029 00041 (popcount.go:12) NOP , 0x0029 00041 (popcount.go:14) ADDQ $16,SP 0x002d 00045 (popcount.go:14) RET , func S_G() { pc := popCount_g(uint32(0)) _ = pc } "".S_G t=1 size=64 value=0 args=0x0 locals=0x8 0x0000 00000 (popcount.go:16) TEXT "".S_G+0(SB),4,$8-0 0x0000 00000 (popcount.go:16) SUBQ $8,SP 0x0004 00004 (popcount.go:16) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB) 0x0004 00004 (popcount.go:16) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB) 0x0004 00004 (popcount.go:17) MOVL $0,BP 0x0006 00006 (popcount.go:17) MOVL BP,BX 0x0008 00008 (popcount.go:17) SHRL $1,BX 0x000a 00010 (popcount.go:17) ANDL $1431655765,BX 0x0010 00016 (popcount.go:17) SUBL BX,BP 0x0012 00018 (popcount.go:17) MOVL BP,AX 0x0014 00020 (popcount.go:17) ANDL $858993459,AX 0x0019 00025 (popcount.go:17) SHRL $2,BP 0x001c 00028 (popcount.go:17) ANDL $858993459,BP 0x0022 00034 (popcount.go:17) ADDL BP,AX 0x0024 00036 (popcount.go:17) MOVL AX,BX 0x0026 00038 (popcount.go:17) SHRL $4,BX 0x0029 00041 (popcount.go:17) ADDL AX,BX 0x002b 00043 (popcount.go:17) ANDL $252645135,BX 0x0031 00049 (popcount.go:17) IMULL $16843009,BX 0x0037 00055 (popcount.go:17) SHRL $24,BX 0x003a 00058 (popcount.go:19) ADDQ $8,SP 0x003e 00062 (popcount.go:19) RET , 
+8
source share

If you really want popcnt, this has been a native instruction for X86 processors for a while. Your compiler may have a built-in for it, such as Microsoft __popcnt () (also __popcnt16 () and __popcnt64 ()) or GCC __builtin_popcount () (or __builtin_popcountll () for 64 bits).

+3
source share

All Articles