I am currently experimenting with writing highly optimized, reusable functions for a library of mine. For example, I write the function "is power of 2" as follows:
template<class IntType> inline bool is_power_of_two( const IntType x ) { return (x != 0) && ((x & (x - 1)) == 0); }
This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code with branches:
is_power_of_two PROC
    test rcx, rcx
    je   SHORT $LN3@is_power_o
    lea  rax, QWORD PTR [rcx-1]
    test rax, rcx
    jne  SHORT $LN3@is_power_o
    mov  al, 1
    ret  0
$LN3@is_power_o:
    xor  al, al
    ret  0
is_power_of_two ENDP
I also found an implementation here: "The bit twiddler", which would be coded in x64 assembly as follows:
is_power_of_two_fast PROC
    test rcx, rcx
    je   SHORT NotAPowerOfTwo
    lea  rax, [rcx-1]
    and  rax, rcx
    neg  rax
    sbb  rax, rax
    inc  rax
    ret
NotAPowerOfTwo:
    xor  rax, rax
    ret
is_power_of_two_fast ENDP
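To make the neg / sbb / inc sequence easier to follow, here is a rough C++ transcription of what the assembly routine computes. This is only a sketch for illustration: the name is_power_of_two_fast_sketch and the variable names are mine, and a compiler is free to generate different instructions from it.

```cpp
#include <cstdint>

// Hypothetical C++ transcription of is_power_of_two_fast (illustration only).
inline std::uint64_t is_power_of_two_fast_sketch(std::uint64_t x)
{
    if (x == 0)                        // test rcx, rcx / je NotAPowerOfTwo
        return 0;
    std::uint64_t t = x & (x - 1);     // lea rax, [rcx-1] / and rax, rcx
    std::uint64_t carry = (t != 0);    // neg rax sets the carry flag iff rax != 0
    std::uint64_t mask = 0 - carry;    // sbb rax, rax yields 0 or all ones
    return mask + 1;                   // inc rax yields 1 (power of two) or 0
}
```

So the routine returns 1 exactly when x is nonzero and x & (x - 1) is zero, the same condition as the template, but only the zero check is a branch.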
I tested both routines as subroutines written separately from the C++ code, in an assembly module (.asm file), and the second one is 20% faster!
However, the overhead of the function call is considerable: if I compare the second assembly routine, "is_power_of_two_fast", with the inlined version of the template function, the latter is faster, despite the branches!
Unfortunately, the new conventions for x64 specify that inline assembly is not allowed. One should use "intrinsic functions" instead.
Now the question is: can I implement the faster version, is_power_of_two_fast, as a custom intrinsic function or something similar, so that it can be used inline? Or, alternatively, is it possible to somehow force the compiler to produce a low-branch version of the function?
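For reference, one variant I could try in plain C++ replaces the short-circuiting && with a bitwise &, so that both comparisons are always evaluated and the compiler has no obligation to branch between them. This is only a sketch: the name is_power_of_two_nobranch is made up here, it assumes an unsigned integer type, and I have not verified that VC++ 2008 actually emits branch-free code for it.

```cpp
// Hypothetical low-branch variant (sketch; assumes IntType is unsigned).
template<class IntType>
inline bool is_power_of_two_nobranch( const IntType x )
{
    // Bitwise & on the two bool results: no short-circuit evaluation,
    // so the compiler may use flag-setting instructions instead of jumps.
    return ( x != 0 ) & ( ( x & ( x - 1 ) ) == 0 );
}
```

It returns the same results as is_power_of_two; whether it is faster once inlined would have to be measured.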
c++ assembly inline-assembly 64bit intrinsics
Angel sinigersky