Is there a way to write a "custom intrinsic" function for x64 instead of inline assembly?

I am currently experimenting with creating highly optimized, reusable functions for my library. For example, I wrote the function "is power of 2" as follows:

    template<class IntType>
    inline bool is_power_of_two( const IntType x )
    {
        return (x != 0) && ((x & (x - 1)) == 0);
    }

This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following code, which contains branches:

    is_power_of_two PROC
        test    rcx, rcx
        je      SHORT $LN3@is_power_o
        lea     rax, QWORD PTR [rcx-1]
        test    rax, rcx
        jne     SHORT $LN3@is_power_o
        mov     al, 1
        ret     0
    $LN3@is_power_o:
        xor     al, al
        ret     0
    is_power_of_two ENDP

I also found an implementation here: "The bit twiddler", which would be coded in x64 assembly as follows:

    is_power_of_two_fast PROC
        test    rcx, rcx
        je      SHORT NotAPowerOfTwo
        lea     rax, [rcx-1]
        and     rax, rcx
        neg     rax
        sbb     rax, rax
        inc     rax
        ret
    NotAPowerOfTwo:
        xor     rax, rax
        ret
    is_power_of_two_fast ENDP

I tested both routines, written separately from the C++ code in an assembly module (.asm file), and the second one is 20% faster!

However, the function-call overhead is considerable: when I compare the second assembly implementation, "is_power_of_two_fast", with the inlined version of the template function, the latter is faster despite its branches!

Unfortunately, the new conventions for x64 specify that inline assembly is not allowed. One should use "intrinsic functions" instead.

Now the question is: can I implement the faster version, is_power_of_two_fast, as a custom intrinsic function or something similar, so that it can be used inline? Or, alternatively, is it possible to somehow force the compiler to produce a branchless version of the function?

+6
c++ assembly inline-assembly 64bit intrinsics
4 answers

Even VC 2005 is capable of generating code that uses sbb instructions.

For the C code

    bool __declspec(noinline) IsPowOf2(unsigned int a)
    {
        return (a>=1)&((a&(a-1))<1);
    }

it compiles to the following:

    00401000  lea   eax,[ecx-1]
    00401003  and   eax,ecx
    00401005  cmp   eax,1
    00401008  sbb   eax,eax
    0040100A  neg   eax
    0040100C  cmp   ecx,1
    0040100F  sbb   ecx,ecx
    00401011  add   ecx,1
    00401014  and   eax,ecx
    00401016  ret
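The same idea can be carried over to the template from the question by replacing the short-circuit && with a bitwise &. The following is only a minimal sketch: the name is_power_of_two_nobranch is hypothetical, the restriction to unsigned types is an assumption made for the example, and whether the output is actually branch-free still depends on the compiler and its optimization settings.

```cpp
#include <type_traits>

// Hypothetical variant of the question's template (name and the
// unsigned-only restriction are assumptions made for this sketch).
template <class UIntType>
inline bool is_power_of_two_nobranch(const UIntType x)
{
    static_assert(std::is_unsigned<UIntType>::value,
                  "sketch is limited to unsigned integer types");
    // Bitwise & always evaluates both operands, so the compiler does not
    // need a conditional jump to skip the second test when x == 0.
    return ((x != 0) & ((x & (x - 1)) == 0)) != 0;
}
```

Both comparisons are cheap, so evaluating the second one unconditionally costs little compared with a possibly mispredicted jump.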
+2

No, you cannot implement any custom intrinsics; they are all built into the compiler. It is not only the instructions that are built in: the compiler also knows the semantics of each intrinsic and adapts the generated code to the surrounding code.

One reason for removing inline assembly for x86-64 is that inserting assembly in the middle of a function disturbs the optimizer and often leads to less well-optimized code around the assembler code. There can easily be a net loss!

The only real use for intrinsics is for "interesting" special instructions that the compiler cannot generate from C or C++ constructs, such as BSF or BSR. Most other things will work better using inline functions, like your template above.
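To illustrate what such a special-instruction intrinsic looks like in use, here is a rough sketch of a bit-scan wrapper. The helper name highest_set_bit is hypothetical. The sketch assumes either MSVC targeting x64, where _BitScanReverse64 is declared in <intrin.h>, or a GCC/Clang-compatible compiler providing __builtin_clzll; other toolchains would need their own path.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper: index of the highest set bit, or -1 if x is zero.
// Assumes MSVC x64 or a GCC/Clang-compatible compiler.
inline int highest_set_bit(std::uint64_t x)
{
    if (x == 0)
        return -1;  // both intrinsics leave the result undefined for zero
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long index;
    _BitScanReverse64(&index, x);   // typically maps to the BSR instruction
    return static_cast<int>(index);
#else
    return 63 - __builtin_clzll(x); // count leading zeros, then convert to an index
#endif
}
```

Because the compiler knows what the intrinsic does, the wrapper can still be inlined and optimized together with the calling code, which is not the case for a routine in a separate .asm module.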

If you need to do something special that the compiler does not understand, the only real option is to write the entire function as a separate assembler module. If the overhead of calling that function is too expensive, the optimization was probably not worth much in the first place.

Trust your compiler (tm)!

+2

VC10's x64 intrinsics would not be of much help in this simple case. The dynamic branching you see comes from the && operator, which is a short-circuit (early-out) operator. In many cases (yours is a perfect example), it is better to avoid branching by computing the result of all branches and then applying a mask to select the right one. The C++ code with masking would look like this:

    #include <type_traits>

    template<typename T_Type>
    inline bool isPowerOfTwo(T_Type const& x)
    {
        // static type checking for the example
        static_assert( std::is_integral<T_Type>::value && std::is_unsigned<T_Type>::value,
                       "limited to unsigned types for the example" );
        // 'typename' is required here because make_signed<T_Type>::type is a dependent name
        typedef typename std::make_signed<T_Type>::type s_Type;

        // same as yours but with no branching:
        // (x != 0) is moved into the sign bit, then spread over all bits by an arithmetic shift
        return bool(
            ((s_Type( s_Type(x != 0) << (s_Type(sizeof(T_Type)<<3u)-1) ))
                >> (s_Type(s_Type(sizeof(T_Type)<<3u)-1)))
            & ((x & (x - 1)) == 0)
        );
    }

The code above does not check whether the number is negative for signed types. Again, a simple mask does the trick: an arithmetic right shift by (numBits - 1) yields ~0 for negative numbers and 0 for non-negative ones.
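The sign-mask idea for signed types could be sketched roughly as follows. The name is_power_of_two_signed is hypothetical, and the sketch assumes that right-shifting a negative value is an arithmetic shift. That is guaranteed only from C++20 on and is implementation-defined before, although common compilers behave this way.

```cpp
#include <climits>
#include <type_traits>

// Hypothetical signed counterpart (name is an assumption for this sketch).
// Assumes >> on a negative value is an arithmetic shift.
template <typename T>
inline bool is_power_of_two_signed(T x)
{
    static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                  "sketch is limited to signed integer types");
    typedef typename std::make_unsigned<T>::type U;

    // ~0 (all bits set) for negative x, 0 for non-negative x.
    const T sign_mask = static_cast<T>(x >> (sizeof(T) * CHAR_BIT - 1));

    // Do the bit test on the unsigned representation so that x - 1
    // cannot overflow for the most negative value.
    const U ux = static_cast<U>(x);

    // (~sign_mask & 1) is 1 for non-negative x and 0 for negative x.
    return ((ux != 0) & ((ux & (ux - 1)) == 0) & (~sign_mask & 1)) != 0;
}
```

All three conditions are combined with bitwise &, so no short-circuit evaluation is involved; negative inputs, including the most negative value whose bit pattern has a single set bit, are rejected by the mask.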

+1

The only way forward is to step back a bit and look at the larger picture. Either stop implementing a micro-optimized API, or move on to making larger API calls that are optimized as a whole in MASM64, YASM, NASM, etc.

If you use one of the more powerful assemblers, you can turn the small functions into macros, in effect converting your C/C++-based inline assembler functions into an assembler file.

0