What is the fastest way to update a variable provided?

Question

What is the fastest way to update a variable provided?

I have a ptr and a cond condition. I need the fastest way to reset ptr if cond is true or keep ptr unchanged if cond is false . The current implementation is trivial:

 void reset_if_true(void*& ptr, bool cond) { if (cond) ptr = nullptr; }

I know that the above code performance is good, and I cannot expect its optimization to improve. However, this code is called several million times per second, and every small nanosecond is saved.

I was thinking of something that got rid of the branch, for example:

 void* p[] = { ptr, nullptr }; ptr = p[cond];

but I'm not sure if this is the best way to proceed.

+54

c ++ optimization

shrike Jun 21 '16 at 13:19

source share

5 answers

The lowest hanging fruit here is not what you think. As discussed in several other answers, reset_if_true is going to be compiled into machine code, which is as fast as you can reasonably expect to get for what it does. If it's not fast enough, you need to start thinking about how to change what it does. I see two options: one is easy, one is not very simple:

Change the calling convention:
```
 template <class T> inline T* reset_if_true(T* ptr, bool condition) { return condition ? nullptr : ptr; } 
```
and then change the caller (s) to read something like
```
 ptr_var = reset_if_true(ptr_var, expression); 
```
What this does makes it more likely that ptr_var will live in a register during the critical innermost loop that causes reset_if_true millions of times per second, and there will be no memory ptr_var associated with it, ptr_var getting forced exit to memory is the most an expensive thing in your code, as it is now; even more expensive than potentially erroneous industries. (A reasonably good compiler can do this conversion for you if reset_if_true is permeable, but this is not always possible for this.)
Change the surrounding algorithm so that reset_if_true no longer called millions of times per second.
Since you did not tell us what the surrounding algorithm is, I cannot help it. However, I can tell you that doing something related to checking the state millions of times per second probably indicates a quadratic time complexity algorithm or worse, and that always means that you should at least think about searching the best. (Maybe not better, alas.)

+16

zwol Jun 21 '16 at 16:12

source share

As long as we have sizeof(size_t) == sizeof(void*) , nullptr is represented in binary format as 0 and size_t, using all bits (or having std :: uintptr_t), you can do this:

 // typedef std::uintptr_t ptrint_t; // uncomment if you have it typedef size_t ptrint_t; // comment out if you have std::uintptr_t void reset_if_true(void*& ptr, bool cond) { ((ptrint_t&)ptr) &= -ptrint_t( !cond ); }

Note, however, that the time spent from bool to size_t is very implementation dependent and can take a branch by itself.

+11

lorro Jun 21 '16 at 14:13

source share

The code is absolutely simple.

You, of course, do a lot much faster by adding a function (unless the compiler has built it yourself). For example, inlining may mean that the pointer variable that you set to null may remain in the register.

In addition, this code is so simple, if there are any tricks that can be used to speed it up, the compiler will use them.

+5

gnasher729 Jun 21 '16 at 13:50

source share

Update: I repeated my answer.

In the following code, the idea turns the pointer into a number and multiplies it by a number (cond). Note inline . Multiplication can help use an architecture that uses pipelining.

 #include <cstdint> template <typename T> inline T* reset_if_true(T* p, bool cond) { void* ptr = (void*)p; // The optimising compiler (-O3) will get rid of unnecessary variables. intptr_t ptrint; // This is an unrecommended practice. ptrint = (intptr_t)ptr; ptrint = ptrint * cond; // Multiply the integer void* ptr2 = (void*)ptrint; T* ptrv = (T*)ptr2; return ptrv; }

Usage example:

 #include <iostream> #include <vector> void test1(){ //doulbe d = 3.141592; //typedef std::vector<double> mytype; std::vector<double> data = {3,1,4}; auto ptr = &data; std::cout << (void*)ptr << std::endl; auto ptr2 = reset_if_true(ptr, 1); //auto ptr2 = (mytype*)reset_if_true(ptr, 1); std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))).size() << std::endl; std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))).size() << std::endl; std::cout << reset_if_true(ptr, 0) << " is null? " << (reset_if_true(ptr, 0) == NULL) << // Dont dereference a null. std::endl; } void test2(){ double data = 3.141500123; auto ptr = &data; std::cout << (void*)ptr << std::endl; auto ptr2 = reset_if_true(ptr, 1); //auto ptr2 = (mytype*)reset_if_true(ptr, 1); std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))) << std::endl; std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))) << std::endl; std::cout << reset_if_true(ptr, 0) << " is null? " << (reset_if_true(ptr, 0) == NULL) << // Dont dereference a null. std::endl; } int main(){ test1(); test2(); }

Compile these flags: -O3 -std=c++14 . Exit:

 0x5690 0x5690 -> 3 0x5690 -> 3 0 is null? 1 0x5690 0x5690 -> 3.1415 0x5690 -> 3.1415 0 is null? 1

It may have problems with memory alignment if such parameters are used on the -s FORCE_ALIGNED_MEMORY=1 compiler command line. Also see reinterpret_cast . Remember to use -O3 .

Cond can be any non-zero value. There is room for performance improvement here if we know that it is no different from 0 or 1. In this case, you can use int different integer type for cond.

PS. This is an updated answer. The previous answer, which I already mentioned in my answer, had problems. The solution uses intptr_t and, of course, inline .

Used compiler options:

  em++ reset_if_true.cpp -O3 -std=c++14 -o reset_if_true.js node reset_if_true.js

+3

Sohail Si Jun 21 '16 at 13:45

source share

Cody Gray · Accepted Answer · 2016-06-21 14:37

 void reset_if_true(void*& ptr, bool cond) { if (cond) ptr = nullptr; }

The naive decision will undoubtedly be the fastest in most cases. Although it has a branch that can be slow on modern pipelined processors, it is only slow if the branch is incorrectly predicted. Since branch predictors are very good these days, if the value of cond extremely unpredictable, it is likely that a simple conditional branch is the fastest way to write code.

And if this is not so, a good compiler should know this and be able to optimize the code for something better, given the target architecture. What goes to gnasher729 point : just write the code in a simple way and leave the optimization in the hands of the optimizer.

Although this is good advice overall, sometimes it gets too far. If you really need the speed of this code, you need to check and see what the compiler actually does. Check the code of the object that it generates, and make sure that it is reasonable and that the function code becomes nested.

Such a study can be quite revealing. For example, consider x86-64, where branches can be quite expensive in cases where deviation from branching (this is really the only time when this is an interesting question, so let's assume that cond completely unpredictable). Almost all compilers are going to create the following for a naive implementation:

 reset_if_true(void*&, bool): test sil, sil ; test 'cond' je CondIsFalse mov QWORD PTR [rdi], 0 ; set 'ptr' to nullptr, and fall through CondIsFalse: ret

This is about as complicated code as you might imagine. But if you put a branch predictor in a pathological case, this may turn out to be slower than using conditional move:

 reset_if_true(void*&, bool): xor eax, eax ; pre-zero the register RAX test sil, sil ; test 'cond' cmove rax, QWORD PTR [rdi] ; if 'cond' is false, set the register RAX to 'ptr' mov QWORD PTR [rdi], rax ; set 'ptr' to the value in the register RAX ret ; (which is either 'ptr' or 0)

Conditional moves have a relatively high delay, so they are significantly slower than a well-predicted branch, but they can be faster than a completely unpredictable branch. You would expect the compiler to learn about this when targeting the x86 architecture, but would not (at least in this simple example) have knowledge of cond predictability. It assumes a simple case that branch prediction will be on your side and will generate code A instead of code B.

If you decide that you want to cause the compiler to generate unallocated code due to an unpredictable state, you can try the following:

 void reset_if_true_alt(void*& ptr, bool cond) { ptr = (cond) ? nullptr : ptr; }

This succeeds in convincing modern versions of Clang to generate branching code B, but this is complete pessimization in GCC and MSVC. If you did not check the generated assembly, you would not know. If you want to force GCC and MSVC to generate unpacking code, you will have to work harder. For example, you can use the option posted in the question:

 void reset_if_true(void*& ptr, bool cond) { void* p[] = { ptr, nullptr }; ptr = p[cond]; }

When targeting on x86, all compilers generate non-redistributable code for this, but this is not particularly beautiful code. In fact, not one of them creates conditional moves. Instead, you get multiple memory accesses to build an array:

 reset_if_true_alt(void*&, bool): mov rax, QWORD PTR [rdi] movzx esi, sil mov QWORD PTR [rsp-16], 0 mov QWORD PTR [rsp-24], rax mov rax, QWORD PTR [rsp-24+rsi*8] mov QWORD PTR [rdi], rax ret

Ugly and probably very inefficient. I would predict that he gives a conditional version of the transition for his money, even if the branch is incorrectly predicted. Of course, you would need to check this, of course, but this is probably not a very good choice.

If you were still desperate to eliminate a branch on MSVC or GCC, you would need to do something uglier, including reinterpreting the pointer bits and twisting them. Something like:

 void reset_if_true_alt(void*& ptr, bool cond) { std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr); p &= -(!cond); ptr = reinterpret_cast<void*>(p); }

This will give you the following:

 reset_if_true_alt(void*&, bool): xor eax, eax test sil, sil sete al neg eax cdqe and QWORD PTR [rdi], rax ret

Again, here we have more instructions than a simple branch, but at least they are relatively low-cost instructions. A realistic data test will tell you whether the tradeoff is worth it. And give you the rationale that you need to put in the comment if you are going to actually register a code like this.

As soon as I walked down the tiny rabbit hole, I was able to get MSVC and GCC to use conditional move instructions. Apparently, they did not do this optimization because we were dealing with a pointer:

 void reset_if_true_alt(void*& ptr, bool cond) { std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr); ptr = reinterpret_cast<void*>(cond ? 0 : p); }

 reset_if_true_alt(void*&, bool): mov rax, QWORD PTR [rdi] xor edx, edx test sil, sil cmovne rax, rdx mov QWORD PTR [rdi], rax ret

Given CMOVNE latency and a similar amount of instructions, I'm not sure if this will really be faster than the previous version. The test you used will tell you if it was.

Similarly, if we collapse the condition, we save one memory access:

 void reset_if_true_alt(void*& ptr, bool cond) { std::uintptr_t c = (cond ? 0 : -1); reinterpret_cast<std::uintptr_t&>(ptr) &= c; }

 reset_if_true_alt(void*&, bool): xor esi, 1 movzx esi, sil neg rsi and QWORD PTR [rdi], rsi ret

(that GCC. MSVC does something a little different, preferring its characteristic sequence of commands neg , sbb , neg and dec , but the two are morally equivalent. Clang converts it to the same conditional jump that we saw that it generates the above. ) This may be the best code if we need to avoid branches, given that it generates reasonable output for all tested compilers while maintaining (some) readability in the source code.

What is the fastest way to update a variable provided?

More articles: