This is my own version of @Sebastian Redl's answer, fitted more closely to the question. I will still credit his answer, and kudos to @HansPassant for his comment, which drew my attention back to the writes and made everything clear: once I noticed that the compiler was adding synchronization on the writes, the real problem turned out to be that it was not optimizing the bool case as much as one would expect.
I had a trivial program that reproduces the problem:
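    #include <atomic>
    #include <iostream>
    #include <unistd.h>   // usleep()

    std::atomic_bool foobar(true);
    //bool foobar = true;   // swap in for the non-atomic case

    long long cnt = 0;
    long long loops = 400000000ll;

    void thread_1() {
        usleep(200000);     // wait 200 ms before clearing the flag
        foobar = false;
    }

    void thread_2() {
        while (loops--) {
            if (foobar) {
                ++cnt;
            }
        }
        std::cout << cnt << std::endl;
    }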
The main difference from my original code is that the original called usleep() inside the while loop. That alone was enough to prevent any optimization of the loop. The cleaned-up code above gives the same result for the write:

but completely different for reading:

We can see that in the bool case (left), clang hoisted the if (foobar) check out of the loop. Thus, when I run the bool case, I get:
    400000000
    real    0m1.044s
    user    0m1.032s
    sys     0m0.005s
When I run the atomic_bool case, I get:
    95393578
    real    0m0.420s
    user    0m0.414s
    sys     0m0.003s
Interestingly, the atomic_bool case is faster. I think this is because it performs only about 95 million increments of the counter, as opposed to 400 million in the bool case.
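As a rough source-level picture of that hoisting (a sketch only, not the actual compiler output; the function names are hypothetical and not part of the program above): for a plain bool the compiler may assume that no other thread modifies the flag during the loop, so it is free to read it once up front.

```cpp
// Hypothetical helpers for illustration; not part of the original program.

// What the source says: test the flag on every iteration.
long long count_checked_each_iteration(const bool& flag, long long loops) {
    long long cnt = 0;
    while (loops--) {
        if (flag) {
            ++cnt;
        }
    }
    return cnt;
}

// What the optimizer may turn it into for a plain bool:
// the flag is read a single time, before the loop starts.
long long count_hoisted(const bool& flag, long long loops) {
    long long cnt = 0;
    if (flag) {                 // single read, hoisted out of the loop
        while (loops--) {
            ++cnt;
        }
    }
    return cnt;
}
```

In a single thread both functions return the same value, which is why the transformation is permitted. Once another thread writes the flag mid-loop, the hoisted version never notices the change, which matches the full 400000000 count in the bool case.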
What is even crazier is this: if I move std::cout << cnt << std::endl; out of the thread code, so that it runs after pthread_join(), the loop in the non-atomic case becomes the following:

i.e. there is no loop at all. It is just if (foobar != 0) cnt = loops; and nothing more! Clever clang. Running it then gives:
    400000000
    real    0m0.206s
    user    0m0.001s
    sys     0m0.002s
while atomic_bool remains unchanged.
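To put the two results side by side, here is a minimal sketch with hypothetical function names. The first function is the closed form the non-atomic loop collapsed to. The second is the atomic loop, written with an explicit load() to show the read that is repeated on every iteration in the atomic_bool case.

```cpp
#include <atomic>

// Hypothetical helpers for illustration; not part of the original program.

// Plain bool: the whole loop reduces to one test and one assignment.
long long count_plain_collapsed(bool flag, long long loops) {
    long long cnt = 0;
    if (flag != 0) {
        cnt = loops;
    }
    return cnt;
}

// atomic_bool: the flag is loaded on every iteration, so a store from
// another thread can be observed while the loop is still running.
long long count_atomic(const std::atomic<bool>& flag, long long loops) {
    long long cnt = 0;
    while (loops--) {
        if (flag.load()) {      // one atomic load per iteration
            ++cnt;
        }
    }
    return cnt;
}
```

The per-iteration load is what lets the atomic version stop counting partway through, as in the 95393578 result above.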
Thus, there is more than enough evidence that we should use atomics. The only thing to remember is not to put usleep() in your tests, because even a short one will prevent quite a few compiler optimizations.
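A sketch of why usleep() gets in the way, assuming a global flag with external linkage as in the program above (the names shared_flag and count_with_opaque_call are hypothetical): usleep() is an external function the compiler cannot see into, so it generally has to assume the call might modify the global and must re-read it after every call.

```cpp
#include <unistd.h>   // usleep()

// Hypothetical names for illustration; not part of the original program.
bool shared_flag = true;

long long count_with_opaque_call(long long loops) {
    long long cnt = 0;
    while (loops--) {
        usleep(0);              // opaque call: may, as far as the compiler knows, change shared_flag
        if (shared_flag) {      // so the flag is typically re-read here on each iteration
            ++cnt;
        }
    }
    return cnt;
}
```

With the call inside the loop, a plain bool can appear to behave correctly across threads, which hides the very optimization the test is meant to expose.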