Testing before setting does make a difference, but how much depends on your use case.
Either way, the data ends up in a cache line (whether you just write, or test and then set).
However, it matters whether the cache line is marked dirty (i.e. modified) or clean. Dirty cache lines must be written back to main memory, while clean cache lines can simply be discarded and refilled with new data.
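As a minimal sketch of the two variants being compared (the function names `set_always` and `set_if_different` are made up for illustration and are not taken from the question): the first stores unconditionally, the second loads and compares first, so it only stores, and only dirties the line, when the value actually changes.

```c
#include <stddef.h>

/* Hypothetical helper: plain set. Every element is stored to,
   so every cache line touched ends up dirty. */
size_t set_always(int *data, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        data[i] = value;
    return n; /* number of stores issued */
}

/* Hypothetical helper: test before set. A store is issued only when
   the element differs, so lines that already hold `value` stay clean. */
size_t set_if_different(int *data, size_t n, int value)
{
    size_t stores = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] != value) {
            data[i] = value;
            stores++;
        }
    }
    return stores; /* number of stores actually issued */
}
```

The return values only count stores so the difference is visible; whether the second form is faster depends on how often the test fails and on how well the branch predicts.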
Now suppose your code churns through a huge amount of data and touches each piece only once or twice. In that case you can assume that most memory accesses are cache misses. What happens if most of your cache lines are dirty at the point where a cache miss occurs?
They must be written back to main memory before new data is loaded into the line. This is slower than just discarding the contents of the cache line. It also doubles the memory traffic between the cache and main memory.
This may not matter for a single CPU core, since memory is fast these days, but another core will (hopefully) be doing other work as well. You can be sure that the other core will run a little faster if the bus is not busy moving cache lines in and out.
In short: keeping your cache lines clean halves that bandwidth requirement and makes cache misses a little cheaper.
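The factor of two comes from simple accounting: every miss fills one line, and every eviction of a dirty line adds one write-back of the same size. A rough sketch, assuming 64-byte cache lines (common on x86, but an assumption here) and a hypothetical helper name `miss_traffic_bytes`:

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumed cache line size */

/* Bytes moved between cache and main memory for `misses` cache misses,
   of which `dirty_evictions` have to evict a dirty line first. */
uint64_t miss_traffic_bytes(uint64_t misses, uint64_t dirty_evictions)
{
    uint64_t fills      = misses * CACHE_LINE_BYTES;          /* new data in  */
    uint64_t writebacks = dirty_evictions * CACHE_LINE_BYTES; /* old data out */
    return fills + writebacks;
}
```

With one million misses this gives 64,000,000 bytes if every evicted line is clean and 128,000,000 bytes if every evicted line is dirty. Real hardware is more complicated (prefetching, write combining, several cache levels), so treat this as a model of the argument rather than a measurement.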
Regarding the branch: sure, it is costly, but a cache miss is much worse! Also, if you're lucky, the CPU will use its out-of-order execution to overlap the cost of the branch with the cache misses.
If you really want the best possible performance from this code, and if most of your accesses are cache misses, you have two options:
Bypass the cache: The x86 architecture has non-temporal loads and stores for this purpose. They are tucked away in the SSE instruction sets and can be used from C through intrinsics.
(Experts only): Use a few lines of inline assembler that replace the test-and-set with code using the CMOV (conditional move) instruction. This will not only keep your cache lines clean but also avoid the branch. However, CMOV is a slow instruction and will only outperform a branch if the branch cannot be predicted, so you had better benchmark your code.
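For the first option, a non-temporal store can be issued from C with the SSE2 intrinsic `_mm_stream_si32`. The following is only a sketch: the function name `fill_streaming` is hypothetical, and the plain-store fallback for non-SSE2 builds is my own addition so that the code compiles everywhere.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2: _mm_stream_si32, _mm_sfence */
#define HAVE_STREAMING_STORES 1
#else
#define HAVE_STREAMING_STORES 0
#endif

/* Hypothetical helper: fill `dst` with `value` using non-temporal stores
   where available, hinting the CPU not to keep the data in the cache. */
void fill_streaming(int *dst, size_t n, int value)
{
#if HAVE_STREAMING_STORES
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value); /* store with non-temporal hint */
    _mm_sfence(); /* order the streaming stores before any later stores */
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = value; /* fallback: ordinary cached stores */
#endif
}
```

Streaming stores tend to pay off only when the data is not read again soon; if the same lines are reused shortly afterwards, the ordinary cached path is likely the better choice, so this too is worth benchmarking.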
Nils Pipenbrinck