Bit fields: set vs test-and-set (for performance)

I have a large number of instances of a C structure, like this:

struct mystruct { /* ... */ unsigned flag: 1; /* ... */ }; 
  • The flag is initially 0, but must be 1 on exit from a certain function.

The simplest implementation:

 void set_flag(struct mystruct *sp) { sp->flag = 1U; } 

But what is the likely impact of doing this instead:

 void set_flag(struct mystruct *sp) { if (sp->flag == 0U) { sp->flag = 1U; } } 

I hope to avoid writing to main memory. The first version always writes; the second writes only if the flag is not already set, but in the vast majority of cases the flag will already be set.

What other factors (e.g. branch prediction) can affect performance?

So far I have seen a slight speed increase, which I hope will become more significant as the data set grows.

Is there a risk of this change making the program slower for large data sets, and if so, in what circumstances can this happen?

+6

Tags: optimization, bit-fields

9 answers

The test before the set does make a difference, but how much depends on your use cases.

Either way, the data will end up in a cache line (whether you just write, or test-and-set).

However, there is a difference in whether your cache line is tagged as dirty (i.e. modified) or clean. Dirty cache lines must be written back to main memory, while clean cache lines can simply be forgotten and filled with new data.

Now consider that your code processes huge amounts of data, and you access each chunk of data only once or twice. If so, it can be assumed that most memory accesses are cache misses. What happens if the majority of your cache lines are dirty at the point where a cache miss occurs?

They must be written back to main memory before new data can be loaded into the line. This is slower than simply discarding the contents of the cache line, and it doubles the memory bandwidth between the cache and main memory.

This may not matter for one CPU core, since memory is very fast these days, but another CPU core will (hopefully) be doing some other work as well. You can be sure that the other core will run a little faster if the bus is not busy moving cache lines around.

In short: keeping your cache lines clean halves the bandwidth requirement and makes cache misses a little cheaper.

Regarding the branch: of course it is costly, but a cache miss is much worse! Also, if you are lucky, the CPU will use its out-of-order execution features to overlap the cost of the branch with the cache misses.

If you really want to get the best performance out of this code, and if most of your accesses are cache misses, you have two options:

  • Cache bypass: the x86 architecture has non-temporal loads and stores for this purpose. They are hidden somewhere in the SSE instruction sets and can be used from the C language via intrinsics (see the sketch after this list).

  • (Experts only): use a few lines of inline assembler, replacing the test-and-set function with assembler that uses the CMOV (conditional move) instruction. This will not only keep your cache lines clean but also avoid the branch. However, CMOV is a slow instruction and will only outperform a branch if the branch cannot be predicted, so you had better benchmark your code.
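For the cache-bypass option, here is a minimal sketch (my own, not from the answer above), assuming the flag has been widened from a bit field to a full int, since the streaming-store intrinsics operate on whole words:

    #include <emmintrin.h>  /* SSE2: _mm_stream_si32 */

    struct mystruct { /* ... */ int flag; /* widened from a 1-bit field */ };

    /* Store 1 without dirtying a cache line: the write goes through a
     * write-combining buffer instead. Streaming stores are weakly ordered,
     * so an _mm_sfence() may be needed before other threads read the data. */
    void set_flag_nt(struct mystruct *sp)
    {
        _mm_stream_si32(&sp->flag, 1);
    }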

+10

This is an interesting question, and Niels' answer on cache lines is great advice.

I would like to emphasize the importance of profiling your code to measure real performance. Can you measure how often the flag will already be set in the data you encounter? Performance may vary a lot depending on the answer.

Just for fun, I used your code to run a small comparison of set versus test-then-set on an array of 50 million elements, filled with varying proportions of 1s. Here is the graph:

set vs. test-then-set comparison: http://natekohl.net/media/set-comp.png

This is, of course, just a toy example. But notice the non-linear performance, which I did not expect, and that test-then-set becomes faster than the plain set when the array is almost completely filled with 1s.
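For reference, a harness along these lines reproduces the experiment; this is a sketch of the idea, not the exact code used:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 50000000

    struct mystruct { unsigned flag : 1; };

    int main(void)
    {
        struct mystruct *a = malloc(N * sizeof *a);
        double fill = 0.9;   /* proportion of flags pre-set to 1; vary per run */

        srand(42);
        for (size_t i = 0; i < N; i++)
            a[i].flag = ((double)rand() / RAND_MAX < fill) ? 1u : 0u;

        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++)
            if (a[i].flag == 0u)      /* test-then-set; drop the test for plain set */
                a[i].flag = 1u;
        clock_t t1 = clock();

        printf("fill=%.2f: %.3f s\n", fill, (double)(t1 - t0) / CLOCKS_PER_SEC);
        free(a);
        return 0;
    }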

+3

This is my interpretation of your requirements:

  • the flag is initialized to 0 separately
  • it is set only once (to 1) and never reset afterwards
  • but this set will be attempted many times on the same flag
  • and you have a large number of these flag instances (each of which needs the same treatment)

Assuming that

  • space optimization is weighted significantly lower than time optimization,

I suggest the following.

  • First, on 32-bit systems it helps to use a full 32-bit integer rather than a bit field, if you are worried about access time.
  • If you skip the check and just write the flag word, the write itself will be pretty fast. But given that you have a very large number of flags that you will keep checking, and setting only if not already set, it is probably better to keep the conditional check (see the sketch after this list).
  • That said, if your platform performs operations in parallel (for example, a write to disk can usually proceed in parallel with code execution), it may be worthwhile to skip the check.
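A sketch of that suggestion as I read it (the widened flag plus the retained check; the names are illustrative):

    #include <stdint.h>

    struct mystruct {
        /* ... */
        uint32_t flag;   /* a full word instead of a 1-bit field */
        /* ... */
    };

    void set_flag(struct mystruct *sp)
    {
        if (sp->flag == 0u)   /* keep the check: skip the write when already set */
            sp->flag = 1u;
    }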
+2

This optimization is unlikely to make things slower as you move to a larger data set.

The caching behaviour when reading values will be the same, the branch-prediction penalties will be the same, and these are the key factors for this optimization.

Branch prediction stores a history per branch instruction, so it does not matter how many data instances you have, as long as you reach them through branch instructions at different addresses (for example, by inlining the function). If you have a single function entity (not inlined), you will have a single branch instruction for all of the data, and that will suppress branch prediction, making it miss more often and increasing the penalties.
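A sketch of the inlined form being recommended here, reusing the struct from the question:

    struct mystruct { /* ... */ unsigned flag : 1; /* ... */ };

    /* Each call site this expands into gets its own branch instruction,
     * and therefore its own branch-prediction history. */
    static inline void set_flag(struct mystruct *sp)
    {
        if (sp->flag == 0u)
            sp->flag = 1u;
    }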

+1

You can always profile it, but I am pretty sure the first version will be both faster and less obscure.

0

Either approach requires the data to be loaded into the cache, so the only saving is the difference between a read-plus-write and a plain write.

I don’t see how this change can make your code slower with large data sets, so you are probably safe enough on this front.

It smells a bit like premature optimization (assuming your profiling has not identified this as a bottleneck).

As with all performance questions, the best way to be sure about a code change is to measure it. You should be able to generate a large amount of test data relatively easily.

0

If you really care about time performance, change the flag to a full int instead of a bit field. Then setting it is just a write, not a read-modify-write as it is with bit fields.
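To illustrate (a sketch; the exact instructions depend on the compiler and target):

    struct with_bitfield { unsigned flag : 1; unsigned rest : 31; };
    struct with_int      { unsigned flag;     unsigned rest;      };

    /* Typically compiles to: load the word, OR in the bit, store it back. */
    void set_bf(struct with_bitfield *p)  { p->flag = 1u; }

    /* Typically compiles to a single store of the constant. */
    void set_int(struct with_int *p)      { p->flag = 1u; }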

But, as already mentioned, it smells like micro-optimization.

0

The test before the set makes no sense: the code without the test is cleaner and also a little faster.

As a side note, inlining functions like this makes sense, because the overhead of a function call is larger than the function body itself, although an optimizing compiler should do it without a second thought.

0

Since no one said this, I will.

Why are you using a bit field at all? The layout differs from compiler to compiler, so bit fields are useless for interfaces. They may or may not be more space-efficient; the compiler may simply decide to put them in a 32-bit field to pack things conveniently. There is no guarantee that they are faster, and in fact they are likely to be slower.
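For example (illustrative only), the standard does not pin down how these bits are laid out:

    /* Whether 'version' sits in the low or high bits, how the fields are
     * packed, and how much padding follows are all implementation-defined,
     * so this struct cannot portably describe a wire format or register. */
    struct header_bits {
        unsigned version : 4;
        unsigned type    : 4;
    };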

I have banned their use at work: unless someone can give me a convincing reason that they provide any extra capability, they should not be playing with them.

0
