In C#, does copying a member variable to a local stack variable improve performance?

I quite often write code that copies a member variable into a local stack variable, believing that it will improve performance by avoiding the pointer dereference that must occur when accessing a member variable.

Is that actually true?

For example:

    public class Manager
    {
        private readonly Constraint[] mConstraints;

        public void DoSomethingPossiblyFaster()
        {
            var constraints = mConstraints;
            for (var i = 0; i < constraints.Length; i++)
            {
                var constraint = constraints[i];
                // Do something with it
            }
        }

        public void DoSomethingPossiblySlower()
        {
            for (var i = 0; i < mConstraints.Length; i++)
            {
                var constraint = mConstraints[i];
                // Do something with it
            }
        }
    }

My thinking is that DoSomethingPossiblyFaster is actually faster than DoSomethingPossiblySlower.

I know this is pretty much a micro-optimization, but it would be helpful to have a definitive answer.

Edit: Just to add some background. Our application needs to process a lot of data coming from telecommunication networks, and this method is likely to be called about a billion times a day on some of our servers. I believe every little bit helps, and sometimes all I am trying to do is give the compiler a hint.

+8
performance c#
4 answers

If the compiler/JIT does not already perform this or a similar optimization for you (that is a big if), then DoSomethingPossiblyFaster should be faster than DoSomethingPossiblySlower . The best way to explain why is to look at a rough translation of the C# code into plain C.

When a non-static member function is called, a hidden this pointer is passed to the function. You would have something like the following (ignoring virtual function dispatch, since it has nothing to do with the question; or, equivalently, assuming Manager is sealed for simplicity):

    struct Manager
    {
        Constraint* mConstraints;
        int mLength;
    };

    void DoSomethingPossiblyFaster(Manager* this)
    {
        Constraint* constraints = this->mConstraints;
        int length = this->mLength;
        for (int i = 0; i < length; i++)
        {
            Constraint constraint = constraints[i];
            // Do something with it
        }
    }

    void DoSomethingPossiblySlower(Manager* this)
    {
        for (int i = 0; i < this->mLength; i++)
        {
            Constraint constraint = (this->mConstraints)[i];
            // Do something with it
        }
    }

The difference is that in DoSomethingPossiblyFaster , constraints lives on the stack, and access requires only one level of indirection, since it sits at a fixed offset from the stack pointer. In DoSomethingPossiblySlower , if the compiler misses the optimization opportunity, there is an extra pointer dereference: the compiler must read this from a fixed offset from the stack pointer, and then read mConstraints at a fixed offset from this .

There are two possible optimizations the compiler may perform that would negate this difference:

  • The compiler can do what you did manually and cache mConstraints on the stack.

  • The compiler can keep this in a register, so it does not need to reload it from the stack at each iteration of the loop before dereferencing it. This means that fetching mConstraints via this versus from the stack is essentially the same operation: a single dereference at a fixed offset from a pointer that is already held in a register.

+4

What is more readable? That should usually be your main motivating factor. Do you even need to use a for loop instead of foreach ?
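For comparison, here is a sketch of the foreach form of the loop from the question. The Constraint type, its Weight field, and the summing body are made-up stand-ins for whatever the real loop does:

```csharp
using System;

public class Constraint
{
    public int Weight;
}

public class Manager
{
    private readonly Constraint[] mConstraints;

    public Manager(Constraint[] constraints)
    {
        mConstraints = constraints;
    }

    // The foreach equivalent of the question's loops: the C# compiler
    // expands foreach over an array into an indexed for loop, so this
    // reads better at no extra cost.
    public int SumWeights()
    {
        var total = 0;
        foreach (var constraint in mConstraints)
        {
            total += constraint.Weight; // stand-in for "do something with it"
        }
        return total;
    }
}

public static class Demo
{
    public static void Main()
    {
        var manager = new Manager(new[]
        {
            new Constraint { Weight = 1 },
            new Constraint { Weight = 2 },
            new Constraint { Weight = 3 },
        });
        Console.WriteLine(manager.SumWeights()); // prints 6
    }
}
```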

Since mConstraints is readonly , I could well imagine the JIT compiler doing this for you; but really, what are you doing in the loop? The chances of this being significant are pretty small. I would almost always pick the second approach just for readability, and I would prefer foreach where possible. Whether the JIT compiler optimizes this case will depend very much on the JIT itself, which can vary between versions and architectures, and even on how large the method is, among other factors. There can be no "definitive" answer here, as it is always possible that an alternative JIT will optimize differently.

If you believe you are in a corner case where this really matters, you should benchmark it carefully, with the most realistic data you can get. Only then should you change your code away from the most readable form. If you find yourself writing code like this "quite often", it seems unlikely that all of those places are genuinely performance-critical.
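A minimal timing harness for this kind of comparison might look like the following sketch. It uses System.Diagnostics.Stopwatch; the Constraint type, the Weight field, and the summing loop body are placeholder assumptions rather than the question's real code, and a tool like BenchmarkDotNet would give more trustworthy numbers, but this shows the shape of the experiment:

```csharp
using System;
using System.Diagnostics;

public class Constraint
{
    public int Weight;
}

public class Manager
{
    private readonly Constraint[] mConstraints;

    public Manager(Constraint[] constraints) { mConstraints = constraints; }

    public long SumViaLocalCopy()
    {
        var constraints = mConstraints; // manual hoist into a local
        long total = 0;
        for (var i = 0; i < constraints.Length; i++)
            total += constraints[i].Weight;
        return total;
    }

    public long SumViaField()
    {
        long total = 0;
        for (var i = 0; i < mConstraints.Length; i++)
            total += mConstraints[i].Weight;
        return total;
    }
}

public static class Bench
{
    public static void Main()
    {
        var data = new Constraint[10_000];
        for (var i = 0; i < data.Length; i++)
            data[i] = new Constraint { Weight = i % 7 };
        var manager = new Manager(data);

        // Warm up so the JIT compiles both methods before timing.
        manager.SumViaLocalCopy();
        manager.SumViaField();

        const int iterations = 10_000;
        long checksum = 0;

        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++) checksum += manager.SumViaLocalCopy();
        sw.Stop();
        Console.WriteLine($"local copy:   {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (var i = 0; i < iterations; i++) checksum += manager.SumViaField();
        sw.Stop();
        Console.WriteLine($"field access: {sw.ElapsedMilliseconds} ms");

        // Use the result so the loops cannot be optimized away entirely.
        Console.WriteLine($"checksum: {checksum}");
    }
}
```

Consuming the accumulated checksum at the end matters: without it, an aggressive JIT could discard one or both loops and the timings would be meaningless.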

Even if the difference in readability is relatively small, I would say it is still present and significant, whereas I would certainly expect the difference in performance to be negligible.

+16

You know the answer you are going to get, right? "Time it."

There is probably no definitive answer. First, the compiler may do the optimization for you. Second, even if it does not, indirect addressing at the assembly level may not be significantly slower. Third, it depends on the cost of creating the local copy relative to the number of loop iterations. And then there are caching effects to consider.

I love to optimize, but this is one place where I would definitely wait until there is an actual problem, then experiment. It is an optimization that can easily be added when needed, rather than one that must be planned up front to avoid a massive ripple effect later.


Edit (to give a more definitive answer):

Compiling both functions in release mode and examining the IL with ILDasm shows that where the "PossiblyFaster" function uses the local variable, it needs a single instruction:

    ldloc.0

versus two for the direct field access:

    ldarg.0
    ldfld class Constraint[] Manager::mConstraints

Of course, this is still one layer removed from machine code; you do not know what the JIT compiler will do for you. But it is likely that "PossiblyFaster" will be slightly faster.
However, I still would not recommend adding the extra variable until you are sure this method is the most expensive one in your system.

+3

I profiled this and came up with a bunch of interesting results that are probably only applicable to my specific example, but I thought they were worth noting here.

The fastest was the x86 release build. It ran one iteration of my test in 7.1 seconds, whereas the equivalent x64 code took 8.6 seconds. This was over 5 iterations, each iteration processing the loop 19.2 million times.

The fastest approach for the loop was:

    foreach (var constraint in mConstraints)
    {
        ... do stuff ...
    }

The second fastest approach, which surprised me quite a bit, was the following:

    for (var i = 0; i < mConstraints.Length; i++)
    {
        var constraint = mConstraints[i];
        ... do stuff ...
    }

I guess this was because mConstraints was kept in a register for the duration of the loop.

This slowed down when I removed the readonly modifier from mConstraints.

So my summary is that readonly also helps performance in this situation.

+1
