Breaking one loop into two loops

Good day,

Suppose you have a simple loop as shown below ...

for (int i = 0; i < 10; i++) {
    // statement 1
    // statement 2
}

Suppose statement 1 and statement 2 were O(1). Apart from the small overhead of "starting" another loop, would breaking that single loop into two loops (not nested, but sequential) be equally fast? For example ...

for (int i = 0; i < 10; i++) {
    // statement 1
}
for (int i = 0; i < 10; i++) {
    // statement 2
}

The reason I ask such a silly question is that I have a collision detection system (CDS) that has to loop through all of my objects. I want to "split" the functionality of my CDS so that I can simply call

 cds.update(objectlist); 

instead of having to break up the CDS system. (Don't worry too much about my CDS implementation ... I think I know what I'm doing, I just don't know how to explain it. What I really need to know is whether I will take a big performance hit by looping through all of my objects again.)
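For concreteness, here is a minimal sketch of the two designs being compared. The Object type and the member functions are hypothetical stand-ins, not the actual CDS implementation:

#include <vector>

struct Object { /* position, bounds, ... */ };

struct CDS {
    // Option A: one loop, both statements per iteration.
    void update(std::vector<Object>& objects) {
        for (Object& o : objects) {
            integrate(o);       // "statement 1"
            checkCollisions(o); // "statement 2"
        }
    }

    // Option B: two sequential loops over the same objects.
    void updateSplit(std::vector<Object>& objects) {
        for (Object& o : objects) integrate(o);
        for (Object& o : objects) checkCollisions(o);
    }

private:
    void integrate(Object&)       { /* hypothetical pass 1 */ }
    void checkCollisions(Object&) { /* hypothetical pass 2 */ }
};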

+7
6 answers

It depends on your application.

Possible downsides of splitting:

  • your data does not fit the L1 data cache, so you load it once for the first loop and then reload it for the second loop

Possible gains of splitting:

  • your loop contains many variables, so splitting helps reduce register/stack pressure, and the optimizer turns it into better machine code
  • the functions you use trash the L1 instruction cache, so the cache is reloaded on every iteration, while by splitting you manage to have it loaded once (only) at the first iteration of each loop

These lists are certainly not exhaustive, but you can already sense that there is a tension between code and data. So it is hard for us to make an educated (or wild) guess when we know neither.
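As an illustration of the first downside, here is a minimal sketch; the array size and the per-element work are assumptions chosen purely so the data exceeds a typical 32 KiB L1 data cache:

#include <cstddef>

constexpr std::size_t N = 1 << 20; // ~4 MiB of floats, far larger than L1
float data[N];

// Fused: data[i] is touched by both statements while it is still in cache.
void fused() {
    for (std::size_t i = 0; i < N; ++i) {
        data[i] *= 2.0f; // statement 1
        data[i] += 1.0f; // statement 2
    }
}

// Split: the second loop streams the whole array through the cache again.
void split() {
    for (std::size_t i = 0; i < N; ++i) data[i] *= 2.0f; // statement 1
    for (std::size_t i = 0; i < N; ++i) data[i] += 1.0f; // statement 2
}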

When in doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed, and measure the time spent.
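A minimal timing sketch, assuming the fused/split versions above (for the cache misses themselves, callgrind is the right tool):

#include <chrono>
#include <cstdio>

void fused(); // from the sketch above
void split();

// Run f once and return the elapsed wall-clock time in seconds.
template <typename F>
double seconds(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::printf("fused: %.6f s\n", seconds(fused));
    std::printf("split: %.6f s\n", seconds(split));
}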

+2

In terms of algorithmic complexity, splitting the loops makes no difference.

In terms of real-world performance, splitting the loops can improve performance, degrade performance, or make no difference; it depends on the OS, the hardware and, of course, what statement 1 and statement 2 actually are.

+3

With two loops you will pay for:

  • increased size of generated code
  • 2x as many branches to predict
  • depending on the data structures used by statement 1 and statement 2, you could be reloading data into the cache

The last point can have a huge impact in either direction. You should measure, as with any optimization.

+2

As far as big-O complexity is concerned, it makes no difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimization goes, it is hard to say. The cost of a loop is rather small, and we don't know what the cost of accessing your objects is (if they are in a vector, it should be rather small too), but there is a lot to consider before giving a useful answer.
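Spelled out with assumed constant per-iteration costs $c_1$, $c_2$ and per-iteration loop overhead $c_\ell$:

$$T_{\text{one loop}}(n) = n\,(c_\ell + c_1 + c_2), \qquad T_{\text{two loops}}(n) = 2n\,c_\ell + n\,(c_1 + c_2)$$

Both are linear in $n$, hence $O(n)$; splitting only adds the extra $n\,c_\ell$ of loop overhead.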

+1

As already noted, the complexity remains the same.

But in terms of real-world performance, we cannot predict which version will be faster. The following factors play a big role:

  • data caching
  • instruction caching
  • speculative execution
  • branch prediction
  • branch target buffers
  • the number of available registers on the CPU
  • cache sizes

(note: over all of them hangs the sword of Damocles of misprediction; all of these are wikipedizable and googlable)

In particular, the last factor sometimes makes it impossible to compile the one true code for software whose performance depends on specific cache sizes. Some applications will run faster on a CPU with huge caches and slower on small caches, while for some other applications it will be the opposite.

Solutions:

  • Let your compiler do the loop fission. Modern g++ is quite good at that discipline. Another discipline g++ is good at is automatic vectorization. Keep in mind that compilers know more about computer architecture than almost any human.
  • Ship different binaries and a dispatcher (see the sketch after this list).
  • Use cache-oblivious data structures/layouts and algorithms that adapt to their target cache.
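A minimal sketch of the dispatch idea, using GCC/Clang x86 builtins; the two variants here are hypothetical placeholders:

#include <cstdio>

static void updateGeneric() { std::puts("portable path"); }
static void updateAVX2()    { std::puts("AVX2 path"); } // would be built with -mavx2

// One binary carries several variants and picks one at run time,
// based on the CPU it actually finds itself on.
void update() {
    __builtin_cpu_init(); // initialize the CPU feature probe
    if (__builtin_cpu_supports("avx2"))
        updateAVX2();
    else
        updateGeneric();
}

int main() { update(); }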

It is always worthwhile to put effort into software that adapts to its target, ideally without sacrificing code quality. And before doing any manual optimization, whether micro or macro, measure real-world runs; then, and only then, optimize.

References:

  • Agner Fog's guides
  • Intel's guides

+1

You correctly note that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally as fast": this overhead, while small, is still overhead.

I won't try to speak intelligently about how collision systems should be designed, but if you are trying to optimize performance, it is better to avoid building unnecessary control structures if you can manage it without pulling your hair out.

Remember that premature optimization is one of the worst things you can do. Worry about optimization when you actually have a performance problem, in my opinion.

0
