As already noted, the complexity stays the same. But in the real world we cannot predict which version is faster. The following factors play a huge role:
- Data caching
- Instruction caching
- Speculative execution
- Branch prediction
- Branch target buffers
- The number of registers available on the CPU
- Cache sizes
(Note: hanging over all of these is the sword of Damocles of misprediction, and all of them are easy to look up on Wikipedia or Google.)
In particular, the last factor sometimes makes it impossible to compile one "true" binary for code whose performance depends on the exact cache sizes: some applications run faster on a CPU with huge caches and slower on one with small caches, while for other applications it is the other way around.
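To see that dependence on a concrete machine, a classic trick is a pointer-chasing micro-benchmark: every load depends on the previous one, so the cost per access jumps each time the working set outgrows another cache level. The sketch below is my own illustration (function names and constants are not from the original answer), not a rigorous benchmark:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Chase a random cyclic permutation of `n` indices. Every load depends on
// the previous one, so neither the prefetcher nor the optimizer can hide
// the memory latency; the cost per access typically jumps whenever the
// working set (n * sizeof(std::size_t)) outgrows another cache level.
double ns_per_access(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{12345});

    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];   // one big cycle

    const std::size_t steps = std::size_t{1} << 22;  // dependent loads
    std::size_t pos = order[0];

    auto start = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s)
        pos = next[pos];
    auto stop = std::chrono::steady_clock::now();

    static volatile std::size_t sink;
    sink = pos;                                // keep the chase observable

    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(steps);
}

int main() {
    // Working sets from roughly 8 KiB up to 128 MiB (8-byte elements).
    for (std::size_t n = std::size_t{1} << 10; n <= (std::size_t{1} << 24); n <<= 1)
        std::printf("%10zu elements: %6.2f ns/access\n", n, ns_per_access(n));
}
```

On a machine with larger caches the jumps simply occur at larger working-set sizes, which is exactly why a single binary cannot be optimal everywhere.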
Solutions:
- Let your compiler do the loop transformations. Modern g++ is quite good at that discipline; another discipline it is good at is automatic vectorization (see the first sketch after this list). Keep in mind that compilers know more about computer architecture than almost any human.
- Ship different binaries plus a dispatcher (a sketch of one dispatch mechanism follows the list).
- Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache (see the transpose sketch below).
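A minimal sketch for the first point, assuming g++: write the loop plainly and ask the compiler for a vectorization report. The flags are standard GCC options; the function itself is just an illustrative example, not from the original answer:

```cpp
#include <cstddef>

// A plain, dependence-free loop like this is a good candidate for the
// auto-vectorizer. Compile with, for example:
//   g++ -O3 -march=native -fopt-info-vec-optimized -c saxpy.cpp
// and GCC reports which loops it vectorized.
void saxpy(float a, const float* x, float* __restrict y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The `__restrict` hint (a GCC/Clang extension) tells the compiler that `y` does not alias `x`, which often decides whether vectorization is possible at all.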
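For the second point, one concrete mechanism (my assumption of what such a dispatcher could look like, not something the original answer spells out) is GCC/Clang function multi-versioning: the compiler emits several clones of a function, and a resolver selects one at load time based on the CPU the program actually runs on:

```cpp
#include <cstddef>

// Function multi-versioning: one source function, several machine-code
// clones (AVX2, SSE4.2, generic); an ifunc resolver picks the best clone
// for the running CPU.
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float* v, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        v[i] *= s;
}
```

The coarser-grained variant of the same idea is shipping entirely separate binaries and picking one at install or start-up time, for example after checking `__builtin_cpu_supports("avx2")`.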
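And for the third point, the textbook example of a cache-oblivious algorithm is a recursive matrix transpose: it never mentions a cache size, yet the recursion eventually reaches sub-blocks that fit in whatever caches the machine happens to have. A minimal out-of-place sketch, with names of my own choosing:

```cpp
#include <cstddef>

// Transpose the block [r0, r1) x [c0, c1) of an n x n row-major matrix
// `in` into `out` (in and out must not overlap). Always splitting the
// larger dimension yields blocks that fit in cache at *some* recursion
// depth, whatever the cache size is -- hence "cache-oblivious".
void transpose(const double* in, double* out, std::size_t n,
               std::size_t r0, std::size_t r1,
               std::size_t c0, std::size_t c1) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {        // small base case
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                out[c * n + r] = in[r * n + c];
    } else if (r1 - r0 >= c1 - c0) {             // split the rows
        std::size_t rm = r0 + (r1 - r0) / 2;
        transpose(in, out, n, r0, rm, c0, c1);
        transpose(in, out, n, rm, r1, c0, c1);
    } else {                                     // split the columns
        std::size_t cm = c0 + (c1 - c0) / 2;
        transpose(in, out, n, r0, r1, c0, cm);
        transpose(in, out, n, r0, r1, cm, c1);
    }
}

// Usage: transpose(in, out, n, 0, n, 0, n);
```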
It is always worthwhile to put effort into software that adapts to its target, ideally without sacrificing code quality. And before doing any manual optimization, whether micro or macro, measure real-world runs; then, and only then, optimize.
References:
- Agner Fog's optimization guides
- Intel's optimization guides
Sebastian Mach