Modern CPU inner loop indirection optimizations

From http://www.boost.org/community/implementation_variations.html

"... coding differences, such as changing a class from virtual to non-virtual members or removing a level of indirection, are unlikely to make any measurable difference unless deep in an inner loop. And even in an inner loop, modern CPUs often execute such competing code sequences in the same number of clock cycles!"

I am trying to understand the "even in the inner loop" part. In particular, what mechanisms do CPUs implement to execute the two code sequences (virtual vs. non-virtual, or an extra level of indirection) in the same number of clock cycles? I know about pipelining and instruction caching, but how can a virtual call execute in the same number of clock cycles as a non-virtual one? How does the indirection get "lost"?

+7
c++ performance cpu-registers
5 answers

Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like "hit under miss" that don't stall the pipeline), and out-of-order execution will probably help transform a load-load-branch into something closer to a fixed branch. Instruction folding/elimination (what is the proper term for this?) in the decode or branch-prediction stage of the pipeline may also help.

All of this depends on a lot of different things: how many different branch targets there are (e.g. how many different virtual overloads you are likely to trigger), how many things you loop over (is the branch target cache warm? what about the icache/dcache?), how the virtual tables or indirection tables are laid out in memory (are they cache-friendly, or does each new vtable load possibly evict an old vtable?), whether the cache is being invalidated repeatedly because of multicore ping-ponging, and so on.

(Disclaimer: I am definitely not an expert here, and a lot of my knowledge comes from studying embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)

The right way to determine whether this will be a problem for a specific program is, of course, to profile it. If possible, do so with the help of hardware counters; they can tell you a lot about what is happening in the various stages of the pipeline.


Edit:

As Hans Passant notes in a comment on the question Modern CPU inner loop indirection optimizations, the key to these two things taking the same amount of time is the ability to effectively "retire" more than one instruction per cycle. Instruction elimination can help with this, but superscalar design is probably more important ("hit under miss" is a very small and specific example; fully redundant load units might be a better one).

Let's take an ideal situation and suppose that a direct branch is just one instruction:

 branch dest 

... and the indirect branch is three (perhaps you can shorten it somewhat, but it is always more than one):

 load vtable from this
 load dest from vtable
 branch dest

Let's assume the situation is absolutely perfect: *this and the entire vtable are in L1 cache, and the L1 cache is fast enough to support an amortized one-cycle-per-instruction cost for the two loads. (You can even assume the processor reordered the loads and interleaved them with earlier instructions so they had time to complete before the branch; it does not matter for this example.) Also assume the branch target cache is hot, there is no pipeline flush cost for the branch, and the branch instruction comes down to one cycle (amortized).

The theoretical minimum time for the first example is therefore 1 cycle (amortized).

The theoretical minimum for the second example, absent instruction elimination, redundant functional units, or anything else that allows retiring more than one instruction per cycle, is 3 cycles (there are 3 instructions)!

The indirect version will always be slower, because there are more instructions, until you reach something like a superscalar design that allows more than one instruction to be retired per cycle.

Once you have that, the minimum for both examples becomes something between 0 and 1 cycles, again assuming everything else is ideal. Arguably the second example needs more ideal circumstances than the first to actually reach that theoretical minimum, but it is now possible.

In the cases you would actually care about, you probably will not reach that minimum for either example. Either the branch target cache will be cold, or the vtable will not be in the data cache, or the machine will not be able to reorder the instructions to take full advantage of the redundant functional units.

... and this is where profiling comes in, which is generally a good idea anyway.

First, though, you can always adopt a little paranoia about virtuals. See Noel Llopis's article on data oriented design, the excellent Pitfalls of Object-Oriented Programming slides, and Mike Acton's grumpy-yet-educational presentations. Then you will suddenly have moved into patterns that the CPU is already likely to be happy with, if you are processing a lot of data.

High-level language features such as virtual are usually a tradeoff between expressiveness and control. I honestly think, though, that simply by increasing your awareness of what virtual actually does (do not be afraid to read the disassembly from time to time, and definitely look at your CPU's architecture manuals), you will tend to use it when it makes sense and not when it does not, and a profiler can cover the rest if needed.

One-size-fits-all claims like "don't use virtual" or "virtual use is unlikely to make a measurable difference" make me grumpy. The reality is usually more complicated, and either you will be in a situation where you care enough to profile (or to avoid it), or you are in the other 95% where you probably should not care, except for the possible educational value.

+4

Pipelining is the main mechanism.

It might take 20 clock cycles to load an instruction, decode it, perform its operations, and load the indirect memory references. But thanks to the pipeline, the processor can be executing parts of 19 other instructions at the same time in different pipeline stages, giving an overall throughput of 1 instruction per clock cycle, no matter how long it actually takes to feed one instruction through the pipeline.

+4

What happens, I think, is that the processor has a special cache that holds the locations and targets of branches and indirect jumps. If an indirect jump is encountered at location $12345678, and the last time it was encountered it went to address $12348765, the processor can start speculative execution of the instructions at address $12348765 even before it has resolved the branch address. In many cases, inside a function's inner loop, a particular indirect jump will always jump to the same address for the duration of the loop. The indirect-jump cache can thus avoid the branching penalties.

+1

Modern processors use adaptive branch prediction techniques that can predict many indirect jumps, such as the vtable-based virtual function calls you are asking about. See http://en.wikipedia.org/wiki/Branch_prediction#Prediction_of_indirect_jumps

+1

If the CPU already has the target memory address in its cache, then executing the extra load instruction is close to trivial.

0
