Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like "hit under miss" that don't stall the pipeline), and out-of-order execution are likely to help transform a load-load-branch into something closer to a fixed branch. Instruction folding/elimination (what's the proper term for this?) in the decode or branch-prediction stage of the pipeline may also help.
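To make the comparison concrete, here's a minimal C++ sketch (the names are mine, purely illustrative) of the two call styles in question; on typical implementations the virtual call compiles to roughly the load-load-branch sequence described above, while the non-virtual call is a fixed branch or gets inlined away:

    struct Base {
        virtual int area() const { return 0; }   // dispatched through the vtable
        int area_direct() const { return 0; }    // ordinary, statically bound call
    };

    int call_virtual(const Base* b) {
        // typically: load the vtable pointer from *b, load the function pointer
        // from the vtable slot, then branch indirectly through it
        return b->area();
    }

    int call_direct(const Base* b) {
        // typically a direct branch to a fixed address (or no branch at all, if inlined)
        return b->area_direct();
    }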
All of this depends on a lot of things, though: how many different targets the branch has (e.g. how many different virtual overrides you're likely to invoke), how many things you're looping over (is the branch target cache warm? how about the icache/dcache?), how the vtables or indirection tables are laid out in memory (are they cache-friendly, or does each new vtable load potentially evict an old one?), whether the cache is being invalidated repeatedly because of multicore ping-ponging, and so on.
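As a rough illustration of the "how many different targets" point (type names here are hypothetical), compare looping over a homogeneous collection with looping over a mixed one:

    #include <memory>
    #include <vector>

    struct Shape  { virtual ~Shape() = default; virtual double area() const = 0; };
    struct Circle : Shape { double r = 1; double area() const override { return 3.14159 * r * r; } };
    struct Square : Shape { double s = 1; double area() const override { return s * s; } };

    double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
        double sum = 0;
        for (const auto& s : shapes)
            sum += s->area();   // if every element is a Circle, this indirect branch
                                // always lands in the same place and predicts well;
                                // a random mix of Circles and Squares is much harder
                                // on the branch target cache, and touches more vtables
        return sum;
    }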
(Disclaimer: I'm definitely not an expert here, and a lot of my knowledge comes from working with embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)
The right way to determine whether this will be a problem for a particular program is, of course, to profile. If you can, do so with the help of hardware counters; they can tell you a lot about what's going on in the various stages of the pipeline.
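Hardware counters are platform-specific (on Linux, for instance, perf exposes branch-miss and cache-miss counts), but even a crude wall-clock microbenchmark can tell you whether dispatch is anywhere near your hot path. A minimal sketch, with made-up type names and iteration counts, might look like this; beware that an optimizer may devirtualize or inline away trivial loops like these, so treat it only as a starting point:

    #include <chrono>
    #include <cstdio>
    #include <memory>

    struct Counter     { virtual ~Counter() = default; virtual long step(long x) const { return x + 1; } };
    struct FastCounter { long step(long x) const { return x + 1; } };  // non-virtual baseline

    int main() {
        const long iterations = 100'000'000;
        std::unique_ptr<Counter> virt = std::make_unique<Counter>();
        FastCounter direct;

        auto t0 = std::chrono::steady_clock::now();
        long a = 0;
        for (long i = 0; i < iterations; ++i) a = virt->step(a);   // indirect call
        auto t1 = std::chrono::steady_clock::now();
        long b = 0;
        for (long i = 0; i < iterations; ++i) b = direct.step(b);  // direct (likely inlined) call
        auto t2 = std::chrono::steady_clock::now();

        std::printf("virtual: %lld ms, direct: %lld ms (a=%ld b=%ld)\n",
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(),
                    a, b);
    }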
Edit:
As Hans Passant notes in the comments above about modern processors' optimizations for indirect branches, the key to these two taking the same amount of time is the ability to effectively "retire" more than one instruction per cycle. Instruction elimination could help with this, but superscalar design is probably more important ("hit under miss" is a very small and specific example; fully redundant load units might be a better one).
Let's take an idealized situation and assume that a direct branch is just one instruction:
    branch dest
... and an indirect branch is three (maybe you can manage it in two, but it's always more than one):
    load vtable from this
    load dest from vtable
    branch dest
Suppose the situation is absolutely perfect: this and the entire vtable are in L1 cache, and L1 is fast enough to support an amortized one-cycle-per-instruction cost for the two loads. (You can even assume the processor reordered the loads and intermixed them with earlier instructions so they have time to complete before the branch; it doesn't matter for this example.) Also assume the branch target cache is hot, there's no pipeline flush cost for the branch, and the branch instruction comes down to a single cycle (amortized).
The theoretical minimum time for the first example is therefore 1 cycle (amortized).
The theoretical minimum for the second example, absent instruction elimination or redundant functional units or anything else that lets you retire more than one instruction per cycle, is 3 cycles (there are 3 instructions)!
The indirect branch will always be slower, because there are more instructions, until you reach something like a superscalar design that lets you retire more than one instruction per cycle.
Once you have that, the minimum for both examples becomes something between 0 and 1 cycles, again provided everything else is perfect. Arguably you need more ideal circumstances for the second example to actually reach that theoretical minimum than for the first, but it's now possible.
In some of the cases you'd actually care about, you probably won't hit that minimum for either example: either the branch target cache will be cold, or the vtable won't be in the data cache, or the machine won't be able to reorder the instructions to take full advantage of the redundant functional units.
... and this is where profiling, which is usually a good idea anyway, comes in.
First of all, I'd say go ahead and be a little paranoid about this. Check out Noel Llopis's article on data oriented design, the excellent Pitfalls of Object Oriented Programming slides, and Mike Acton's grumpy-but-educational presentations. Now you will suddenly have moved into patterns that the processor is already likely to be happy with, if you're processing a lot of data.
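In case it helps, here's the kind of shift those references argue for, sketched with made-up types: instead of making a virtual call per object, keep the hot data in plain contiguous arrays and process it in a tight loop, saving virtual dispatch for the places where you genuinely need the flexibility.

    #include <cstddef>
    #include <vector>

    // Object-oriented style: one indirect call (plus pointer chasing) per element.
    struct Particle { virtual ~Particle() = default; virtual void integrate(float dt) = 0; };

    // Data-oriented style: the hot loop touches contiguous data and direct code only.
    struct ParticleData { std::vector<float> x, vx; };

    void integrate_all(ParticleData& p, float dt) {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += p.vx[i] * dt;   // no indirect branches, cache-friendly accesses
    }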
High-level language features like virtual are usually a trade-off between expressiveness and control. I honestly think, though, that simply by increasing your awareness of what virtual actually does (don't be afraid to read the disassembly from time to time, and definitely peek at your processor's architecture manuals), you'll tend to use it when it makes sense and not when it doesn't, and a profiler can cover the rest if needed.
One-size-fits-all statements like "don't use virtual" or "virtual dispatch is unlikely to make a measurable difference" make me grumpy. The reality is usually more complicated: either you're in a situation where you care enough to profile or to avoid it, or you're in the other 95% where you probably shouldn't care except for the educational value.