First of all, it's likely that some of the counts that really belong to divss are being charged to later instructions; this is called a "skid". (Also see the rest of that comment thread for more details.) Xcode is presumably similar to Linux perf and uses the cpu_clk_unhalted.thread fixed counter for cycles instead of one of the programmable counters. This is not a "precise" (PEBS) event, so skids are possible. As @BeeOnRope points out, you can use a PEBS event that ticks once per cycle (such as UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.
But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop, so we can't simulate the code on a simple pipeline model the way IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified which microarchitecture you have. I'd guess it's some member of the Intel Sandybridge family, on a Mac.)
Counts for cycles are usually charged to the instruction that is waiting for a result, not usually to the instruction that is slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.
Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, such as a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instructions in flight, but only one RIP can be associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.
So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. The Sandybridge family is similar to P6 in this respect.
I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that any uops that haven't yet been dispatched to an execution unit get discarded, but ALU uops that have already been dispatched go through the pipeline to retirement (assuming there are no younger uops that got discarded) instead of being aborted. That makes sense, because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store buffer can easily take longer than that. (Pending stores that have already retired can't be rolled back.) IDK about loads/stores that haven't retired; at least the loads are probably discarded.
I'm basing this guess on the fact that it's easy to construct loops that show no counts for divss even when the CPU is sometimes waiting for it to produce its outputs. If it were discarded without retiring, it would be the next instruction when resuming from the interrupt, so (skids aside) you'd see lots of counts for it.
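For example, here's the kind of loop I mean: a made-up sketch in GAS / AT&T syntax, not code from the question. The function name, the trip count in %edi, and the register choices are all invented for illustration.

```
# The divss result (%xmm0) is only read by the addss at the bottom, so under
# the mental model above, cycles samples pile up on the addss (the waiting
# instruction), not on the divss itself.  The caller is assumed to initialize
# %edi (trip count) and the XMM registers.
.globl div_latency_demo
div_latency_demo:
.Lloop:
    divss   %xmm2, %xmm0        # slow to produce %xmm0, but only a single uop
    mulss   %xmm3, %xmm4        # independent work that overlaps with the divide
    mulss   %xmm3, %xmm6
    mulss   %xmm3, %xmm7
    addss   %xmm0, %xmm5        # first reader of the divss result: waits here
    dec     %edi
    jnz     .Lloop
    ret
```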
Thus, the distribution of cycles counts shows you which instructions spend the most time as the oldest not-yet-dispatched instruction in the scheduler. (Or, in the case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue.) Remember that this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.
(Hmm, this might not be quite right, and I haven't tested it much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss have higher latency than andps, so you'd expect andps to get the counts while waiting for its xmm5 input, if my proposed model were right.)
Anyway, the general problem is: with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?
Note that divss is slow to produce its result, but it's only a single-uop instruction (unlike integer div, which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's no slower than mulss, because it can overlap with the surrounding code just as well.
(divss / divps is not fully pipelined. On Haswell, for example, an independent divps can start every 7 cycles, but each one takes only 10-13 cycles to produce its result. All other execution units are fully pipelined, able to start a new operation on independent data every cycle.)
Consider a big loop that bottlenecks on throughput rather than on the latency of any loop-carried dependency, and that only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss by the reciprocal constant should make (nearly) no difference in performance. (In practice, out-of-order scheduling isn't perfect, and longer dependency chains hurt somewhat even when they aren't loop-carried, because more instructions have to be kept in flight to hide all that latency and sustain maximum throughput, i.e. for the out-of-order core to find the instruction-level parallelism.)
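To make that concrete, here's a hedged sketch of such a loop (again GAS / AT&T syntax; the function name, registers, and the amount of surrounding work are invented, and a real loop in the spirit of the example would have ~20 FP instructions per divide). The interesting comparison is just the one substitution on the divide line; everything else stays the same.

```
.globl div_vs_recip_demo
div_vs_recip_demo:
.Lbody:
    movaps  %xmm13, %xmm0       # refresh the dividend so the divide isn't loop-carried
    divss   %xmm15, %xmm0       # alternative: mulss %xmm14, %xmm0, with %xmm14 = 1/constant
    addss   %xmm1, %xmm2        # independent FP work that overlaps with the divide
    mulss   %xmm3, %xmm4        #   (imagine ~20 such instructions per divide, as in
    addss   %xmm5, %xmm6        #    the scenario described above)
    mulss   %xmm7, %xmm8
    dec     %edi
    jnz     .Lbody
    ret
```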
Anyway, the point here is that divss is a single uop, and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.
You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in its addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.
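A tiny sketch of that (made-up registers and instructions, not taken from the question):

```
    mov     (%rdi), %eax        # load: may miss in cache, but doesn't stall by itself
    imul    %ecx, %edx          # independent work keeps executing under the miss
    add     %eax, %ebx          # first use of the loaded data: this is where the
                                # cycles samples typically land, unless the load had
                                # to wait for %rdi itself to be ready
```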
What your profile result might be telling us:
divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes a few cycles, but the divss never does.)
We may be coming close to bottlenecking on the throughput of divss.
The dependency chain involving xmm5 after the divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of it in flight at once.
The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake, where divss throughput is one per 3 cycles but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)
High counts for movaps might be due to it coming right after maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce its result. But if that loop-carried dep chain really were the major bottleneck, you'd expect to see lots of counts on maxss itself, since it would be waiting for its input from the previous iteration. (The sketch below makes this concrete.)
But maybe mov-elimination is "special", and for some reason all the counts get charged to movaps? On Ivybridge and later CPUs, register copies don't need an execution unit, and are instead handled at the issue/rename stage of the pipeline.
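To tie the last few points together, here is one guess at the shape of that part of the loop. The full loop isn't shown in the question, so the ordering, the divisor register, and the destination of the final copy are invented; it's only meant to show where samples could land under the model above.

```
    movaps  %xmm3, %xmm5        # register copy: eliminated at issue/rename on IvB and later
    divss   %xmm1, %xmm5        # inputs ready, single uop: rarely gets cycle samples
    # ... rest of the loop body (not shown in the question) ...
    maxss   %xmm5, %xmm4        # running max: loop-carried through %xmm4, 3-4 cycle latency
    movaps  %xmm4, %xmm2        # first reader of the maxss result; if the samples pile up
                                # here, maxss is plausibly the slow producer, even though an
                                # eliminated copy needs no execution unit
```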