First of all, follow Dietrich Epp's first tip: measuring is (at least for complex optimization problems) the only way to be sure.
Now, if you want to understand why one is faster than the other, we can try. There are two important performance measures: latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs, and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
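To make the distinction concrete, here is a minimal C sketch of the two regimes (my illustration, not from the original answer; the function names are made up). The ~1 versus ~0.33 cycles-per-add figures assume a Sandy-Bridge-like core, and an optimizing compiler may rewrite these loops entirely (e.g. collapse the first into a single multiply), so inspect the generated assembly before trusting any timing:

```c
#include <stdint.h>

/* Latency-bound: every add consumes the result of the previous one,
 * so the chain advances at ~1 cycle per add (the add latency). */
uint64_t dependent_adds(uint64_t x, long n) {
    uint64_t acc = 0;
    for (long i = 0; i < n; i++)
        acc += x;                  /* acc feeds the next add */
    return acc;
}

/* Throughput-bound: four independent chains can overlap in the
 * execution units, approaching ~0.33 cycles per add (the reciprocal
 * throughput). Assumes n is a multiple of 4 for brevity. */
uint64_t independent_adds(uint64_t x, long n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < n; i += 4) {
        a += x;                    /* no dependency between a, b,   */
        b += x;                    /* c and d, so these adds can    */
        c += x;                    /* issue in parallel             */
        d += x;
    }
    return a + b + c + d;
}
```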
For Sandy Bridge, the rec. throughput for an add r, r/i (notation: r = register, i = immediate, m = memory) is 0.33, while the latency is 1; in other words, the core can sustain roughly three independent adds per cycle.

An imul r, r has a latency of 3 and a rec. throughput of 1.
So, as you can see, it completely depends on your particular algorithm: if you can replace one imul with two independent adds, that part of your algorithm can get a theoretical speedup of 50% (and in the best case, obviously, a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul can be just as fast as one add.
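For the curious, those percentages work out from the numbers above: two adds cost 2 x 0.33 ≈ 0.67 cycles of throughput versus 1 cycle for the imul, about 50% faster; measured against the imul's 3-cycle latency, that is roughly 4.5x, i.e. ~350%. One classic place where the trade appears is strength reduction of a loop induction variable. The sketch below is my illustration (not from the answer), and modern compilers typically perform this rewrite themselves at -O2, so, again, measure:

```c
#include <stddef.h>

/* One imul per iteration (latency 3, rec. throughput 1). */
void scale_mul(int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i * 3;
}

/* Strength-reduced: the imul becomes one add on an induction variable
 * (latency 1, rec. throughput 0.33). Note the add is loop-carried, so
 * it runs at the 1-cycle add latency rather than the 0.33-cycle
 * throughput -- a mild form of the "problematic dependency" above. */
void scale_add(int *out, size_t n) {
    int v = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = v;
        v += 3;                    /* depends on the previous iteration */
    }
}
```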
Also note that we have ignored all the additional complications, such as memory and cache behavior (things that tend to have a much bigger impact on runtime) or intricate stuff like µop fusion and whatnot. In general, the only people who need to care about this are compiler writers - it's much easier to just measure the result of your efforts ;)
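In that spirit, here is a bare-bones timing harness, a sketch of mine rather than anything prescribed by the answer, using POSIX clock_gettime (on x86 you could also use __rdtsc from <x86intrin.h>). Swap the loop body for whichever variants you want to compare:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Wall-clock nanoseconds from the monotonic POSIX clock. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

int main(void) {
    volatile uint64_t sink = 0;    /* keeps the work observable       */
    const long iters = 100000000;  /* enough iterations to dwarf the  */
                                   /* timer overhead                  */
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++)
        sink += (uint64_t)i;       /* variant under test goes here */
    uint64_t t1 = now_ns();

    printf("%.2f ns/iter (sink=%llu)\n",
           (double)(t1 - t0) / (double)iters,
           (unsigned long long)sink);
    return 0;
}
```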
Anyway, if you want a good list of this material, see here (the above description of latency/rec. throughput is taken from that particular document).