First of all, follow Dietrich Epp's first tip: measuring is (at least for complex optimization problems) the only way to be sure.
Now, if you want to understand why one is faster than the other, we can try. There are two important performance measures: latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs, and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
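To make the distinction concrete, here is a minimal C sketch of the two regimes (my illustration, not from the original answer; the function names are made up). The ~1 versus ~0.33 cycles-per-add figures assume a Sandy-Bridge-like core, and an optimizing compiler may rewrite these loops entirely (e.g. collapse the first into a single multiply), so inspect the generated assembly before trusting any timing:

```c
#include <stdint.h>

/* Latency-bound: every add consumes the result of the previous one,
 * so the chain advances at ~1 cycle per add (the add latency). */
uint64_t dependent_adds(uint64_t x, long n) {
    uint64_t acc = 0;
    for (long i = 0; i < n; i++)
        acc += x;                  /* acc feeds the next add */
    return acc;
}

/* Throughput-bound: four independent chains can overlap in the
 * execution units, approaching ~0.33 cycles per add (the reciprocal
 * throughput). Assumes n is a multiple of 4 for brevity. */
uint64_t independent_adds(uint64_t x, long n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < n; i += 4) {
        a += x;                    /* no dependency between a, b,   */
        b += x;                    /* c and d, so these adds can    */
        c += x;                    /* issue in parallel             */
        d += x;
    }
    return a + b + c + d;
}
```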
For Sandy Bridge, the rec. throughput for an add r, r/i (notation: r = register, i = immediate, m = memory) is 0.33, while the latency is 1; in other words, the core can sustain roughly three independent adds per cycle.

An imul r, r has a latency of 3 and a rec. throughput of 1.
So, as you can see, it completely depends on your particular algorithm: if you can replace one imul with two independent adds, that part of your algorithm can get a theoretical speedup of 50% (and in the best case, obviously, a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul can be just as fast as one add.
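For the curious, those percentages work out from the numbers above: two adds cost 2 x 0.33 ≈ 0.67 cycles of throughput versus 1 cycle for the imul, about 50% faster; measured against the imul's 3-cycle latency, that is roughly 4.5x, i.e. ~350%. One classic place where the trade appears is strength reduction of a loop induction variable. The sketch below is my illustration (not from the answer), and modern compilers typically perform this rewrite themselves at -O2, so, again, measure:

```c
#include <stddef.h>

/* One imul per iteration (latency 3, rec. throughput 1). */
void scale_mul(int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i * 3;
}

/* Strength-reduced: the imul becomes one add on an induction variable
 * (latency 1, rec. throughput 0.33). Note the add is loop-carried, so
 * it runs at the 1-cycle add latency rather than the 0.33-cycle
 * throughput -- a mild form of the "problematic dependency" above. */
void scale_add(int *out, size_t n) {
    int v = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = v;
        v += 3;                    /* depends on the previous iteration */
    }
}
```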
Also note that we have ignored all the additional complications, such as memory and cache behavior (things that tend to have a much bigger impact on runtime) or intricate stuff like µop fusion and whatnot. In general, the only people who need to care about this are compiler writers - it's much easier to just measure the result of your efforts ;)
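In that spirit, here is a bare-bones timing harness, a sketch of mine rather than anything prescribed by the answer, using POSIX clock_gettime (on x86 you could also use __rdtsc from <x86intrin.h>). Swap the loop body for whichever variants you want to compare:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Wall-clock nanoseconds from the monotonic POSIX clock. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

int main(void) {
    volatile uint64_t sink = 0;    /* keeps the work observable       */
    const long iters = 100000000;  /* enough iterations to dwarf the  */
                                   /* timer overhead                  */
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++)
        sink += (uint64_t)i;       /* variant under test goes here */
    uint64_t t1 = now_ns();

    printf("%.2f ns/iter (sink=%llu)\n",
           (double)(t1 - t0) / (double)iters,
           (unsigned long long)sink);
    return 0;
}
```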
Anyway, if you want a good list of this material, see here (the above description of latency/rec. throughput is taken from that particular document).