Efficiency: C vs. Assembler

How much faster is the following assembler code:

shl ax, 1 

compared to the following C code:

 num = num * 2; 

How can I find out?

+4
9 answers

The assembly version may be faster, or it may be slower. What made you think it would necessarily be faster?

On the x86 platform, there are many ways to multiply something by 2. I would expect the compiler to emit add ax, ax , which is intuitively more efficient than your shl because it does not involve encoding an immediate constant (the "1" in your case).

Also, for quite a long time on the x86 platform the preferred way to multiply by small constants has been not a shift but a lea instruction (when possible). In the example above that would be lea eax, [eax*2] . (Multiplication by 3 would be done with lea eax, [eax*2+eax] .)

The belief that shift operations are somehow "faster" is an old story for beginners that has little relevance today. And, as usual, your compiler (if reasonably up to date) knows the underlying hardware platform much better than people with a naive love for shift operations.

+25

Is this, by any chance, an academic question? I assume you understand that this falls into the general category of "getting a haircut to lose weight."

+8

If you are using GCC , ask it to show the generated assembly with the -S option. You can then compare it against your assembler instruction.

To answer the original question: on an out-of-order processor, instruction speed is characterized by throughput and latency, and you would measure both with rdtsc . But someone has already done this for you for many processors, so you don't need to worry. Pdf

+5

In most cases it will make no difference. Multiplication is fast on almost all modern hardware. In particular, it is usually fast enough that, unless you have carefully hand-tuned code, the pipeline will hide all the latency and you will see no timing difference between the two cases.

You may be able to measure a performance difference between multiplications and shifts when you execute them in isolation, but in the context of the rest of your compiled code there will usually be no difference. (As noted, this may not hold if the code is carefully optimized.)

That said, shifts are still generally no slower than multiplications, and almost any reasonable compiler will turn a multiplication by a fixed power of two into a shift anyway (assuming the semantics are actually equivalent on the target architecture).

Edit: another thing you can try, if you really care about this, is x+x . I know of at least one architecture on which it can be faster than a shift, depending on the surrounding context.

+4

If you have a decent compiler, it will generate the same or similar code. The best way to find out is to disassemble and inspect the generated code.

+3

The answer depends, as you can see here, on many things. What the compiler does with your C code depends on many things. If we are talking about x86-32, the following should be generally applicable.

At the most basic level, your C code refers to a variable in memory, so multiplying it by two requires at least one instruction, shl mem, 1 , and in such a simple case the C code will be slower.

If num is a local variable, the compiler may decide to keep it in a register (if it is used often enough and/or the function is small enough), and then you will get shl reg, 1 - maybe.

Which instruction is fastest depends on how they are implemented in the processor. Shl may not be the best choice because it affects the C and Z flags, which can slow it down. Some years ago the recommendation was lea reg, [reg+reg] (source and destination registers the same), because lea does not affect any flags, and there are variants like the following (using the eax register on the x86-32 platform as an example):

 lea eax,[eax+eax]   ; *2
 lea eax,[eax+eax*2] ; *3
 lea eax,[eax+eax*4] ; *5
 lea eax,[eax+eax*8] ; *9

I don't know what the recommendation is today, but your compiler probably does.

As for measuring, look into the rdtsc instruction, which is the best option here since it counts actual clock cycles.

+3

Put each of them in a loop with a counter high enough that it runs for at least a second in the fastest case. Then use your favorite timing mechanism to see how long each takes.

The assembly version should be done with inline assembly in the same C program as the pure C test. Otherwise you are not comparing apples to apples.

By the way, I think you should add a third test:

 num <<= 1; 

It is worth checking whether this generates the same code as the assembly version.

+1

If a left shift is the fastest way to multiply a number by two on your target platform, then chances are your compiler will do exactly that when compiling the code. Look at the disassembly to check.

So for this single line it is probably exactly the same speed. However, since you are unlikely to have a function containing only this one line, you may well find that the compiler defers the shift until the value is used, or otherwise merges it with the surrounding code, making it harder to spot. A good optimizing compiler generally does well against mediocre hand-written assembly.

+1

You would expect a modern compiler (VC9) to do a really good job and outperform VC6 by a wide margin, but that does not always happen; I even prefer to use VC6 for some code, which runs faster than the same code compiled with MinGW -O3 or with VC9 /Ox.

0
