What is the meaning of the data32 data32 nopw %cs:0x0(%rax,%rax,1) instruction in gcc inline asm?

Question

While running some tests with gcc's -O2 optimization, I observed the following instruction in the disassembled code for a function:

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

What does this instruction do?

In more detail, I was trying to understand how the compiler optimizes useless recursion, such as the code below, with -O2:

    int foo(void)
    {
        return foo();
    }

    int main(void)
    {
        return foo();
    }

The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.

I think that with -O2 the compiler completely removed the stack push for the foo function, but why is the data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) needed?

    0000000000400480 <foo>:
    foo():
      400480:  eb fe                   jmp    400480 <foo>
      400482:  66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
      400489:  1f 84 00 00 00 00 00

    0000000000400490 <main>:
    main():
      400490:  eb fe                   jmp    400490 <main>

+7

optimization c assembly gcc x86

cmidi Apr 25 '15 at 23:36


3 answers

To answer the question in the title, the instruction

 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

is a 14-byte NOP (no operation) instruction that pads the gap between the foo function and the main function, so that main stays 16-byte aligned.
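The 14-byte figure can be checked from the addresses in the question's disassembly. The following is a minimal sketch; `padding_to_align` and `nop14` are hypothetical names introduced here for illustration, not anything GCC exposes. The byte values are copied from the objdump output above, and the field breakdown in the comments is my reading of the x86 encoding.

```c
#include <stdint.h>

/* The padding instruction from the question, byte for byte:
 *   66 66 66 66 66   operand-size prefixes (objdump prints the redundant
 *                    ones as "data32"; the last one makes it "nopw")
 *   2e               CS segment override (the "%cs:" part)
 *   0f 1f            two-byte NOP opcode
 *   84               ModRM: SIB byte follows, 32-bit displacement
 *   00               SIB: base = rax, index = rax, scale = 1
 *   00 00 00 00      displacement 0x0
 */
static const unsigned char nop14[] = {
    0x66, 0x66, 0x66, 0x66, 0x66, 0x2e, 0x0f,
    0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Number of padding bytes needed to move `addr` up to the next multiple
 * of `align`. Assumes `align` is a power of two. Hypothetical helper. */
static uint64_t padding_to_align(uint64_t addr, uint64_t align)
{
    return (align - (addr & (align - 1))) & (align - 1);
}
```

With the numbers from the question, the `jmp` at 0x400480 is 2 bytes long, so the padding starts at 0x400482 and `padding_to_align(0x400482, 16)` gives the 14 bytes that `sizeof nop14` also reports.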

The x86 architecture has a large number of different NOP instructions of various sizes that can be used to insert padding into an executable segment, such that they have no effect if the CPU ends up executing over them. The Intel optimization manual has information on the recommended NOP encodings for different lengths that can be used as padding.
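To make the idea of length-specific NOP encodings concrete, here is a rough sketch. The table lists the 1 to 9 byte sequences as I know them from Intel's recommended multi-byte NOP table; `nop_table` and `fill_nops` are hypothetical names, and the greedy strategy is an assumption for illustration only. GCC and the assembler pick their own sequences, as the single 14-byte instruction in the question shows.

```c
#include <stddef.h>
#include <string.h>

/* Recommended NOP encodings for lengths 1..9 (row i holds the i+1 byte form).
 * Unused trailing bytes in each row are zero and are never copied. */
static const unsigned char nop_table[9][9] = {
    {0x90},
    {0x66, 0x90},
    {0x0f, 0x1f, 0x00},
    {0x0f, 0x1f, 0x40, 0x00},
    {0x0f, 0x1f, 0x44, 0x00, 0x00},
    {0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00},
    {0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}
};

/* Fill `len` bytes of `buf` with NOPs, always using the longest form that
 * still fits. Returns the number of NOP instructions written.
 * Hypothetical helper, not how any particular toolchain does it. */
static size_t fill_nops(unsigned char *buf, size_t len)
{
    size_t count = 0;
    while (len > 0) {
        size_t n = len > 9 ? 9 : len;      /* longest form that fits */
        memcpy(buf, nop_table[n - 1], n);
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

For a 14-byte gap this sketch would emit a 9-byte NOP followed by a 5-byte NOP, whereas the toolchain in the question emitted one long NOP with redundant prefixes. Either way the bytes are harmless if the CPU ever executes over them.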

In this particular case it does not matter at all, since the NOP will never be executed (or even decoded, as it comes after an unconditional jump), so the compiler could have padded with any random garbage it wanted.

+6

Chris Dodd Apr 26 '15 at 1:57


The function foo() is an infinite recursion without termination. Without optimization, gcc generates normal subroutine calls, which involve pushing at least the return address onto the stack. As the stack is limited, this will cause a stack overflow, which is undefined behaviour.

With optimization, gcc detects that foo() does not need a stack frame at all (it has no arguments or local variables). It also detects that foo() immediately returns to its caller (which would also be foo()). This is called tail-chaining (more commonly known as tail-call optimization): a function call right at the end of a function (that is, directly before an explicit or implicit return) is converted into a jump to that function, so no stack space is needed.
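The transformation can be illustrated with a function that, unlike foo(), actually terminates. This is a sketch under my own assumptions: `sum_to_recursive` and `sum_to_loop` are hypothetical names, and the loop version only shows the shape of code the optimizer may effectively produce. Whether a given call is really turned into a jump depends on the compiler and its flags.

```c
/* Tail-recursive form: the recursive call is the very last thing the
 * function does, so nothing in the current frame is needed afterwards. */
static unsigned long long sum_to_recursive(unsigned long long n,
                                           unsigned long long acc)
{
    if (n == 0)
        return acc;
    return sum_to_recursive(n - 1, acc + n);   /* call in tail position */
}

/* Roughly what the optimized code does: the call is replaced by updating
 * the arguments and jumping back to the top, so the stack never grows. */
static unsigned long long sum_to_loop(unsigned long long n,
                                      unsigned long long acc)
{
    for (;;) {
        if (n == 0)
            return acc;
        acc += n;
        n -= 1;
    }
}
```

Both functions compute the same sum; only the first one can exhaust the stack for large `n` when built without optimization. The foo() from the question is the degenerate case: with no termination condition, the "loop" that remains is just a `jmp` to itself.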

This is still undefined behaviour, but this time nothing "bad" is observed.

Just remember: undefined behaviour includes fatal behaviour as well as the expected behaviour (but then only by chance). Code that behaves differently at different optimization levels should always be regarded as erroneous. There is one exception: timing. That is not covered by the C language standard (nor by most other languages).

As others have pointed out, the data32 ... instruction is quite evidently padding to reach 16-byte alignment, which might be the width of the internal instruction bus and/or of the cache lines.

+3

Olaf Apr 26 '15 at 1:00


peterh · Accepted Answer · 2015-04-25T23:59:32+0000

What you are seeing is an operand forwarding optimization of the CPU pipeline.

Even though this is an empty loop, gcc tries to optimize it as well :-).

The CPU you are running on has a superscalar architecture. This means that it has a pipeline, and different phases of the execution of consecutive instructions happen in parallel. For example, if there is

    mov eax, ebx ;(#1)
    mov ecx, edx ;(#2)

then the fetching and decoding of instruction #2 can already happen while #1 is being executed.

Pipelining has serious problems to solve in the case of branches, even if they are unconditional.

For example, while the jmp is being decoded, the next instruction has already been prefetched into the pipeline. But the jmp changes the location of the next instruction. In such cases the pipeline has to be emptied and refilled, and many valuable CPU cycles are lost.

It looks like this empty loop runs faster if the pipeline is filled with a no-op in this case, even though the no-op will never be executed. It is actually an optimization for some uncommon feature of the x86 pipeline.

Earlier DEC Alphas could even segfault from such things, and empty loops had to contain many no-ops. x86 will only be slower. This is because x86 processors must remain compatible with the Intel 8086.

Here you can read a lot about the handling of branch instructions in pipelines.
