Does the restrict keyword provide significant benefits in gcc / g++?

Has anyone seen any numbers or analysis on whether using the C/C++ restrict keyword in gcc/g++ provides a significant performance boost in practice (and not just in theory)?

I have read various articles recommending or disparaging its use, but I have not come across any real numbers that demonstrate either side's arguments in practice.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers, and I have read a paper by Christer Ericson that strongly recommends using it.

+41
c++ c gcc g++ restrict-qualifier
Dec 27 '09 at 8:04
5 answers

The restrict keyword does make a difference.

In some situations I have seen improvements of a factor of 2 or more (image processing). Most of the time the difference is not that large, though: about 10%.

Here is a small example that illustrates the difference. As a test, I wrote a very basic 4x4 vector * matrix transform. Note that I have to force the function not to be inlined. Otherwise, GCC detects that there are no aliasing pointers in my benchmark code, and restrict would make no difference because of inlining.

I could also have moved the transform function to a different file.

    #include <math.h>

    #ifdef USE_RESTRICT
    #else
    #define __restrict
    #endif

    void transform (float * __restrict dest, float * __restrict src,
                    float * __restrict matrix, int n) __attribute__ ((noinline));

    void transform (float * __restrict dest, float * __restrict src,
                    float * __restrict matrix, int n)
    {
      int i;

      // simple transform loop.

      // written with aliasing in mind. dest, src and matrix
      // are potentially aliasing, so the compiler is forced to reload
      // the values of matrix and src for each iteration.

      for (i=0; i<n; i++)
      {
        dest[0] = src[0] * matrix[0] + src[1] * matrix[1] +
                  src[2] * matrix[2] + src[3] * matrix[3];

        dest[1] = src[0] * matrix[4] + src[1] * matrix[5] +
                  src[2] * matrix[6] + src[3] * matrix[7];

        dest[2] = src[0] * matrix[8] + src[1] * matrix[9] +
                  src[2] * matrix[10] + src[3] * matrix[11];

        dest[3] = src[0] * matrix[12] + src[1] * matrix[13] +
                  src[2] * matrix[14] + src[3] * matrix[15];

        src  += 4;
        dest += 4;
      }
    }

    float srcdata[4*10000];
    float dstdata[4*10000];

    int main (int argc, char**args)
    {
      int i,j;
      float matrix[16];

      // init all source-data, so we don't get NANs
      for (i=0; i<16; i++)      matrix[i] = 1;
      for (i=0; i<4*10000; i++) srcdata[i] = i;

      // do a bunch of tests for benchmarking.
      for (j=0; j<10000; j++)
        transform (dstdata, srcdata, matrix, 10000);
    }



Results: (on my 2 GHz Core Duo)

    nils@doofnase:~$ gcc -O3 test.c
    nils@doofnase:~$ time ./a.out

    real    0m2.517s
    user    0m2.516s
    sys     0m0.004s

    nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c
    nils@doofnase:~$ time ./a.out

    real    0m2.034s
    user    0m2.028s
    sys     0m0.000s

As a rule of thumb, execution is about 20% faster on that system.

To show how architecture-dependent this is, I ran the same code on a Cortex-A8 embedded CPU (I adjusted the loop count a bit because I did not want to wait that long):

    root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
    root@beagleboard:~# time ./a.out

    real    0m 7.64s
    user    0m 7.62s
    sys     0m 0.00s

    root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c
    root@beagleboard:~# time ./a.out

    real    0m 7.00s
    user    0m 6.98s
    sys     0m 0.00s

Here the difference is only 9% (the same compiler).

+43
Dec 27 '09 at 18:31

The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf), which says that it does not help at all and provides measurements to back this up.

+6
May 15 '11 at 20:18

Does the restrict keyword provide significant benefits in gcc / g++?

It can reduce the number of instructions, as shown in the example below, so use it whenever possible.

GCC 4.8 Linux x86-64 example

Input:

    void f(int *a, int *b, int *x) {
        *a += *x;
        *b += *x;
    }

    void fr(int *restrict a, int *restrict b, int *restrict x) {
        *a += *x;
        *b += *x;
    }

Compile and disassemble:

    gcc -g -std=c99 -O0 -c main.c
    objdump -S main.o

With -O0, the two functions are the same.

With -O3:

    void f(int *a, int *b, int *x) {
        *a += *x;
       0:   8b 02                   mov    (%rdx),%eax
       2:   01 07                   add    %eax,(%rdi)
        *b += *x;
       4:   8b 02                   mov    (%rdx),%eax
       6:   01 06                   add    %eax,(%rsi)

    void fr(int *restrict a, int *restrict b, int *restrict x) {
        *a += *x;
      10:   8b 02                   mov    (%rdx),%eax
      12:   01 07                   add    %eax,(%rdi)
        *b += *x;
      14:   01 06                   add    %eax,(%rsi)

For the uninitiated, the calling convention is:

  • rdi = first parameter
  • rsi = second parameter
  • rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why was GCC able to optimize this?

The code above was taken from the Wikipedia example, which is very illuminating.

Pseudo assembly for f:

    load R1 ← *x    ; Load the value of x pointer
    load R2 ← *a    ; Load the value of a pointer
    add R2 += R1    ; Perform Addition
    set R2 → *a     ; Update the value of a pointer
    ; Similarly for b, note that x is loaded twice,
    ; because a may be equal to x.
    load R1 ← *x
    load R2 ← *b
    add R2 += R1
    set R2 → *b

For fr:

    load R1 ← *x
    load R2 ← *a
    add R2 += R1
    set R2 → *a
    ; Note that x is not reloaded,
    ; because the compiler knows it is unchanged
    ; load R1 ← *x
    load R2 ← *b
    add R2 += R1
    set R2 → *b
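To make the reload concrete, here is a minimal sketch of what happens when the pointers passed to f really do alias. The function f has the same body as in the example above; the helpers aliased_result and distinct_result are hypothetical names added only for this illustration. fr is deliberately not called with aliasing pointers, since that would be undefined behavior.

```c
/* Same body as f above: no restrict, so a, b and x may point to the same int. */
void f(int *a, int *b, int *x)
{
    *a += *x;
    *b += *x;   /* *x has to be read again: the store to *a may have changed it */
}

/* Hypothetical helper: calls f with x aliasing a. */
int aliased_result(void)
{
    int a = 1, b = 10;
    f(&a, &b, &a);   /* first statement doubles a to 2, then b becomes 10 + 2 */
    return b;
}

/* Hypothetical helper: calls f with three distinct objects. */
int distinct_result(void)
{
    int a = 1, b = 10, x = 1;
    f(&a, &b, &x);   /* x stays 1, so b becomes 10 + 1 */
    return b;
}
```

With aliasing, the second addition sees the updated value, so a compiler that reused the first load of *x would produce a different result. That is why the extra mov is required in f and can be dropped in fr, where restrict promises the caller never passes overlapping pointers.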

Is this really faster?

Ermmm ... not for this simple test:

    .text
        .global _start
        _start:
            mov $0x10000000, %rbx
            mov $x, %rdx
            mov $x, %rdi
            mov $x, %rsi
    loop:
            # START of interesting block
            mov (%rdx),%eax
            add %eax,(%rdi)
            mov (%rdx),%eax # Comment out this line.
            add %eax,(%rsi)
            # END ------------------------
            dec %rbx
            cmp $0, %rbx
            jnz loop
            mov $60, %rax
            mov $0, %rdi
            syscall
    .data
        x:
            .int 0

And then:

    as -o a.o a.S && ld a.o && time ./a.out

on Ubuntu 14.04 AMD64 with an Intel i5-3210M CPU.

I admit that I still do not understand modern CPUs. Let me know if you:

  • found a flaw in my method
  • found an assembler test case where it becomes much faster
  • understand why there was no difference
+4
Jun 14 '15 at 10:43

I tested this C program. Without restrict it took 12.640 seconds to complete, with restrict 12.516. It looks like it can save some time.

0
Dec 27 '09 at 9:33

Note that C++ compilers that accept the restrict keyword may still ignore it. For example, see here.
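Because support differs between C and C++ and between compilers, one common approach is to hide the qualifier behind a macro. The sketch below is only an illustration: RESTRICT and add_arrays are hypothetical names, and the compiler checks assume that GCC and Clang accept __restrict__ in C++ and that MSVC accepts __restrict. Where none of these apply, the macro expands to nothing and the code still compiles, just without the aliasing promise.

```c
/* RESTRICT is a hypothetical macro name chosen for this sketch. */
#if defined(__cplusplus)
#  if defined(__GNUC__) || defined(__clang__)
#    define RESTRICT __restrict__   /* GNU extension in C++ */
#  elif defined(_MSC_VER)
#    define RESTRICT __restrict     /* MSVC spelling */
#  else
#    define RESTRICT                /* unknown C++ compiler: no qualifier */
#  endif
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define RESTRICT restrict         /* standard keyword since C99 */
#else
#  define RESTRICT                  /* pre-C99: no qualifier */
#endif

/* Hypothetical example function: adds src[i] to dst[i] for n elements.
   The caller promises that dst and src do not overlap. */
void add_arrays(int *RESTRICT dst, const int *RESTRICT src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += src[i];
}
```

Whether the compiler actually uses the hint is still up to the compiler, so comparing the generated assembly with and without the macro defined is the only way to be sure.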

0
Dec 27 '09 at 9:42


