C ++ Performance Problem - Array Search

I have two versions of searching an int array for a specific value.

The first version is straightforward

int FindWithoutBlock(int * Arr, int ArrLen, int Val) { for ( int i = 0; i < ArrLen; i++ ) if ( Arr[i] == Val ) return i; return ArrLen; } 

The second version should be faster. The past array must be one element larger than in the previous case. Say for an array with 5 values, you select six integers, and then do the following

 int FindWithBlock(int * Arr, int LastCellIndex, int Val) { Arr[LastCellIndex] = Val; int i; for ( i = 0 ; Arr[i] != Val; i++ ); return i; } 

this version should be faster - you do not need to check the boundaries of the array with each iteration through Arr.

Now the "problem". When you run these functions 100K times on an array of 100K elements in Debug, the second version is about 2 times faster. In the version, however, the first version is approximately 6,000 times faster. And the question is why.

A program that demonstrates this is located at http://eubie.sweb.cz/main.cpp

Any insight is greatly appreciated. Daniel

+8
c ++ performance
source share
7 answers

Here are my results using DevStudio 2005:

Debug:

  • Without block: 25.109
  • With block: 19.703

Release:

  • Without block: 0
  • With block: 6.046

It is very important to run this from the command line, and not from DevStudio, DevStudio something affects the performance of the application.

The only way to find out what is actually happening is to look at the assembler code. Here the assembler is generated in the release: -

 FindWithoutBlock: 00401000 xor eax,eax 00401002 cmp dword ptr [ecx+eax*4],0F4240h 00401009 je FindWithoutBlock+1Ah (40101Ah) 0040100B add eax,1 0040100E cmp eax,186A0h 00401013 jl FindWithoutBlock+2 (401002h) 00401015 mov eax,186A0h 0040101A ret 

Note that the compiler removed the ArrLen parameter and replaced it with a constant! He also saved it as a function.

Here is what the compiler did with another function (FindWithBlock): -

 004010E0 mov dword ptr [esp+38h],186A0h 004010E8 mov ebx,0F4240h 004010ED mov dword ptr [esi+61A80h],ebx 004010F3 xor eax,eax 004010F5 cmp dword ptr [esi],ebx 004010F7 je main+0EFh (40110Fh) 004010F9 lea esp,[esp] 00401100 add eax,1 00401103 cmp dword ptr [esi+eax*4],ebx 00401106 jne main+0E0h (401100h) 00401108 cmp eax,186A0h 0040110D je main+0F5h (401115h) 0040110F call dword ptr [__imp__getchar (4020D0h)] 00401115 sub dword ptr [esp+38h],1 0040111A jne main+0CDh (4010EDh) 

Here the function has been inserted. lea esp,[esp] - Only 7 bytes to align the next command. The code checks index 0 separately for all other indexes, but the main loop is more definite than the version of FindWithoutBlock.

Hmmm. Here is the code that calls FindWithoutBlock: -

 0040106F mov ecx,edi 00401071 mov ebx,eax 00401073 call FindWithoutBlock (401000h) 00401078 mov ebp,eax 0040107A mov edi,186A0h 0040107F cmp ebp,186A0h 00401085 je main+6Dh (40108Dh) 00401087 call dword ptr [__imp__getchar (4020D0h)] 0040108D sub edi,1 00401090 jne main+5Fh (40107Fh) 

Yeah! The FindWitoutBlock function is called only once! The compiler noticed that the function will return the same value each time and optimized it for one call. In FindWithBlock, the compiler cannot make the same assumption because you are writing an array before the search, so the array is (potentially) different for each call.

To verify this, add the volatile keyword as follows: -

 int FindWithoutBlock(volatile int * Arr, int ArrLen, int Val) { for ( int i = 0; i < ArrLen; i++ ) if ( Arr[i] == Val ) return i; return ArrLen; } int FindWithBlock(volatile int * Arr, int LastCellIndex, int Val) { Arr[LastCellIndex] = Val; int i; for ( i = 0 ; Arr[i] != Val; i++ ); return i; } 

At the same time, both versions work at the same time (6.040). Observing that memory access is the main bottleneck, more complex FindWithoutBlock tests do not affect overall speed.

+8
source share

First, drop the disgusting C-trash. std::find and iterators?

But secondly, the compiler optimizer is written to recognize the first form, not the second. This can be, for example, expanded, expanded, or vectorized, while the second cannot.

In general, consider a cache problem. You touch the end of the array, and then going to the beginning - it could be a cache miss. However, in the first block, you play fun only sequentially through an array, and the cache is coherent.

+2
source share

This is more of an extended comment than an answer. Skizz has already answered the question: "Aha! The FindWithoutBlock function is called only once!"

Test driver
Usually I usually put the code for the test driver and test article in separate files. First, you are not going to deliver a test driver. On the other hand, combining them, like you, allows the optimizer to do what you really do not want to do, for example, to call a function once, and not 100,000 times. Separating them allows you to use different optimization levels for the driver and test article. I try to compile a driver that is not optimized, so a loop that does the same thing 100K times will actually execute 100K times. On the other hand, the test article is compiled with the optimization expected for release.

Using getchar ()
It is usually a bad idea to use any I / O operations inside the test loop when testing CPU usage. Your test code calls getchar when the element to be found is not in the array. [An unsuccessful analysis remained.] Update: your test code calls getchar when the element to be found is in the array. Although your test code ensures that the element will not be found (and therefore getchar will not be called), it is still not recommended to have this call. Do something quick and benign.

C compared to C ++
Your code is more like C ± rather than C ++. You are using malloc , not new , you are mixing C and C ++ I / O, and you are not using a C ++ library such as std::find . This is typical for those moving from C to C ++. Good to know what std::find . This allows you to completely eliminate your FindWithoutBlock function.

Premature optimization
The only reason to use this FindWithBlock wording is that this search is a bottleneck. So is this really a bottleneck? The representation of FindWithoutBlock (and even better, std::find ) is probably better because you do not need to modify the array, and therefore the array argument can be marked as const . An array cannot be marked as such by FindWithBlock because you are FindWithBlock array.

+2
source share

I observe that in the first case, the compiler knows at run time the size of the loop (for example, <ArrLen). In the second case, the compiler cannot know.

0
source share

In the first example, at each iteration, two conditions are checked: i < ArrLen and Arr[i] == Val . In the second example, there is only one condition to check. Therefore, the first cycle is twice as slow.

I cannot observe the same behavior with GCC: the first cycle is even slower.

C -O0 :

 Without block: 25.83 With block: 20.35 

C- -O3 :

 Without block: 6.33 With block: 4.75 

I assume that the compiler somehow understood that there is no SearchVal in the array, and therefore there is no reason to call the function that searches for it.

0
source share

The first for .. loop contains two conditions for each iteration, while the second for loop contains one iteration for each loop. For a large number of iterations, this difference should be displayed, because there is a RAW relationship between the second condition and the iterator increment. But I still do not think that the acceleration should be so high.

0
source share

Your compiler is smart.

If you use the LLVM Try Out page, you will receive the following IR generated:

 define i32 @FindWithoutBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable readonly define i32 @FindWithBlock(i32* nocapture %Arr, i32 %ArrLen, i32 %Val) nounwind uwtable 

The only difference is the readonly attribute for the first function:

From the Language Reference :

only for reading

This attribute indicates that the function does not write any pointer arguments (including byval arguments) or otherwise changes any state (for example, memory, control registers, etc.) that is visible to caller functions. It can dereference pointer arguments and the read state that can be set in the caller. The readonly function always returns the same value (or unwinds the exception identically) when called with the same set of arguments and global state. It cannot disable exception by calling C ++ exception exception methods.

This means that perhaps the optimizer can understand that the function will always return the same calculation (for a given loop) and pull it out of the loop.

0
source share

All Articles