> vs. >= causes a significant difference in performance

I just stumbled upon something. At first I thought this might be a case of branch misprediction, as in this case, but I can't explain why a branch misprediction should cause this phenomenon. I implemented two versions of bubble sort in Java and ran some performance tests:

    import java.util.Random;

    public class BubbleSortAnnomaly {

        public static void main(String... args) {
            final int ARRAY_SIZE = Integer.parseInt(args[0]);
            final int LIMIT = Integer.parseInt(args[1]);
            final int RUNS = Integer.parseInt(args[2]);

            int[] a = new int[ARRAY_SIZE];
            int[] b = new int[ARRAY_SIZE];
            Random r = new Random();

            for (int run = 0; RUNS > run; ++run) {
                // Fill both arrays with identical random values in [0, LIMIT).
                for (int i = 0; i < ARRAY_SIZE; i++) {
                    a[i] = r.nextInt(LIMIT);
                    b[i] = a[i];
                }

                System.out.print("Sorting with sortA: ");
                long start = System.nanoTime();
                int swaps = bubbleSortA(a);
                System.out.println((System.nanoTime() - start) + " ns. "
                        + "It used " + swaps + " swaps.");

                System.out.print("Sorting with sortB: ");
                start = System.nanoTime();
                swaps = bubbleSortB(b);
                System.out.println((System.nanoTime() - start) + " ns. "
                        + "It used " + swaps + " swaps.");
            }
        }

        // Classic bubble sort; swaps only when strictly greater.
        public static int bubbleSortA(int[] a) {
            int counter = 0;
            for (int i = a.length - 1; i >= 0; --i) {
                for (int j = 0; j < i; ++j) {
                    if (a[j] > a[j + 1]) {
                        swap(a, j, j + 1);
                        ++counter;
                    }
                }
            }
            return counter;
        }

        // Identical except for >=, so equal neighbors are swapped as well.
        public static int bubbleSortB(int[] a) {
            int counter = 0;
            for (int i = a.length - 1; i >= 0; --i) {
                for (int j = 0; j < i; ++j) {
                    if (a[j] >= a[j + 1]) {
                        swap(a, j, j + 1);
                        ++counter;
                    }
                }
            }
            return counter;
        }

        private static void swap(int[] a, int j, int i) {
            int h = a[i];
            a[i] = a[j];
            a[j] = h;
        }
    }

As you can see, the only difference between the two sorting methods is > vs. >=. When you run the program with java BubbleSortAnnomaly 50000 10 10, one would obviously expect sortB to be slower than sortA, because it has to execute more swaps. But I got the following (or similar) output on three different machines:

    Sorting with sortA: 4.214 seconds. It used 564960211 swaps.
    Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
    Sorting with sortA: 4.199 seconds. It used 563355818 swaps.
    Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
    Sorting with sortA: 4.189 seconds. It used 560825110 swaps.
    Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
    Sorting with sortA: 4.17 seconds. It used 561924561 swaps.
    Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
    Sorting with sortA: 4.198 seconds. It used 562613693 swaps.
    Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
    Sorting with sortA: 4.19 seconds. It used 561658723 swaps.
    Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
    Sorting with sortA: 4.193 seconds. It used 564986461 swaps.
    Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
    Sorting with sortA: 4.203 seconds. It used 562526980 swaps.
    Sorting with sortB: 2.27 seconds. It used 1249749609 swaps.
    Sorting with sortA: 4.176 seconds. It used 561070571 swaps.
    Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
    Sorting with sortA: 4.191 seconds. It used 559883210 swaps.
    Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.

When you set the LIMIT parameter to, for example, 50000 (java BubbleSortAnnomaly 50000 50000 10), you get the expected results:

    Sorting with sortA: 3.983 seconds. It used 625941897 swaps.
    Sorting with sortB: 4.658 seconds. It used 789391382 swaps.

I ported the program to C++ to determine whether this issue is Java-specific. Here is the C++ code:

    #include <cstdlib>
    #include <iostream>
    #include <omp.h>

    #ifndef ARRAY_SIZE
    #define ARRAY_SIZE 50000
    #endif

    #ifndef LIMIT
    #define LIMIT 10
    #endif

    #ifndef RUNS
    #define RUNS 10
    #endif

    void swap(int* a, int i, int j) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }

    int bubbleSortA(int* a) {
        const int LAST = ARRAY_SIZE - 1;
        int counter = 0;
        for (int i = LAST; i > 0; --i) {
            for (int j = 0; j < i; ++j) {
                int next = j + 1;
                if (a[j] > a[next]) {
                    swap(a, j, next);
                    ++counter;
                }
            }
        }
        return counter;
    }

    int bubbleSortB(int* a) {
        const int LAST = ARRAY_SIZE - 1;
        int counter = 0;
        for (int i = LAST; i > 0; --i) {
            for (int j = 0; j < i; ++j) {
                int next = j + 1;
                if (a[j] >= a[next]) {
                    swap(a, j, next);
                    ++counter;
                }
            }
        }
        return counter;
    }

    int main() {
        int* a = (int*) malloc(ARRAY_SIZE * sizeof(int));
        int* b = (int*) malloc(ARRAY_SIZE * sizeof(int));
        for (int run = 0; RUNS > run; ++run) {
            for (int idx = 0; idx < ARRAY_SIZE; ++idx) {
                a[idx] = std::rand() % LIMIT;
                b[idx] = a[idx];
            }

            std::cout << "Sorting with sortA: ";
            double start = omp_get_wtime();
            int swaps = bubbleSortA(a);
            std::cout << (omp_get_wtime() - start) << " seconds. It used "
                      << swaps << " swaps." << std::endl;

            std::cout << "Sorting with sortB: ";
            start = omp_get_wtime();
            swaps = bubbleSortB(b);
            std::cout << (omp_get_wtime() - start) << " seconds. It used "
                      << swaps << " swaps." << std::endl;
        }
        free(a);
        free(b);
        return 0;
    }

This program shows the same behavior. Can someone explain what exactly is going on here?

Running sortB first and then sortA does not change the results.

+71
java c++ performance optimization
May 13 '15 at 19:14
4 answers

I think it may indeed be due to branch prediction. If you compare the number of swaps to the number of inner-loop iterations, you find:

Limit = 10

  • A = 560M swaps / 1250M inner-loop iterations
  • B = 1250M swaps / 1250M inner-loop iterations (0.02% fewer swaps than iterations)

Limit = 50,000

  • A = 627M swaps / 1250M inner-loop iterations
  • B = 850M swaps / 1250M inner-loop iterations

So in the Limit == 10 case, the swap in sortB is performed 99.98% of the time, which is obviously favorable for the branch predictor. In the Limit == 50000 case, the swap is hit only about 68% of the time, and essentially at random, so the branch predictor is much less beneficial.
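These ratios are easy to reproduce. Below is a minimal sketch (my own code, not from the question) that counts inner-loop iterations alongside swaps for the >= variant; switching the condition to > gives the numbers for sortA:

    import java.util.Random;

    public class SwapRatio {
        public static void main(String[] args) {
            final int ARRAY_SIZE = 50000;
            final int LIMIT = 10; // try 50000 for the second scenario
            int[] a = new int[ARRAY_SIZE];
            Random r = new Random();
            for (int i = 0; i < ARRAY_SIZE; i++) {
                a[i] = r.nextInt(LIMIT);
            }
            long swaps = 0;
            long iterations = 0;
            for (int i = a.length - 1; i >= 0; --i) {
                for (int j = 0; j < i; ++j) {
                    ++iterations;
                    if (a[j] >= a[j + 1]) { // sortB's condition; use > for sortA
                        int h = a[j];
                        a[j] = a[j + 1];
                        a[j + 1] = h;
                        ++swaps;
                    }
                }
            }
            System.out.printf("%d swaps in %d iterations (%.2f%% taken)%n",
                    swaps, iterations, 100.0 * swaps / iterations);
        }
    }

With LIMIT = 10, the swap branch should come out taken roughly 99.98% of the time, matching the figures above.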

+42
May 13 '15 at 19:44

I think this can indeed be explained by branch misprediction.

Consider, for example, LIMIT = 11 and sortB. On the first iteration of the outer loop, it will very quickly stumble upon one of the elements equal to 10. So it will have a[j] = 10, and therefore a[j] will definitely be >= a[next], since there are no elements greater than 10. Therefore, it will perform a swap, then take one step in j only to find a[j] = 10 again (the same swapped value). So once again a[j] >= a[next] will hold, and so on. Every comparison, except a few at the very beginning, will be true. And it will run the same way on the following iterations of the outer loop.

Not so with sortA. It will start roughly the same way, stumble upon a[j] = 10, and perform some swaps in a similar manner, but only up to the point where it finds a[next] = 10 too. Then the condition will be false, and no swap will be performed. And so on: every time it comes across a[next] = 10, the condition is false and no swap is done. Therefore, this condition is true 10 times out of 11 (values of a[next] from 0 to 9) and false in 1 case out of 11. It is not strange that the branch prediction fails.
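To make the pattern visible, here is a small sketch (entirely my own, not from the answer) that records the outcome of each comparison during the first sweep of both variants; T marks a taken swap branch, f a non-taken one:

    import java.util.Random;

    public class BranchTrace {
        public static void main(String[] args) {
            final int N = 40;          // small, so the trace fits on one line
            final int LIMIT = 11;
            int[] a = new int[N];
            Random r = new Random(42); // fixed seed for a repeatable trace
            for (int i = 0; i < N; i++) {
                a[i] = r.nextInt(LIMIT);
            }
            System.out.println("sortA (>):  " + firstSweep(a.clone(), false));
            System.out.println("sortB (>=): " + firstSweep(a.clone(), true));
        }

        // Runs the first iteration of the outer loop and records each
        // branch outcome: 'T' if the swap was taken, 'f' otherwise.
        static String firstSweep(int[] a, boolean orEqual) {
            StringBuilder outcomes = new StringBuilder();
            for (int j = 0; j < a.length - 1; ++j) {
                boolean swap = orEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
                outcomes.append(swap ? 'T' : 'f');
                if (swap) {
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                }
            }
            return outcomes.toString();
        }
    }

Once the sweep picks up a 10, the sortB line turns into an unbroken run of T, while the sortA line keeps flipping back to f whenever another 10 is encountered.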

+11
May 13 '15 at 19:40

Using the provided C++ code (with the time measurement removed) and the perf stat command, I got results that confirm the branch-miss theory.

With Limit = 10, BubbleSortB benefits greatly from branch prediction (0.01% misses), but with Limit = 50000 branch prediction fails even more often (15.65% misses) than in BubbleSortA (12.69% and 12.76% misses, respectively).
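For reference, measurements like these could be reproduced roughly as follows; the answer only names the binaries bubbleA.out and bubbleB.out, so building one binary per variant and the exact compiler flags are my assumptions:

    # -fopenmp is needed for omp_get_wtime(); the other flags are assumptions
    g++ -O2 -fopenmp -DLIMIT=10 bubbleA.cpp -o bubbleA.out
    perf stat ./bubbleA.out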

BubbleSortA Limit = 10:

    Performance counter stats for './bubbleA.out':

         46670.947364 task-clock                #    0.998 CPUs utilized
                   73 context-switches          #    0.000 M/sec
                   28 CPU-migrations            #    0.000 M/sec
                  379 page-faults               #    0.000 M/sec
      117,298,787,242 cycles                    #    2.513 GHz
      117,471,719,598 instructions              #    1.00  insns per cycle
       25,104,504,912 branches                  #  537.904 M/sec
        3,185,376,029 branch-misses             #   12.69% of all branches

         46.779031563 seconds time elapsed

BubbleSortA Limit = 50000:

    Performance counter stats for './bubbleA.out':

         46023.785539 task-clock                #    0.998 CPUs utilized
                   59 context-switches          #    0.000 M/sec
                    8 CPU-migrations            #    0.000 M/sec
                  379 page-faults               #    0.000 M/sec
      118,261,821,200 cycles                    #    2.570 GHz
      119,230,362,230 instructions              #    1.01  insns per cycle
       25,089,204,844 branches                  #  545.136 M/sec
        3,200,514,556 branch-misses             #   12.76% of all branches

         46.126274884 seconds time elapsed

BubbleSortB Limit = 10:

    Performance counter stats for './bubbleB.out':

         26091.323705 task-clock                #    0.998 CPUs utilized
                   28 context-switches          #    0.000 M/sec
                    2 CPU-migrations            #    0.000 M/sec
                  379 page-faults               #    0.000 M/sec
       64,822,368,062 cycles                    #    2.484 GHz
      137,780,774,165 instructions              #    2.13  insns per cycle
       25,052,329,633 branches                  #  960.179 M/sec
            3,019,138 branch-misses             #    0.01% of all branches

         26.149447493 seconds time elapsed

BubbleSortB Limit = 50000:

    Performance counter stats for './bubbleB.out':

         51644.210268 task-clock                #    0.983 CPUs utilized
                2,138 context-switches          #    0.000 M/sec
                   69 CPU-migrations            #    0.000 M/sec
                  378 page-faults               #    0.000 M/sec
      144,600,738,759 cycles                    #    2.800 GHz
      124,273,104,207 instructions              #    0.86  insns per cycle
       25,104,320,436 branches                  #  486.101 M/sec
        3,929,572,460 branch-misses             #   15.65% of all branches

         52.511233236 seconds time elapsed
+9
May 14 '15 at 7:38

Edit 2: This answer is probably incorrect in most cases. Below, where I say that everything above is fact, that is still true, but the bottom part is incorrect for most processor architectures; see the comments. However, it is still theoretically possible for some JVM on some OS/architecture to do this, though such a JVM would probably be poorly implemented or it would be a strange architecture. Also, this is theoretically possible only in the sense that most conceivable things are theoretically possible, so I would take the last portion with a grain of salt.

First of all, I can't speak for C++, but I can talk about Java.

Here is the code:

    public class Example {
        public static boolean less(final int a, final int b) {
            return a < b;
        }

        public static boolean lessOrEqual(final int a, final int b) {
            return a <= b;
        }
    }

Running javap -c on it, I get this bytecode:

    public class Example {
      public Example();
        Code:
           0: aload_0
           1: invokespecial #8    // Method java/lang/Object."<init>":()V
           4: return

      public static boolean less(int, int);
        Code:
           0: iload_0
           1: iload_1
           2: if_icmpge     7
           5: iconst_1
           6: ireturn
           7: iconst_0
           8: ireturn

      public static boolean lessOrEqual(int, int);
        Code:
           0: iload_0
           1: iload_1
           2: if_icmpgt     7
           5: iconst_1
           6: ireturn
           7: iconst_0
           8: ireturn
    }

You will notice that the only difference is if_icmpge (if int comparison is greater-or-equal) versus if_icmpgt (if int comparison is greater-than). Note that the bytecode branches on the inverted condition: in less, for example, if_icmpge jumps to the return-false path when a >= b.

Everything above is fact; the rest is my best guess as to how if_icmpge and if_icmpgt work, based on a college course I took in assembly. To get a better answer, you should look at how your JVM handles these instructions. My guess is that C++ also compiles down to a similar operation.

Edit: the documentation on if_icmp<cond> is here.

The way computers compare numbers is by subtracting one from the other and checking whether that result is 0 or not. So for a < b, it subtracts b from a and checks whether the result is negative (a - b < 0). To do a <= b, though, it has to take an additional step and subtract 1 as well (a - b - 1 < 0).

Usually this is a very minor difference, but this isn't just any code, this is sorting! In particular, O(n^2) is the average number of times we perform this particular comparison, because it sits in the innermost loop.
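As a quick sanity check of that count: for ARRAY_SIZE = 50000 the inner comparison executes (n - 1) + (n - 2) + ... + 1 = n(n - 1)/2 = 50000 * 49999 / 2 ≈ 1.25 * 10^9 times, which matches the ~1250M inner-loop iterations counted in the first answer.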

Yes, this may have something to do with branch prediction. I am not sure, as I am not an expert on that, but I think it can play a non-trivial role here as well.

+3
May 13 '15 at 20:15


