This JMH test is not consistent across machines - why?

I am trying to write a method like this:

    static boolean fitsInDouble(long x) {
        // return true if x can be represented
        // as a numerically-equivalent double
    }
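
For concreteness, this is the behaviour I expect for a few sample inputs (a double has a 53-bit significand, so any long that needs more than 53 significant bits cannot be represented exactly):

    // Expected behaviour of fitsInDouble:
    fitsInDouble(123);                // true  - small integers are exact
    fitsInDouble(1L << 53);           // true  - 2^53 is exactly representable
    fitsInDouble((1L << 53) + 1);     // false - rounds to 2^53 when cast to double
    fitsInDouble((1L << 60) + 1);     // false - needs more than 53 significant bits
    fitsInDouble(Long.MAX_VALUE);     // false - rounds up to 2^63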

I'm trying to find the most efficient implementation. I settled on one, but then a colleague ran the benchmarks and got different relative results: the fastest implementation on my machine is not the fastest on his.

Is there something wrong with these tests?

    package rnd;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.Warmup;
    import org.openjdk.jmh.infra.Blackhole;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    import java.math.BigDecimal;
    import java.util.concurrent.TimeUnit;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(1)
    @Measurement(iterations = 5)
    @Warmup(iterations = 5)
    public class Benchmarks {

        public static void main(String[] args) throws Exception {
            Options options = new OptionsBuilder()
                    .include(Benchmarks.class.getName())
                    .build();
            new Runner(options).run();
        }

        @Benchmark
        public void bigDecimal(Blackhole bh) {
            for (long x : NUMBERS) bh.consume(bigDecimal(x));
        }

        @Benchmark
        public void cast(Blackhole bh) {
            for (long x : NUMBERS) bh.consume(cast(x));
        }

        @Benchmark
        public void zeros(Blackhole bh) {
            for (long x : NUMBERS) bh.consume(zeros(x));
        }

        public static boolean bigDecimal(long x) {
            BigDecimal a = new BigDecimal(x);
            BigDecimal b = new BigDecimal((double) x);
            return a.compareTo(b) == 0;
        }

        public static boolean cast(long x) {
            return x == (long) (double) x && x != Long.MAX_VALUE;
        }

        public static boolean zeros(long x) {
            long a = Math.abs(x);
            int z = Long.numberOfLeadingZeros(a);
            return z > 10 || Long.numberOfTrailingZeros(a) > 10 - z;
        }

        private static final long[] NUMBERS = {
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
            -1, -2, -3, -4, -5, -6, -7, -8, -9, -10,
            123, 456, 789, -123, -456, -789,
            101112, 131415, 161718, -101112, -131415, -161718,
            11L, 222L, 3333L, 44444L, 555555L, 6666666L, 77777777L, 888888888L, 9999999999L,
            1111L, 22222L, 333333L, 4444444L, 55555555L, 666666666L, 7777777777L, 88888888888L, 999999999999L,
            11111111, 222222222, 3333333333L, 44444444444L, 555555555555L, 6666666666666L,
            77777777777777L, 888888888888888L, 9999999999999999L,
            Long.MAX_VALUE, Long.MAX_VALUE - 1,
            Long.MIN_VALUE, Long.MIN_VALUE + 1,
            (1L << 53), (1L << 53) + 1, (1L << 53) + 2,
            (1L << 60), (1L << 60) + 1, (1L << 60) + 8, (1L << 60) + 32,
            (1L << 60) + 64, (1L << 60) + 128, (1L << 60) + 256,
            (-1L << 53), (-1L << 53) - 1, (-1L << 53) - 2,
            (-1L << 60), (-1L << 60) - 1, (-1L << 60) - 8, (-1L << 60) - 32,
            (-1L << 60) - 64, (-1L << 60) - 128, (-1L << 60) - 256
        };
    }

There are small differences in our environments:

Me: Windows 10, JDK 1.8.0_45, zeros is the fastest

Him: Windows 7, JDK 1.8.0_20, cast is the fastest

Our results are self-consistent from run to run, whether we run from the IDE or from the command line. We are using JMH 1.10.5.

What's going on here? This test seems unreliable and I don't know how to fix it.

+8
java performance jmh
3 answers

I can reproduce the different results even on the same machine with the same environment: sometimes cast is a little faster, sometimes zeros is.

    # JMH 1.10.5 (released 9 days ago)
    # VM invoker: C:\Program Files\Java\jdk1.8.0_40\jre\bin\java.exe
    # VM options: -Didea.launcher.port=7540 -Didea.launcher.bin.path=C:\Program Files (x86)\IDEA 14.1.3\bin -Dfile.encoding=UTF-8
    # Warmup: 3 iterations, 1 s each
    # Measurement: 5 iterations, 1 s each
    # Timeout: 10 min per iteration
    # Threads: 1 thread, will synchronize iterations
    # Benchmark mode: Average time, time/op
    # Benchmark: bench.LongDouble.cast

    # Run progress: 0,00% complete, ETA 00:01:20
    # Fork: 1 of 5
    # Warmup Iteration 1: 513,793 ns/op
    # Warmup Iteration 2: 416,508 ns/op
    # Warmup Iteration 3: 402,110 ns/op
    Iteration 1: 402,535 ns/op
    Iteration 2: 403,999 ns/op
    Iteration 3: 404,871 ns/op
    Iteration 4: 404,845 ns/op
    Iteration 5: 401,705 ns/op

    # Run progress: 10,00% complete, ETA 00:01:16
    # Fork: 2 of 5
    # Warmup Iteration 1: 421,552 ns/op
    # Warmup Iteration 2: 418,925 ns/op
    # Warmup Iteration 3: 421,813 ns/op
    Iteration 1: 420,978 ns/op
    Iteration 2: 422,940 ns/op
    Iteration 3: 422,009 ns/op
    Iteration 4: 423,011 ns/op
    Iteration 5: 422,406 ns/op

    # Run progress: 20,00% complete, ETA 00:01:07
    # Fork: 3 of 5
    # Warmup Iteration 1: 414,057 ns/op
    # Warmup Iteration 2: 410,364 ns/op
    # Warmup Iteration 3: 402,330 ns/op
    Iteration 1: 402,776 ns/op
    Iteration 2: 404,764 ns/op
    Iteration 3: 400,346 ns/op
    Iteration 4: 403,227 ns/op
    Iteration 5: 403,350 ns/op

    # Run progress: 30,00% complete, ETA 00:00:58
    # Fork: 4 of 5
    # Warmup Iteration 1: 422,161 ns/op
    # Warmup Iteration 2: 419,118 ns/op
    # Warmup Iteration 3: 402,990 ns/op
    Iteration 1: 401,592 ns/op
    Iteration 2: 402,999 ns/op
    Iteration 3: 403,035 ns/op
    Iteration 4: 402,625 ns/op
    Iteration 5: 403,396 ns/op

    # Run progress: 40,00% complete, ETA 00:00:50
    # Fork: 5 of 5
    # Warmup Iteration 1: 422,621 ns/op
    # Warmup Iteration 2: 419,596 ns/op
    # Warmup Iteration 3: 403,047 ns/op
    Iteration 1: 403,438 ns/op
    Iteration 2: 405,066 ns/op
    Iteration 3: 403,271 ns/op
    Iteration 4: 403,021 ns/op
    Iteration 5: 402,162 ns/op

    Result "cast": 406,975 ±(99.9%) 5,906 ns/op [Average]
      (min, avg, max) = (400,346, 406,975, 423,011), stdev = 7,884
      CI (99.9%): [401,069, 412,881] (assumes normal distribution)

    # JMH 1.9.3 (released 114 days ago, please consider updating!)
    # VM invoker: C:\Program Files\Java\jdk1.8.0_40\jre\bin\java.exe
    # VM options: -Didea.launcher.port=7540 -Didea.launcher.bin.path=C:\Program Files (x86)\IDEA 14.1.3\bin -Dfile.encoding=UTF-8
    # Warmup: 3 iterations, 1 s each
    # Measurement: 5 iterations, 1 s each
    # Timeout: 10 min per iteration
    # Threads: 1 thread, will synchronize iterations
    # Benchmark mode: Average time, time/op
    # Benchmark: bench.LongDouble.zeros

    # Run progress: 50,00% complete, ETA 00:00:41
    # Fork: 1 of 5
    # Warmup Iteration 1: 439,529 ns/op
    # Warmup Iteration 2: 437,752 ns/op
    # Warmup Iteration 3: 390,530 ns/op
    Iteration 1: 389,394 ns/op
    Iteration 2: 391,453 ns/op
    Iteration 3: 390,446 ns/op
    Iteration 4: 390,822 ns/op
    Iteration 5: 389,850 ns/op

    # Run progress: 60,00% complete, ETA 00:00:33
    # Fork: 2 of 5
    # Warmup Iteration 1: 438,252 ns/op
    # Warmup Iteration 2: 437,446 ns/op
    # Warmup Iteration 3: 448,328 ns/op
    Iteration 1: 389,979 ns/op
    Iteration 2: 392,741 ns/op
    Iteration 3: 390,575 ns/op
    Iteration 4: 390,492 ns/op
    Iteration 5: 390,000 ns/op

    # Run progress: 70,00% complete, ETA 00:00:25
    # Fork: 3 of 5
    # Warmup Iteration 1: 447,939 ns/op
    # Warmup Iteration 2: 444,489 ns/op
    # Warmup Iteration 3: 414,433 ns/op
    Iteration 1: 417,409 ns/op
    Iteration 2: 413,518 ns/op
    Iteration 3: 413,388 ns/op
    Iteration 4: 414,040 ns/op
    Iteration 5: 415,935 ns/op

    # Run progress: 80,00% complete, ETA 00:00:16
    # Fork: 4 of 5
    # Warmup Iteration 1: 439,012 ns/op
    # Warmup Iteration 2: 437,345 ns/op
    # Warmup Iteration 3: 388,208 ns/op
    Iteration 1: 395,647 ns/op
    Iteration 2: 389,221 ns/op
    Iteration 3: 387,539 ns/op
    Iteration 4: 388,524 ns/op
    Iteration 5: 387,623 ns/op

    # Run progress: 90,00% complete, ETA 00:00:08
    # Fork: 5 of 5
    # Warmup Iteration 1: 446,116 ns/op
    # Warmup Iteration 2: 446,622 ns/op
    # Warmup Iteration 3: 409,116 ns/op
    Iteration 1: 409,761 ns/op
    Iteration 2: 410,146 ns/op
    Iteration 3: 410,060 ns/op
    Iteration 4: 409,370 ns/op
    Iteration 5: 411,114 ns/op

    Result "zeros": 399,162 ±(99.9%) 8,487 ns/op [Average]
      (min, avg, max) = (387,539, 399,162, 417,409), stdev = 11,330
      CI (99.9%): [390,675, 407,649] (assumes normal distribution)

    # Run complete. Total time: 00:01:23

    Benchmark         Mode  Cnt    Score   Error  Units
    LongDouble.cast   avgt   25  406,975 ± 5,906  ns/op
    LongDouble.zeros  avgt   25  399,162 ± 8,487  ns/op

After some analysis, I found that the problem is not in the benchmark, but in JMH. The perfasm profiler pointed to the Blackhole.consume method:

    public final void consume(boolean bool) {
        boolean bool1 = this.bool1; // volatile read
        boolean bool2 = this.bool2;
        if (bool == bool1 & bool == bool2) {
            // SHOULD NEVER HAPPEN
            nullBait.bool1 = bool; // implicit null pointer exception
        }
    }

The interesting part is how bool1 and bool2 are initialized:

    Random r = new Random(System.nanoTime());
    ...
    bool1 = r.nextBoolean();
    bool2 = !bool1;

Yes, they are random on every run! As you know, the JIT compiler relies on the runtime execution profile, so the generated code differs slightly depending on the initial values of bool1 and bool2; in particular, in about half of the runs the branch is compiled, and in the other half it is not. This is where the difference comes from.

I have submitted a bug report against JMH with a proposed fix; we will see whether the authors confirm the defect.
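
In the meantime, if you want the benchmark itself to be less sensitive to this, one possibility (my own sketch, not an official JMH recommendation) is to keep the boolean overload of consume off the hot path: fold the per-element results into an int and let JMH consume the returned value instead. This reuses the NUMBERS array and cast(long) method from the question's Benchmarks class:

    @Benchmark
    public int castFolded() {
        int acc = 0;
        for (long x : NUMBERS) {
            acc += cast(x) ? 1 : 0;  // no Blackhole.consume(boolean) on the hot path
        }
        return acc;                  // JMH consumes the returned value itself
    }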

+6

As JB Nizet notes, you cannot assume that a program will behave the same across different JVM versions and/or operating systems, especially if the machines themselves differ as well.

By the way, you do not need numberOfLeadingZeros(a) at all:

    public static boolean zeroes2(final long x) {
        final long a = Math.abs(x);
        return (a >>> Long.numberOfTrailingZeros(a)) < (1L << 53);
    }
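
Whichever variant you end up with, it is worth sanity-checking it against the slow but obviously correct bigDecimal version over the NUMBERS array before benchmarking it. A minimal sketch, assuming it is dropped into the question's Benchmarks class next to zeroes2 (verifyZeroes2 is a name I made up):

    // Hypothetical check; sits inside Benchmarks so it can see the private
    // NUMBERS array and the bigDecimal(long) reference implementation.
    static void verifyZeroes2() {
        for (long x : NUMBERS) {
            boolean expected = bigDecimal(x); // slow but obviously correct reference
            boolean actual = zeroes2(x);
            if (expected != actual) {
                throw new AssertionError("Mismatch for " + x
                        + ": expected " + expected + ", got " + actual);
            }
        }
    }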

Finally, if you really need maximum performance, either benchmark on a representative selection of machines and pick the variant that does best overall (as long as there is no machine where it does much worse, which is unlikely with code like yours), or ship all the variants and add a small calibration step that times each one and selects the fastest for the machine it actually runs on.
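
A very rough sketch of what such a calibration step could look like is below; FitsInDoubleCalibration, pickFastest and the repetition count are my own invented names and numbers, and a real version would want proper warm-up (or could simply reuse JMH):

    import java.util.function.LongPredicate;

    // Hypothetical calibration helper; assumed to live in the same package (rnd)
    // as the question's Benchmarks class so it can reference cast() and zeros().
    final class FitsInDoubleCalibration {

        static volatile boolean sideEffect; // keeps the measured loops from being optimized away

        static LongPredicate pickFastest(long[] samples) {
            LongPredicate[] candidates = { Benchmarks::cast, Benchmarks::zeros };
            LongPredicate best = candidates[0];
            long bestTime = Long.MAX_VALUE;
            for (LongPredicate candidate : candidates) {
                boolean sink = false;
                long start = System.nanoTime();
                for (int rep = 0; rep < 100_000; rep++) {
                    for (long x : samples) {
                        sink ^= candidate.test(x); // keep every result live
                    }
                }
                long elapsed = System.nanoTime() - start;
                sideEffect = sink;
                if (elapsed < bestTime) {
                    bestTime = elapsed;
                    best = candidate;
                }
            }
            return best;
        }
    }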

Edit: Also, as Javier points out, be sure to benchmark against several workloads that resemble your real-world data.

+2

CPUs change over time. If the relative cost of operations shifts between CPU generations, or a newer CPU has better branch prediction, you can end up with a consistent difference between machines.

If you want to rule out one alternative having an incidental advantage on this particular data set (for example, the branch predictor learning to guess the next case in a way that would not happen with real-life data), you can shuffle the data set and make it longer. Try it if you like, although it may turn out to be pointless (and questionable, if your real data actually is that predictable).

    final int times = 10;
    List<Long> listShuffle = new ArrayList<>(NUMBERS.length * times);
    for (long l : NUMBERS) {
        for (int i = 0; i < times; i++) {
            listShuffle.add(l);
        }
    }
    // Shake and serve chilled
    Collections.shuffle(listShuffle);
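
If you go this route, build the shuffled copy once per fork in a @Setup method so that the shuffling itself is not measured. A minimal sketch against the question's Benchmarks class; the shuffled field, setUpShuffled method and castShuffled benchmark are names I made up:

    // Extra imports needed: java.util.ArrayList, java.util.Collections, java.util.List,
    // java.util.Random, org.openjdk.jmh.annotations.Level, org.openjdk.jmh.annotations.Setup

    private long[] shuffled;

    @Setup(Level.Trial)
    public void setUpShuffled() {
        final int times = 10;
        List<Long> list = new ArrayList<>(NUMBERS.length * times);
        for (long l : NUMBERS) {
            for (int i = 0; i < times; i++) {
                list.add(l);
            }
        }
        Collections.shuffle(list, new Random(42)); // fixed seed keeps runs comparable
        shuffled = list.stream().mapToLong(Long::longValue).toArray();
    }

    @Benchmark
    public void castShuffled(Blackhole bh) {
        for (long x : shuffled) bh.consume(cast(x));
    }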

P.S. #1: Also, if you can benchmark with samples of real data rather than synthetic data, the results will be more relevant. If the CPU manages to exploit patterns in that data, it is then a genuine advantage rather than a benchmark artifact.

+2
