Why is long slower than int in x64 Java?

I am using Windows 8.1 x64 with 64-bit Java 7 update 45 (and no 32-bit Java installed) on a Surface Pro 2 tablet.

The following code takes 1688 ms when i is declared as a long and 109 ms when it is an int. Why is long (a 64-bit type) an order of magnitude slower than int on a 64-bit platform with a 64-bit JVM?

My only speculation is that the processor takes longer to add a 64-bit integer than a 32-bit one, but that seems unlikely; I suspect Haswell doesn't use ripple-carry adders.

I run this in Eclipse Kepler SR1, btw.

    public class Main {
        private static long i = Integer.MAX_VALUE;

        public static void main(String[] args) {
            System.out.println("Starting the loop");
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheck()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheck() {
            return --i < 0;
        }
    }

Edit: Here are the results from the equivalent C++ code (see below), compiled with VS 2013 on the same system, in 32-bit debug mode:

    long: 72265 ms
    int:  74656 ms

In 64-bit mode:

    long: 875 ms
    long: 906 ms
    int:  1047 ms

This suggests that the result I observed is a JVM optimization effect, not a CPU limitation.

 #include "stdafx.h" #include "iostream" #include "windows.h" #include "limits.h" long long i = INT_MAX; using namespace std; boolean decrementAndCheck() { return --i < 0; } int _tmain(int argc, _TCHAR* argv[]) { cout << "Starting the loop" << endl; unsigned long startTime = GetTickCount64(); while (!decrementAndCheck()){ } unsigned long endTime = GetTickCount64(); cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl; } 

Edit: Just tried this again with Java 8 RTM; no major change.

+90
java performance long-integer 32bit-64bit
Nov 07 '13 at 18:41
7 answers

My JVM does this pretty straightforward thing to the inner loop when you use longs:

    0x00007fdd859dbb80: test   %eax,0x5f7847a(%rip)   /* fun JVM hack */
    0x00007fdd859dbb86: dec    %r11                   /* i-- */
    0x00007fdd859dbb89: mov    %r11,0x258(%r10)       /* store i to memory */
    0x00007fdd859dbb90: test   %r11,%r11              /* unnecessary test */
    0x00007fdd859dbb93: jge    0x00007fdd859dbb80     /* go back to the loop top */

It cheats, hard, when you use ints; first there is some messiness that I don't claim to fully understand, but it looks like setup for an unrolled loop:

    0x00007f3dc290b5a1: mov    %r11d,%r9d
    0x00007f3dc290b5a4: dec    %r9d
    0x00007f3dc290b5a7: mov    %r9d,0x258(%r10)
    0x00007f3dc290b5ae: test   %r9d,%r9d
    0x00007f3dc290b5b1: jl     0x00007f3dc290b662
    0x00007f3dc290b5b7: add    $0xfffffffffffffffe,%r11d
    0x00007f3dc290b5bb: mov    %r9d,%ecx
    0x00007f3dc290b5be: dec    %ecx
    0x00007f3dc290b5c0: mov    %ecx,0x258(%r10)
    0x00007f3dc290b5c7: cmp    %r11d,%ecx
    0x00007f3dc290b5ca: jle    0x00007f3dc290b5d1
    0x00007f3dc290b5cc: mov    %ecx,%r9d
    0x00007f3dc290b5cf: jmp    0x00007f3dc290b5bb
    0x00007f3dc290b5d1: and    $0xfffffffffffffffe,%r9d
    0x00007f3dc290b5d5: mov    %r9d,%r8d
    0x00007f3dc290b5d8: neg    %r8d
    0x00007f3dc290b5db: sar    $0x1f,%r8d
    0x00007f3dc290b5df: shr    $0x1f,%r8d
    0x00007f3dc290b5e3: sub    %r9d,%r8d
    0x00007f3dc290b5e6: sar    %r8d
    0x00007f3dc290b5e9: neg    %r8d
    0x00007f3dc290b5ec: and    $0xfffffffffffffffe,%r8d
    0x00007f3dc290b5f0: shl    %r8d
    0x00007f3dc290b5f3: mov    %r8d,%r11d
    0x00007f3dc290b5f6: neg    %r11d
    0x00007f3dc290b5f9: sar    $0x1f,%r11d
    0x00007f3dc290b5fd: shr    $0x1e,%r11d
    0x00007f3dc290b601: sub    %r8d,%r11d
    0x00007f3dc290b604: sar    $0x2,%r11d
    0x00007f3dc290b608: neg    %r11d
    0x00007f3dc290b60b: and    $0xfffffffffffffffe,%r11d
    0x00007f3dc290b60f: shl    $0x2,%r11d
    0x00007f3dc290b613: mov    %r11d,%r9d
    0x00007f3dc290b616: neg    %r9d
    0x00007f3dc290b619: sar    $0x1f,%r9d
    0x00007f3dc290b61d: shr    $0x1d,%r9d
    0x00007f3dc290b621: sub    %r11d,%r9d
    0x00007f3dc290b624: sar    $0x3,%r9d
    0x00007f3dc290b628: neg    %r9d
    0x00007f3dc290b62b: and    $0xfffffffffffffffe,%r9d
    0x00007f3dc290b62f: shl    $0x3,%r9d
    0x00007f3dc290b633: mov    %ecx,%r11d
    0x00007f3dc290b636: sub    %r9d,%r11d
    0x00007f3dc290b639: cmp    %r11d,%ecx
    0x00007f3dc290b63c: jle    0x00007f3dc290b64f
    0x00007f3dc290b63e: xchg   %ax,%ax                /* OK, fine; I know what a nop looks like */

then the unrolled loop itself:

    0x00007f3dc290b640: add    $0xfffffffffffffff0,%ecx
    0x00007f3dc290b643: mov    %ecx,0x258(%r10)
    0x00007f3dc290b64a: cmp    %r11d,%ecx
    0x00007f3dc290b64d: jg     0x00007f3dc290b640

then the teardown code for the unrolled loop, followed by the test and a straight, non-unrolled loop:

    0x00007f3dc290b64f: cmp    $0xffffffffffffffff,%ecx
    0x00007f3dc290b652: jle    0x00007f3dc290b662
    0x00007f3dc290b654: dec    %ecx
    0x00007f3dc290b656: mov    %ecx,0x258(%r10)
    0x00007f3dc290b65d: cmp    $0xffffffffffffffff,%ecx
    0x00007f3dc290b660: jg     0x00007f3dc290b654

So the int version runs about 16 times as fast here because the JIT unrolled the int loop 16 times but didn't unroll the long loop at all.
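One rough way to probe that from the Java side (a sketch of my own; the class name LongStride is invented and this was not part of the measurements above) is to hand-unroll the long countdown in strides of 16 and compare it against the original --i loop:

    public class LongStride {
        private static long i = Integer.MAX_VALUE;

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            // Count down in strides of 16, roughly mimicking what the JIT does for the int loop.
            while (i >= 16) {
                i -= 16;
            }
            // Finish the remaining iterations one at a time, as in the original code.
            while (!decrementAndCheck()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the strided long loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheck() {
            return --i < 0;
        }
    }

If the strided version recovers most of the gap to the int timing, that points at unrolling, rather than 64-bit arithmetic itself, as the dominant cost.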

For completeness, here is the code I actually tried:

    public class foo136 {
        private static int i = Integer.MAX_VALUE;

        public static void main(String[] args) {
            System.out.println("Starting the loop");
            for (int foo = 0; foo < 100; foo++) doit();
        }

        static void doit() {
            i = Integer.MAX_VALUE;
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheck()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheck() {
            return --i < 0;
        }
    }

The assembly dumps were generated with the flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess with your JVM installation for this to work for you as well; you have to put a random shared library (the hsdis disassembler plugin) in exactly the right place or it will fail.

+79
Nov 07 '13 at 19:54

The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. A JVM implementer may use 64-bit words, but the bytecode cannot rely on this, so operations on long or double values have to be handled with extra care. In particular, the JVM's integer branch instructions are defined on exactly the type int.

In the case of your code, a disassembly is instructive. Here is the bytecode for the int version, compiled by the Oracle JDK 7:

    private static boolean decrementAndCheck();
      Code:
         0: getstatic     #14  // Field i:I
         3: iconst_1
         4: isub
         5: dup
         6: putstatic     #14  // Field i:I
         9: ifge          16
        12: iconst_1
        13: goto          17
        16: iconst_0
        17: ireturn

Note that the JVM loads the value of your static i (0), subtracts one (3-4), duplicates the value on the stack (5), and pushes it back into the variable (6). It then does a compare-with-zero branch and returns.

The long version is a bit more complicated:

    private static boolean decrementAndCheck();
      Code:
         0: getstatic     #14  // Field i:J
         3: lconst_1
         4: lsub
         5: dup2
         6: putstatic     #14  // Field i:J
         9: lconst_0
        10: lcmp
        11: ifge          18
        14: iconst_1
        15: goto          19
        18: iconst_0
        19: ireturn

First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it is quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you will notice that the branch logic is longer here. The JVM has no instruction to compare a long with zero, so it has to push the constant 0L onto the stack (9), do a general long comparison (10), and then branch on the result of that calculation.

Here are two likely scenarios:

  • The JVM is following that bytecode path exactly. In this case, it is doing more work in the long version, pushing and popping several extra values, and these live on the virtual, managed stack, not the real CPU stack. If so, you will still see a significant performance difference after warm-up.
  • The JVM realizes that it can optimize that code. In this case, it is taking extra time to optimize away some of the practically unnecessary push/compare logic. If so, you will see very little performance difference after warm-up.

I recommend that you write a correct microbenchmark to eliminate the effect of the JIT kicking in, and also try this with a final condition that is not zero, to force the JVM to do the same comparison on the int that it does with the long.
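As a sketch of that second suggestion (my own illustration; the class name NonZeroBound and the -5 bound are invented), a non-zero bound makes javac emit an explicit two-operand comparison (bipush / if_icmpge) for the int case instead of the compare-with-zero ifge, which brings the int and long bytecode paths closer together:

    public class NonZeroBound {
        private static int i = Integer.MAX_VALUE;

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheck()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheck() {
            // -5 is an arbitrary non-zero bound: javac compiles this to bipush -5 / if_icmpge
            // rather than the single-operand ifge it uses for a comparison against zero.
            return --i < -5;
        }
    }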

+22
Nov 07 '13 at 19:13

The basic unit of data in a Java virtual machine is the word. The choice of word size is left to the JVM implementation, which must pick a word size of at least 32 bits. It may choose a larger word size to gain efficiency, and there is no restriction that a 64-bit JVM must choose a 64-bit word.

The underlying architecture does not dictate that the word size must match either. The JVM reads and writes data word by word. This is why a long may take longer than an int.

Here you can find more information on the same topic.

+8
Nov 07 '13 at 19:03

I've just written a benchmark using caliper.

The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu, or something very similar, is going on.

    timeIntDecrements     195,266,845.000
    timeLongDecrements  2,321,447,978.000

This is my code; note that it uses a freshly built snapshot of caliper, since I could not figure out how to code against their existing beta release.

    package test;

    import com.google.caliper.Benchmark;
    import com.google.caliper.Param;

    public final class App {

        @Param({"" + 1}) int number;

        private static class IntTest {
            public static int v;
            public static void reset() { v = Integer.MAX_VALUE; }
            public static boolean decrementAndCheck() { return --v < 0; }
        }

        private static class LongTest {
            public static long v;
            public static void reset() { v = Integer.MAX_VALUE; }
            public static boolean decrementAndCheck() { return --v < 0; }
        }

        @Benchmark
        int timeLongDecrements(int reps) {
            int k = 0;
            for (int i = 0; i < reps; i++) {
                LongTest.reset();
                while (!LongTest.decrementAndCheck()) { k++; }
            }
            return (int) LongTest.v | k;
        }

        @Benchmark
        int timeIntDecrements(int reps) {
            int k = 0;
            for (int i = 0; i < reps; i++) {
                IntTest.reset();
                while (!IntTest.decrementAndCheck()) { k++; }
            }
            return IntTest.v | k;
        }
    }
+4
Nov 07 '13 at 21:37

For the record, this version does a crude "warm up":

    public class LongSpeed {
        private static long i = Integer.MAX_VALUE;
        private static int j = Integer.MAX_VALUE;

        public static void main(String[] args) {
            for (int x = 0; x < 10; x++) {
                runLong();
                runWord();
            }
        }

        private static void runLong() {
            System.out.println("Starting the long loop");
            i = Integer.MAX_VALUE;
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheckI()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
        }

        private static void runWord() {
            System.out.println("Starting the word loop");
            j = Integer.MAX_VALUE;
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheckJ()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheckI() {
            return --i < 0;
        }

        private static boolean decrementAndCheckJ() {
            return --j < 0;
        }
    }

The total time improves by about 30%, but the ratio between them remains approximately the same.

+1
Nov 07 '13 at 19:47

For the record:

If I use

    boolean decrementAndCheckLong() {
        lo = lo - 1l;
        return lo < -1l;
    }

(changing "l--" to "l = l - 1l"), the long performance improves by ~50%.
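For comparison, here is a minimal sketch (my own; the class and method names are invented) with both variants side by side, so the ~50% difference can be reproduced by swapping which method the loop calls:

    public class DecrementStyle {
        private static long lo = Integer.MAX_VALUE;

        // Original style: prefix decrement of the static field.
        static boolean decrementAndCheckPrefix() {
            return --lo < 0;
        }

        // Rewritten style from above: explicit subtraction, then compare.
        static boolean decrementAndCheckSubtract() {
            lo = lo - 1L;
            return lo < -1L;
        }

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheckSubtract()) {   // swap in decrementAndCheckPrefix() to compare
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished in " + (endTime - startTime) + "ms");
        }
    }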

+1
Nov 07 '13 at 20:35

I don't have a 64-bit machine to test with, but the rather large difference suggests that there is more at work than the slightly longer bytecode.

I see very close times for long / int (4400 vs 4800 ms) on my 32-bit 1.7.0_45.

This is only a guess, but I strongly suspect it is the effect of a misaligned memory access penalty. To confirm or rule out the suspicion, try adding a public static int dummy = 0; before the declaration of i. That will shift i down by 4 bytes in the memory layout and may make it properly aligned for better performance. Confirmed to not be causing the issue.

EDIT: The reasoning behind this is that the VM may not reorder fields at its leisure and add padding for optimal alignment, since that could interfere with JNI (not the case here).
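For reference, a minimal sketch of the suggested probe (my own illustration of the dummy-field idea; the class name AlignmentProbe is invented, and as noted above the experiment did not change the timing):

    public class AlignmentProbe {
        public static int dummy = 0;            // intended to push i down by 4 bytes in the static field layout
        private static long i = Integer.MAX_VALUE;

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            while (!decrementAndCheck()) {
            }
            long endTime = System.currentTimeMillis();
            System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
        }

        private static boolean decrementAndCheck() {
            return --i < 0;
        }
    }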

0
Nov 07 '13 at 19:23