How can I speed up this java code?

Question

How can I speed up this java code?

I am trying to check how fast Java can accomplish a simple task: read a huge file in memory and then do some meaningless data calculations. All types of optimization are taken into account. Regardless of whether he rewrites the code differently or uses a different JVM, spoofing the JIT ..

The input file is a 500-million long list of 32-bit integer pairs, separated by commas. Like this:

44,439.5023
33140,22257
...

This file takes 5.5 GB on my machine. The program cannot use more than 8 GB of RAM and can only use a single stream .

package speedracer; import java.io.FileInputStream; import java.nio.MappedByteBuffer; import java.nio.channels.FileChannel; public class Main { public static void main(String[] args) { int[] list = new int[1000000000]; long start1 = System.nanoTime(); parse(list); long end1 = System.nanoTime(); System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0); int rs = 0; long start2 = System.nanoTime(); for (int k = 0; k < list.length; k++) { rs = calc(list[k++], list[k++], list[k++], list[k]); } long end2 = System.nanoTime(); System.out.println(rs); System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0); } public static int calc(final int a1, final int a2, final int b1, final int b2) { int c1 = (a1 + a2) ^ a2; int c2 = (b1 - b2) << 4; for (int z = 0; z < 100; z++) { c1 ^= z + c2; } return c1; } public static void parse(int[] list) { FileChannel fc = null; int i = 0; MappedByteBuffer byteBuffer; try { fc = new FileInputStream("in.txt").getChannel(); long size = fc.size(); long allocated = 0; long allocate = 0; while (size > allocated) { if ((size - allocated) > Integer.MAX_VALUE) { allocate = Integer.MAX_VALUE; } else { allocate = size - allocated; } byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate); byteBuffer.clear(); allocated += allocate; int number = 0; while (byteBuffer.hasRemaining()) { char val = (char) byteBuffer.get(); if (val == '\n' || val == ',') { list[i] = number; number = 0; i++; } else { number = number * 10 + (val - '0'); } } } fc.close(); } catch (Exception e) { System.err.println("Parsing error: " + e); } } }

I tried everything I could think of. Having tried different readers, I tried openjdk6, sunjdk6, sunjdk7. Tried different readers. We'll have to do some ugly disassembly, since MappedByteBuffer cannot display more than 2 GB of memory at once. I'm runing:

  Linux AS292 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64 GNU/Linux. Ubuntu 11.04. CPU: is Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz.

Currently, my results for parsing: 26.50s, calculations: 11.27s. I am competing with a similar C ++ benchmark that does IO at about the same time, but the calculations only take 4.5 s. My main goal is to reduce the computation time in any way possible. Any ideas?

Update: It seems that a major speed improvement may come from what is called Auto-Vectorization . I was able to find some hints that the current Sun JIT is only doing "some vectology", however I cannot confirm this. It would be great to find some JVMs or JITs that would have better support for auto-vectorization optimization.

+3

java optimization compiler-optimization benchmarking micro-optimization

Zilvinas Sep 16 '11 at 23:27

source share

7 answers

Use the JVM Hotspot in server mode and make sure to warm it up . Also, give enough time for the garbage collection algorithms to stabilize if the build is an important part of your test. I do not see anything at first sight that makes me think that it will be ...

+1

Ryan stewart Sep 16 '11 at 23:47

source share

An interesting question. :-) This is more of a comment, since I will not answer your question, but it is too long for the comment field.

Micro benchmarking in Java is difficult because JIT can work with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its usual optimizations.

Typically, this code will run O (1) times because your main loop does not affect anything:

  for (int k = 0; k < list.length; k++) { rs = calc(list[k++], list[k++], list[k++], list[k]); }

Note that the end result of rs is not really dependent on starting all iterations of the loop; just the last one. You can calculate the final “k” value for a loop without actually starting the loop. Usually JIT notices this and turns your loop into a single task, it is able to detect that the called function (calc) has no side effects (which it does not have).

But somehow this statement in the calc () function messed up the JIT:

  c1 ^= z + c2;

Somehow this adds too much complexity to JIT to decide that all this code at the end doesn’t change anything and that the original loop can be optimized.

If you change this particular statement to something more meaningless, for example:

  c1 = z + c2;

Then JIT selects things and optimizes your loops. Try it :-)

I tried locally with a much smaller dataset, and the calculations “^ =” were ~ 1.6 s, and with the version “=” they took 0.007 seconds (or, in other words, optimized the loop).

As I said, this is not really the answer, but I thought it might be interesting.

+1

vanza Sep 17 '11 at 0:48

source share

Have you tried "inlining" parse () and calc (), i.e. put all the code in main ()?

0

mbatchkarov Sep 17 '11 at 0:16

source share

What is evaluation if you move several lines of your calc function inside your list iteration?
I know this is not very clean, but you will get over the call stack.

 [...] for (int k = 0; k < list.length; k++) { int a1 = list[k++]; int a2 = list[k++]; int b1 = list[k++]; int b2 = list[k]; int c1 = (a1 + a2) ^ a2; int c2 = (b1 - b2) << 4; for (int z = 0; z < 100; z++) { c1 ^= z + c2; } rs = c1; }

0

Destroyica Sep 17 '11 at 0:16

source share

MappedByteBuffer contributes only about 20% of the I / O performance, and this is a huge cost of memory - if it causes a replacement treatment, it is worse than a disease.

I would use BufferedReader around FileReader and possibly Scanner to get integers, or at least Integer.parseInt (), which is likely to have been warmed up by HotSpot than your own conversion code converter.

0

Ejp Sep 17 '11 at 0:17

source share

I am trying to check how fast Java can accomplish a simple task: read a huge file in memory and then do some meaningless calculations in the data.

If the task is to make meaningless calculations, the best optimization is not to do the calculations .

If what you are really trying to do here is to find out if there is a general technique for speeding up calculations, then I think you are barking the wrong tree. There is no such technique. What you learn when optimizing a pointless calculation is hardly applicable to other (hopefully meaningful) calculations.

If the calculation is not meaningless, and the goal is to speed up the whole program, you have probably reached the point where optimization is a waste of time.

Current (Java) - 26.50s + 11.27s = ~ 38 seconds
Target (C ++) - ~ 26.5s + 4.50 = ~ 31 seconds
The acceleration percentage is less than 20%.

An acceleration of less than 20% when calculating ~ 40 seconds is probably not worth the effort. It's cheaper to get the user to spin their thumbs in these extra 7 seconds.

It also tells you something interesting. In this case, in relative terms, it doesn't really matter if you use C ++ or Java. The overall performance of the program is dominated by a phase in which C ++ and Java are comparable.

0

Stephen c Sep 17 '11 at 1:32

source share

Mysticial · Accepted Answer · 2011-09-17T01:29:52+0000

First of all, -O3 includes:

 -finline-functions -ftree-vectorize

among the others...

So it looks like it could be vectorization.

EDIT: This has been confirmed. (see comments) . The C ++ version is indeed encoded by the compiler. With vectoring disabled, the C ++ version actually works a little slower than the Java version

Assuming the JIT doesn't vectorize the loop, it may be difficult / impossible for the Java version to match the speed of the C ++ version.

Now, if I were a smart C / C ++ compiler, here is how I would arrange this loop (on x64):

 int c1 = (a1 + a2) ^ a2; int c2 = (b1 - b2) << 4; int tmp0 = c1; int tmp1 = 0; int tmp2 = 0; int tmp3 = 0; int z0 = 0; int z1 = 1; int z2 = 2; int z3 = 3; do{ tmp0 ^= z0 + c2; tmp1 ^= z1 + c2; tmp2 ^= z2 + c2; tmp3 ^= z3 + c2; z0 += 4; z1 += 4; z2 += 4; z3 += 4; }while (z0 < 100); tmp0 ^= tmp1; tmp2 ^= tmp3; tmp0 ^= tmp2; return tmp0;

Please note that this cycle is completely vectorial.

Even better, I would completely unroll this cycle. This is what the C / C ++ compiler will do. But now the question is, will JIT do this?

How can I speed up this java code?

More articles: