MappedByteBuffer.asFloatBuffer() vs in-memory float[]

Let's say you do some calculation on a large set of large float vectors, e.g. computing the average value of each of them:

    public static float avg(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += data[i];
        }
        return sum / length;
    }

If you have all your vectors stored in a float[] , you can implement the loop as follows:

    float[] data; // <-- vectors here
    float sum = 0;
    for (int i = 0; i < nVectors; i++) {
        sum += avg(data, i * vectorSize, vectorSize);
    }

If your vectors are stored in a file instead, memory mapping should, in theory, be as fast as the first solution once the OS has cached the whole file:

    RandomAccessFile file; // <-- vectors here
    MappedByteBuffer buffer = file.getChannel().map(READ_WRITE, 0, 4 * data.length);
    FloatBuffer floatBuffer = buffer.asFloatBuffer();
    buffer.load(); // <-- this forces the OS to cache the file
    float[] vector = new float[vectorSize];
    float sum = 0;
    for (int i = 0; i < nVectors; i++) {
        floatBuffer.get(vector);
        sum += avg(vector, 0, vector.length);
    }
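To make the snippet above concrete, here is a minimal, self-contained sketch of the same pattern against a temporary file. The class name MappedAvgDemo, the tiny sizes, and the temp-file setup are mine, not from the original benchmark:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedAvgDemo {
    static float avg(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) sum += data[i];
        return sum / length;
    }

    public static void main(String[] args) throws Exception {
        int vectorSize = 4, nVectors = 2;
        File f = File.createTempFile("vectors", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(f, "rw")) {
            // Mapping READ_WRITE beyond the current file size extends the file.
            MappedByteBuffer buffer = file.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, 4 * vectorSize * nVectors);
            FloatBuffer floatBuffer = buffer.asFloatBuffer();
            for (int i = 0; i < vectorSize * nVectors; i++) {
                floatBuffer.put(i + 1f); // vectors [1,2,3,4] and [5,6,7,8]
            }
            floatBuffer.rewind();
            buffer.load(); // ask the OS to page the file in
            float[] vector = new float[vectorSize];
            float sum = 0;
            for (int i = 0; i < nVectors; i++) {
                floatBuffer.get(vector); // bulk copy of one vector
                sum += avg(vector, 0, vector.length);
            }
            System.out.println(sum); // 2.5 + 6.5 = 9.0
        }
    }
}
```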

However, my tests show that the memory-mapped version is ~5 times slower than the in-memory one. I know that FloatBuffer.get(float[]) copies memory, and I think that is the reason for the slowdown. Can it be made faster? Is there a way to avoid the copy entirely and just read my data straight from the OS buffer?

I posted my full test (ArrayVsMMap) here, if you want to try it yourself — just run:

 $ java -Xmx1024m ArrayVsMMap 100 100000 100 

Edit:

In the end, the best I could get out of MappedByteBuffer in this scenario is still ~35% slower than a plain float[] . Tricks so far:

  • use the native byte order to avoid conversion: buffer.order(ByteOrder.nativeOrder())
  • wrap the MappedByteBuffer in a FloatBuffer using buffer.asFloatBuffer()
  • use the simple floatBuffer.get(int index) instead of the bulk version; this avoids copying memory
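The tricks above can be sketched in one place. This is a minimal illustration, assuming a heap buffer instead of an actual mapped file, and the class name IndexedGetDemo is mine; the point is the native byte order plus the absolute get(int), which reads elements in place without a bulk copy:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class IndexedGetDemo {
    // Average one vector via absolute indexed reads: no temporary float[]
    // is filled, and the buffer's position is never touched.
    static float avg(FloatBuffer buf, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += buf.get(i);
        }
        return sum / length;
    }

    public static void main(String[] args) {
        int vectorSize = 4, nVectors = 2;
        ByteBuffer bb = ByteBuffer.allocate(4 * vectorSize * nVectors)
                                  .order(ByteOrder.nativeOrder()); // no byte-order conversion
        FloatBuffer fb = bb.asFloatBuffer();
        for (int i = 0; i < vectorSize * nVectors; i++) {
            fb.put(i, i + 1f); // vectors [1,2,3,4] and [5,6,7,8]
        }
        float sum = 0;
        for (int i = 0; i < nVectors; i++) {
            sum += avg(fb, i * vectorSize, vectorSize);
        }
        System.out.println(sum); // 2.5 + 6.5 = 9.0
    }
}
```

The same avg(FloatBuffer, int, int) works unchanged whether the FloatBuffer comes from a heap ByteBuffer or from MappedByteBuffer.asFloatBuffer().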

You can see the new test and its results here.

A slowdown factor of 1.35 is much better than 5, but it is still far from 1. I am probably still missing something; otherwise it is something in the JVM that could be improved.

2 answers

The time for your array version is ridiculously fast! I get 0.0002 nanoseconds per float. The JVM is almost certainly optimizing the loop away.

This is the problem:

    void iterate() {
        for (int i = 0; i < nVectors; i++) {
            calc(data, i * vectorSize, vectorSize);
        }
    }

The JVM figures out that calc has no side effects, so iterate has none either, and the whole call can be replaced with a no-op. A simple fix is to accumulate the results of calc and return them. You then also need to accumulate the iterate results in the timing loop and print the final value. That prevents the optimizer from deleting all your code.
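A minimal sketch of that fix, with the class name SinkDemo and the sizes chosen by me: every result escapes through a return value and a final println, so the JIT cannot prove the loops dead:

```java
import java.util.Arrays;

public class SinkDemo {
    static float calc(float[] data, int offset, int length) {
        float sum = 0;
        for (int i = offset; i < offset + length; i++) sum += data[i];
        return sum / length;
    }

    // Returning the accumulated sum means the work is observable
    // and cannot be eliminated as dead code.
    static float iterate(float[] data, int vectorSize, int nVectors) {
        float sum = 0;
        for (int i = 0; i < nVectors; i++) {
            sum += calc(data, i * vectorSize, vectorSize);
        }
        return sum;
    }

    public static void main(String[] args) {
        int vectorSize = 16, nVectors = 1000, reps = 100;
        float[] data = new float[vectorSize * nVectors];
        Arrays.fill(data, 1f);
        float sink = 0;
        long t0 = System.nanoTime();
        for (int rep = 0; rep < reps; rep++) {
            sink += iterate(data, vectorSize, nVectors);
        }
        long t1 = System.nanoTime();
        // Printing the sink forces the optimizer to actually compute it.
        System.out.println("sink=" + sink + " ns/rep=" + (t1 - t0) / reps);
    }
}
```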

Edit:

It seems this is probably just overhead on the Java side; it has nothing to do with the memory mapping itself, only with the interface to it. Try the following test, which simply wraps a FloatBuffer around a ByteBuffer around a byte[] :

    private static final class ArrayByteBufferTest extends IterationTest {
        private final FloatBuffer floatBuffer;
        private final int vectorSize;
        private final int nVectors;

        ArrayByteBufferTest(float[] data, int vectorSize, int nVectors) {
            ByteBuffer bb = ByteBuffer.wrap(new byte[data.length * 4]);
            for (int i = 0; i < data.length; i++) {
                bb.putFloat(data[i]);
            }
            bb.rewind();
            this.floatBuffer = bb.asFloatBuffer();
            this.vectorSize = vectorSize;
            this.nVectors = nVectors;
        }

        float iterate() {
            float sum = 0;
            floatBuffer.rewind();
            float[] vector = new float[vectorSize];
            for (int i = 0; i < nVectors; i++) {
                floatBuffer.get(vector);
                sum += calc(vector, 0, vector.length);
            }
            return sum;
        }
    }

Since you do so little work with each float (just an add, perhaps 1 cycle), the cost of reading 4 bytes, constructing a float, and copying it into an array adds up. I noticed that fewer, larger vectors reduce this overhead, at least until a vector grows bigger than the cache (L1?).


In theory, there is no reason to expect the two to perform the same. The mapped solution implies page faults and disk I/O to a completely unpredictable degree; the float[] array does not. You should expect the latter to be faster, except in the special case where the entire file is mapped into memory, you never change it, it stays mapped, and it is never paged out. Most of these factors you can neither control nor predict.
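As a side note on that unpredictability: MappedByteBuffer.load() and isLoaded() are only best-effort hints, not guarantees, so "the file stays resident" is never something the program can enforce. A small sketch (class name LoadDemo and the temp-file setup are mine):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LoadDemo {
    // Map a file and ask the OS to page it in. Both load() and isLoaded()
    // are hints only: the OS may page the data out again at any time.
    static MappedByteBuffer mapAndLoad(File f, long bytes) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            MappedByteBuffer buf = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            buf.load();   // best-effort: touch every page
            return buf;   // the mapping stays valid after the channel closes
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("vectors", ".bin");
        f.deleteOnExit();
        MappedByteBuffer buf = mapAndLoad(f, 4 * 1024); // room for 1024 floats
        System.out.println("resident (hint only): " + buf.isLoaded());
        System.out.println("first float: " + buf.getFloat(0)); // new pages are zero-filled
    }
}
```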

