Why reading FileInputStream is slower with a large array

If I read bytes from a file into a byte[], I can see that FileInputStream's performance is worse when the array is about 1 MB compared to 128 KB. On the two workstations I tested, reading is almost twice as fast with 128 KB. Why is this?

    import java.io.*;

    public class ReadFileInChuncks {
        public static void main(String[] args) throws IOException {
            byte[] buffer1 = new byte[1024 * 128];
            byte[] buffer2 = new byte[1024 * 1024];
            String path = "some 1 gb big file";

            readFileInChuncks(path, buffer1, false);

            readFileInChuncks(path, buffer1, true);
            readFileInChuncks(path, buffer2, true);
            readFileInChuncks(path, buffer1, true);
            readFileInChuncks(path, buffer2, true);
        }

        public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException {
            long t = System.currentTimeMillis();
            InputStream is = new FileInputStream(path);
            while ((readToArray(is, buffer)) != 0) {}
            if (report)
                System.out.println((System.currentTimeMillis() - t) + " ms");
        }

        public static int readToArray(InputStream is, byte[] buffer) throws IOException {
            int index = 0;
            while (index != buffer.length) {
                int read = is.read(buffer, index, buffer.length - index);
                if (read == -1)
                    break;
                index += read;
            }
            return index;
        }
    }

prints

    422 ms
    717 ms
    422 ms
    718 ms

Note that this is a repost of a previously asked question. The other one was contaminated by unrelated discussion, so I will mark it for deletion.

Edit: Duplicate, right? I'm sure I could write better code to prove my point, but this does not answer my question.

Edit 2: I ran the test with every buffer size between 5 KB and 1000 KB on Win7 / JRE 1.8.0_25, and the poor performance starts at exactly 508 KB and every size above it. Apologies for the rough chart: x is the buffer size, y is the time in milliseconds.


java performance io fileinputstream
4 answers

TL;DR: The performance degradation is caused by memory allocation, not by file reading.

A typical benchmarking problem: you are comparing one thing, but actually measuring another.

First of all, when I rewrote the sample code using RandomAccessFile, FileChannel and ByteBuffer.allocateDirect, the threshold disappeared: file reading performance is approximately the same for 128 KB and 1 MB buffers.
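A minimal sketch of such a rewrite, assuming a FileChannel with a reusable direct ByteBuffer (the class name, buffer size, and temp-file demo are illustrative, not the exact benchmark used above):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectChannelRead {
    // Read the whole file through one reusable direct ByteBuffer.
    // A direct buffer lets the OS deliver bytes without the extra
    // native-heap copy that FileInputStream.read(byte[]) performs.
    public static long readAll(Path path, int bufferSize) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize);
        long total = 0;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            int n;
            while ((n = ch.read(buf)) != -1) {
                total += n;
                buf.clear(); // reuse the same buffer for the next chunk
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temp file rather than the 1 GB test file.
        Path tmp = Files.createTempFile("direct-read", ".bin");
        Files.write(tmp, new byte[300_000]);
        long t = System.currentTimeMillis();
        long n = readAll(tmp, 1024 * 1024); // 1 MB buffer, no per-read malloc
        System.out.println(n + " bytes in " + (System.currentTimeMillis() - t) + " ms");
        Files.delete(tmp);
    }
}
```

Because the buffer is allocated once up front and filled in place, the per-call temporary allocation discussed below never happens on this path.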

Unlike direct ByteBuffer I/O, FileInputStream.read cannot load data directly into a Java byte array. It first has to read the data into a native buffer and then copy it into the Java array via the JNI function SetByteArrayRegion.

So we have to look at the native implementation of FileInputStream.read. Here is the relevant snippet from io_util.c:

    if (len == 0) {
        return 0;
    } else if (len > BUF_SIZE) {
        buf = malloc(len);
        if (buf == NULL) {
            JNU_ThrowOutOfMemoryError(env, NULL);
            return 0;
        }
    } else {
        buf = stackBuf;
    }

Here BUF_SIZE == 8192. If the requested length is larger than this reserved stack area, a temporary buffer is allocated with malloc. On Windows, malloc is usually implemented via the HeapAlloc WINAPI call.

Next, I measured the performance of HeapAlloc + HeapFree alone, with no file I/O. The results were interesting:

     128K:   5 μs
     256K:  10 μs
     384K:  15 μs
     512K:  20 μs
     640K:  25 μs
     768K:  29 μs
     896K:  33 μs
    1024K: 316 μs  <-- almost 10x leap
    1152K: 356 μs
    1280K: 399 μs
    1408K: 436 μs
    1536K: 474 μs
    1664K: 511 μs
    1792K: 553 μs
    1920K: 592 μs
    2048K: 628 μs

As you can see, OS memory allocation performance changes dramatically at the 1 MB boundary. This can be explained by the different allocation algorithms used for small and for large chunks.

UPDATE

The documentation for HeapCreate confirms the idea of a separate allocation strategy for blocks larger than 1 MB (see the description of dwMaximumSize):

In addition, the largest block of memory that can be allocated from the heap is slightly less than 512 KB for a 32-bit process and slightly less than 1,024 KB for a 64-bit process.

...

Requests to allocate memory blocks larger than the limit for a fixed-size heap do not automatically fail; instead, the system calls the VirtualAlloc function to obtain the memory that is needed for large blocks.


The optimal buffer size depends on the file system block size, CPU cache size, and cache latency. Most OSes use a block size of 4096 or 8192 bytes, so it is recommended to use a buffer of that size or a multiple of it.
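As a sketch of that recommendation, a small helper could round a desired buffer size up to a multiple of the underlying block size. This is a hypothetical utility, not part of the original benchmark: FileStore.getBlockSize() requires Java 10+, and the 4096-byte fallback is an assumption for stores that do not report a block size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferSizing {
    // Round `desired` up to the next multiple of the file store's block size.
    public static int pickBufferSize(Path p, int desired) throws IOException {
        long block;
        try {
            block = Files.getFileStore(p).getBlockSize(); // Java 10+
        } catch (UnsupportedOperationException e) {
            block = 4096; // assumed common default when the store does not report it
        }
        long rounded = ((desired + block - 1) / block) * block;
        return (int) Math.max(rounded, block);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("probe", ".bin");
        System.out.println("buffer size = " + pickBufferSize(p, 100_000));
        Files.delete(p);
    }
}
```

The rounding keeps every read request aligned with whole file system blocks, which avoids partially filled final blocks on each read call.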


I rewrote the benchmark to try different buffer sizes.

Here is the new code:

    public class ReadFileInChunks {
        public static void main(String[] args) throws IOException {
            String path = "C:\\tmp\\1GB.zip";
            readFileInChuncks(path, new byte[1024 * 128], false);
            for (int i = 1; i <= 1024; i += 10) {
                readFileInChuncks(path, new byte[1024 * i], true);
            }
        }

        public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException {
            long t = System.currentTimeMillis();
            InputStream is = new FileInputStream(path);
            while ((readToArray(is, buffer)) != 0) {
            }
            if (report) {
                System.out.println("buffer size = " + buffer.length / 1024 + "kB , duration = "
                        + (System.currentTimeMillis() - t) + " ms");
            }
        }

        public static int readToArray(InputStream is, byte[] buffer) throws IOException {
            int index = 0;
            while (index != buffer.length) {
                int read = is.read(buffer, index, buffer.length - index);
                if (read == -1) {
                    break;
                }
                index += read;
            }
            return index;
        }
    }

And here are the results ...

    buffer size = 121kB , duration = 320 ms
    buffer size = 131kB , duration = 330 ms
    buffer size = 141kB , duration = 330 ms
    buffer size = 151kB , duration = 323 ms
    buffer size = 161kB , duration = 320 ms
    buffer size = 171kB , duration = 320 ms
    buffer size = 181kB , duration = 320 ms
    buffer size = 191kB , duration = 310 ms
    buffer size = 201kB , duration = 320 ms
    buffer size = 211kB , duration = 310 ms
    buffer size = 221kB , duration = 310 ms
    buffer size = 231kB , duration = 310 ms
    buffer size = 241kB , duration = 310 ms
    buffer size = 251kB , duration = 310 ms
    buffer size = 261kB , duration = 320 ms
    buffer size = 271kB , duration = 310 ms
    buffer size = 281kB , duration = 320 ms
    buffer size = 291kB , duration = 310 ms
    buffer size = 301kB , duration = 319 ms
    buffer size = 311kB , duration = 320 ms
    buffer size = 321kB , duration = 310 ms
    buffer size = 331kB , duration = 320 ms
    buffer size = 341kB , duration = 310 ms
    buffer size = 351kB , duration = 320 ms
    buffer size = 361kB , duration = 310 ms
    buffer size = 371kB , duration = 320 ms
    buffer size = 381kB , duration = 311 ms
    buffer size = 391kB , duration = 310 ms
    buffer size = 401kB , duration = 310 ms
    buffer size = 411kB , duration = 320 ms
    buffer size = 421kB , duration = 310 ms
    buffer size = 431kB , duration = 310 ms
    buffer size = 441kB , duration = 310 ms
    buffer size = 451kB , duration = 320 ms
    buffer size = 461kB , duration = 310 ms
    buffer size = 471kB , duration = 310 ms
    buffer size = 481kB , duration = 310 ms
    buffer size = 491kB , duration = 310 ms
    buffer size = 501kB , duration = 310 ms
    buffer size = 511kB , duration = 320 ms
    buffer size = 521kB , duration = 300 ms
    buffer size = 531kB , duration = 310 ms
    buffer size = 541kB , duration = 312 ms
    buffer size = 551kB , duration = 311 ms
    buffer size = 561kB , duration = 320 ms
    buffer size = 571kB , duration = 310 ms
    buffer size = 581kB , duration = 314 ms
    buffer size = 591kB , duration = 320 ms
    buffer size = 601kB , duration = 310 ms
    buffer size = 611kB , duration = 310 ms
    buffer size = 621kB , duration = 310 ms
    buffer size = 631kB , duration = 310 ms
    buffer size = 641kB , duration = 310 ms
    buffer size = 651kB , duration = 310 ms
    buffer size = 661kB , duration = 301 ms
    buffer size = 671kB , duration = 310 ms
    buffer size = 681kB , duration = 310 ms
    buffer size = 691kB , duration = 310 ms
    buffer size = 701kB , duration = 310 ms
    buffer size = 711kB , duration = 300 ms
    buffer size = 721kB , duration = 310 ms
    buffer size = 731kB , duration = 310 ms
    buffer size = 741kB , duration = 310 ms
    buffer size = 751kB , duration = 310 ms
    buffer size = 761kB , duration = 311 ms
    buffer size = 771kB , duration = 310 ms
    buffer size = 781kB , duration = 300 ms
    buffer size = 791kB , duration = 300 ms
    buffer size = 801kB , duration = 310 ms
    buffer size = 811kB , duration = 310 ms
    buffer size = 821kB , duration = 300 ms
    buffer size = 831kB , duration = 310 ms
    buffer size = 841kB , duration = 310 ms
    buffer size = 851kB , duration = 300 ms
    buffer size = 861kB , duration = 310 ms
    buffer size = 871kB , duration = 310 ms
    buffer size = 881kB , duration = 310 ms
    buffer size = 891kB , duration = 304 ms
    buffer size = 901kB , duration = 310 ms
    buffer size = 911kB , duration = 310 ms
    buffer size = 921kB , duration = 310 ms
    buffer size = 931kB , duration = 299 ms
    buffer size = 941kB , duration = 321 ms
    buffer size = 951kB , duration = 310 ms
    buffer size = 961kB , duration = 310 ms
    buffer size = 971kB , duration = 310 ms
    buffer size = 981kB , duration = 310 ms
    buffer size = 991kB , duration = 295 ms
    buffer size = 1001kB , duration = 339 ms
    buffer size = 1011kB , duration = 302 ms
    buffer size = 1021kB , duration = 610 ms

It seems that some threshold is hit around a buffer size of 1021 kB. Looking deeper into it, I see...

    buffer size = 1017kB , duration = 310 ms
    buffer size = 1018kB , duration = 310 ms
    buffer size = 1019kB , duration = 602 ms
    buffer size = 1020kB , duration = 600 ms

So it seems that some kind of doubling effect kicks in at this threshold. My first thought was that the readToArray while loop was iterating twice as many times once the threshold was reached, but that is not the case: the loop goes through only one iteration whether the run takes 300 ms or 600 ms. So let's look at the actual io_util.c, which implements the read and actually pulls the data from disk, for some hints.

    jint
    readBytes(JNIEnv *env, jobject this, jbyteArray bytes,
              jint off, jint len, jfieldID fid)
    {
        jint nread;
        char stackBuf[BUF_SIZE];
        char *buf = NULL;
        FD fd;

        if (IS_NULL(bytes)) {
            JNU_ThrowNullPointerException(env, NULL);
            return -1;
        }

        if (outOfBounds(env, off, len, bytes)) {
            JNU_ThrowByName(env, "java/lang/IndexOutOfBoundsException", NULL);
            return -1;
        }

        if (len == 0) {
            return 0;
        } else if (len > BUF_SIZE) {
            buf = malloc(len);
            if (buf == NULL) {
                JNU_ThrowOutOfMemoryError(env, NULL);
                return 0;
            }
        } else {
            buf = stackBuf;
        }

        fd = GET_FD(this, fid);
        if (fd == -1) {
            JNU_ThrowIOException(env, "Stream Closed");
            nread = -1;
        } else {
            nread = (jint)IO_Read(fd, buf, len);
            if (nread > 0) {
                (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);
            } else if (nread == JVM_IO_ERR) {
                JNU_ThrowIOExceptionWithLastError(env, "Read error");
            } else if (nread == JVM_IO_INTR) {
                JNU_ThrowByName(env, "java/io/InterruptedIOException", NULL);
            } else { /* EOF */
                nread = -1;
            }
        }

        if (buf != stackBuf) {
            free(buf);
        }

        return nread;
    }

It should be noted that BUF_SIZE is set to 8192, and the doubling effect occurs well above this. So the next culprit to check is the IO_Read method.

 windows/native/java/io/io_util_md.h:#define IO_Read handleRead 

So we move on to the handleRead method:

 windows/native/java/io/io_util_md.c:handleRead(jlong fd, void *buf, jint len) 

This method passes the request on to the Win32 ReadFile function:

    JNIEXPORT
    size_t
    handleRead(jlong fd, void *buf, jint len)
    {
        DWORD read = 0;
        BOOL result = 0;
        HANDLE h = (HANDLE)fd;
        if (h == INVALID_HANDLE_VALUE) {
            return -1;
        }
        result = ReadFile(h,          /* File handle to read */
                          buf,        /* address to put data */
                          len,        /* number of bytes to read */
                          &read,      /* number of bytes read */
                          NULL);      /* no overlapped struct */
        if (result == 0) {
            int error = GetLastError();
            if (error == ERROR_BROKEN_PIPE) {
                return 0; /* EOF */
            }
            return -1;
        }
        return read;
    }

And here the trail goes cold... for now. If I find the code for ReadFile, I will take a look and report back.


This may be due to the CPU cache.

The CPU has its own cache of some fixed size. You can check the size of the L2 cache by running this command in cmd:

    wmic cpu get L2CacheSize

Suppose you have a 256 KB CPU cache. If you read chunks of 256 KB or less, the contents that were written to the buffer are still in the CPU cache when the read accesses them. If your chunks are larger than 256 KB, then only the last 256 KB that were read remain in the CPU cache, so when the next read starts from the beginning of the buffer, the contents have to be fetched from main memory.

