Java multithreading: reading one large file from many threads

What is an efficient way, in a multi-threaded Java application, for many threads to read the same file (> 1 GB in size), each through its own InputStream? I noticed that with many threads (> 32) the system starts to struggle with I/O and spends a lot of time waiting on pending I/O.

I looked at loading the file into a byte array shared by all threads, with each thread creating its own ByteArrayInputStream over it, but allocating a single 1 GB array just won't work.

I also looked at using one FileChannel with each thread creating an InputStream on top of it via Channels.newInputStream(), but the FileChannel keeps a single position that all of those InputStreams share.
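That second attempt looks roughly like the sketch below (the file name is a placeholder). The issue is that every InputStream returned by Channels.newInputStream() reads through the channel's single shared position, so concurrent readers interfere with each other:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedChannelProblem {
    public static void main(String[] args) throws IOException {
        // One channel shared by every thread: its single position is the problem.
        FileChannel channel = FileChannel.open(Path.of("big.dat"), StandardOpenOption.READ);

        Runnable reader = () -> {
            // Each stream wraps the SAME channel, so every read() advances
            // the one shared file position and the threads step on each other.
            InputStream in = Channels.newInputStream(channel);
            try {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // process buf ...
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        };

        new Thread(reader).start();
        new Thread(reader).start();
    }
}
```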

+7
java io concurrency
4 answers

It seems to me that you will need to load the file into memory if you want to avoid I/O contention. The operating system will do some buffering, but if you find that isn't enough, you'll have to do it yourself.

Do you really need 32 threads? Presumably you don't have that many cores, so use fewer threads and you'll get fewer context switches, etc.

Do all your threads process the file from start to finish? If so, can you effectively split the work into pieces? Read the first (say) 10 MB into memory, let all the threads process it, then move on to the next 10 MB, and so on.
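A minimal sketch of that idea, assuming each thread can work on the same in-memory chunk independently (the file name, chunk size, and the process() hook are placeholders):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedProcessing {
    static final int CHUNK_SIZE = 10 * 1024 * 1024; // 10 MB

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try (InputStream in = new FileInputStream("big.dat")) {
            byte[] chunk = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.readNBytes(chunk, 0, CHUNK_SIZE)) > 0) {
                final int length = read;
                List<Callable<Void>> tasks = new ArrayList<>();
                for (int i = 0; i < threads; i++) {
                    tasks.add(() -> { process(chunk, length); return null; });
                }
                // Block until every worker is done with this chunk,
                // so the buffer can safely be reused for the next read.
                pool.invokeAll(tasks);
            }
        } finally {
            pool.shutdown();
        }
    }

    static void process(byte[] data, int length) {
        // placeholder for the real per-thread work
    }
}
```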

If that doesn't work for you, how much memory do you have compared with the file size? If you have plenty of memory but don't want to allocate one huge array, you could read the whole file into memory, but into many separate smaller byte arrays. You would then need to write an InputStream that spans all those byte arrays, but that should be doable.
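One way to sketch this, assuming the file fits in the heap once it is split into moderate-sized arrays (the block size and class name are illustrative): load the file into a list of byte[] blocks, then hand each thread a fresh SequenceInputStream over ByteArrayInputStreams wrapping those blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InMemoryFile {
    private static final int BLOCK_SIZE = 16 * 1024 * 1024; // 16 MB blocks instead of one 1 GB array
    private final List<byte[]> blocks = new ArrayList<>();

    public InMemoryFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] block = new byte[BLOCK_SIZE];
            int read;
            while ((read = in.readNBytes(block, 0, BLOCK_SIZE)) > 0) {
                byte[] copy = new byte[read];
                System.arraycopy(block, 0, copy, 0, read);
                blocks.add(copy);
            }
        }
    }

    /** Each caller gets its own stream; the underlying byte arrays are shared read-only. */
    public InputStream newInputStream() {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] b : blocks) {
            parts.add(new ByteArrayInputStream(b));
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }
}
```

All threads would share one InMemoryFile instance and call newInputStream() whenever they need a fresh stream over the whole file.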

+10

You can open the file several times in read-only mode and access it from as many threads as you like. Just leave the caching to the OS. If that turns out to be too slow, you might consider some kind of chunk-based caching where all threads can access the same cache.
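A minimal sketch of that approach (the file name and thread count are placeholders): every thread simply opens its own read-only stream, and repeated reads of the same blocks are served from the OS page cache.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class OnePerThread {
    public static void main(String[] args) {
        Runnable worker = () -> {
            // Each thread has its own file descriptor and its own position;
            // the OS page cache keeps hot blocks in memory across threads.
            try (InputStream in = new BufferedInputStream(new FileInputStream("big.dat"))) {
                byte[] buf = new byte[64 * 1024];
                while (in.read(buf) != -1) {
                    // process buf ...
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        };
        for (int i = 0; i < 8; i++) {
            new Thread(worker).start();
        }
    }
}
```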

+5

A few ideas:

  • Write a custom implementation of InputStream that acts as a view over a FileChannel. Write it so that it does not rely on any state in the FileChannel (i.e., each instance tracks its own position and reads use the absolute-read methods of the underlying FileChannel). This at least gets you past the problem with Channels.newInputStream(), though it may not solve the I/O-side problems. (See the first sketch after this list.)

  • Write a custom implementation of InputStream that acts as a view over a MappedByteBuffer. Memory mapping doesn't have to be as bad as actually reading the whole thing into memory at once, but it will still take 1 GB of your virtual address space. (See the second sketch after this list.)

  • Same as #1, but with some kind of shared cache layer. I wouldn't try this unless #1 isn't fast enough and #2 isn't feasible; the OS should already be doing some caching for you in #1, so here you would essentially be trying to out-smart the OS file-system cache.
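For the first idea, such a view might look like the sketch below (the class name is hypothetical). Each instance keeps its own position and uses the absolute FileChannel.read(ByteBuffer, long) overload, which does not touch the channel's shared position:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Independent read-only view over a shared FileChannel. Each instance tracks its own position. */
public class FileChannelInputStream extends InputStream {
    private final FileChannel channel;
    private long position;

    public FileChannelInputStream(FileChannel channel) {
        this.channel = channel;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(b, off, len);
        // Absolute read: does not modify the channel's own position.
        int n = channel.read(buf, position);
        if (n > 0) {
            position += n;
        }
        return n;
    }
}
```

The threads would share a single FileChannel opened read-only and each wrap it in its own FileChannelInputStream.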
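For the second idea, a similar view can be built on a mapped buffer (again a sketch with illustrative names). A single mapping is limited to 2 GB, so it covers this file, and each thread gets its own duplicate() so positions stay independent:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFile {
    private final ByteBuffer mapped;

    public MappedFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            // Note: a single mapping cannot exceed Integer.MAX_VALUE bytes.
            this.mapped = ch.map(MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Each thread gets a duplicate buffer with its own independent position. */
    public InputStream newInputStream() {
        ByteBuffer view = mapped.duplicate();
        return new InputStream() {
            @Override
            public int read() {
                return view.hasRemaining() ? view.get() & 0xFF : -1;
            }

            @Override
            public int read(byte[] b, int off, int len) {
                int n = Math.min(len, view.remaining());
                if (n == 0) {
                    return len == 0 ? 0 : -1;
                }
                view.get(b, off, n);
                return n;
            }
        };
    }
}
```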

+1

This is a very large file. Can you get it as a set of smaller files instead? Just delivering a file that size will be a big job, even on a corporate network.

Sometimes it’s easier to change the process than the program.

You might even be better off writing something to split the file into several pieces and process them separately.
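If splitting ahead of time is an option, a small utility along these lines would do it (the piece size and the ".part" naming are illustrative; pieces may overshoot the target size by up to one buffer):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileSplitter {
    public static void split(Path source, long pieceSize) throws IOException {
        byte[] buf = new byte[64 * 1024];
        int piece = 0;
        try (InputStream in = Files.newInputStream(source)) {
            int read = in.read(buf);
            while (read != -1) {
                // Start a new piece and copy into it until it reaches pieceSize or EOF.
                Path part = source.resolveSibling(source.getFileName() + ".part" + piece++);
                try (OutputStream out = Files.newOutputStream(part)) {
                    long written = 0;
                    while (read != -1 && written < pieceSize) {
                        out.write(buf, 0, read);
                        written += read;
                        read = in.read(buf);
                    }
                }
            }
        }
    }
}
```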

0