Which Java API should I use to read files for better performance?

At my workplace we process files with more than a million rows each. Even though the server has more than 10 GB of memory, with 8 GB allocated to the JVM, the server sometimes hangs for a few moments and chokes the other tasks.

I profiled the code and found that while reading the files, memory use frequently climbs into the gigabytes (from 1 GB to 3 GB) and then suddenly drops back to normal. It seems that this frequent rise and fall in memory use is what hangs my server. Of course, this was due to garbage collection.

Which API should be used to read files to improve performance?

Now I am using BufferedReader(new FileReader(...)) to read these CSV files.

Process: how I read the files

  • I read the files line by line.
  • Each line has several columns, which I parse according to their type (the cost column as a double, the visit column as an int, the keyword column as a String, and so on).
  • I push the rows that match my criteria (visit > 0) into a HashMap, and finally clear that map at the end of the task (a minimal sketch of this loop is shown below).
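For reference, here is a minimal sketch of that reading loop. The column order, the value stored per keyword and the class name are assumptions made for illustration; the real file layout is not shown in the question.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvLoader {

    // Reads one CSV file line by line, parses the columns by type and keeps
    // only the rows whose visit count is greater than zero.
    static Map<String, double[]> load(String path) throws IOException {
        Map<String, double[]> eligible = new HashMap<String, double[]>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");   // assumed column order: keyword, visit, cost
                String keyword = cols[0];
                int visits = Integer.parseInt(cols[1]);
                double cost = Double.parseDouble(cols[2]);
                if (visits > 0) {                  // keep only the "eligible" rows
                    eligible.put(keyword, new double[] { visits, cost });
                }
            }
        } finally {
            reader.close();
        }
        return eligible;
    }
}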

Update

I do this reading for 30 or 31 files (one month's data) and store the eligible rows in a map. Later this map is used to look up some culprits in different tables, so reading is a must, and storing the data is also a must. Although I have now switched the HashMap part to BerkeleyDB, the issue while reading the files is the same or even worse.

+4
3 answers

BufferedReader is one of the two best APIs to use for this. If you really had trouble with file reading, an alternative would be to use the facilities in NIO to memory-map your files and then read the contents directly out of memory.
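For illustration, a rough sketch of memory-mapping a file with NIO. The file name and charset are placeholders, and this assumes the file fits in a single mapping (one mapping is limited to 2 GB):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("data.csv", "r"); // placeholder file name
        try {
            FileChannel channel = file.getChannel();
            // Map the whole file into memory.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);                            // read the contents straight out of memory
            String contents = new String(bytes, "UTF-8"); // charset is an assumption
            // ... split 'contents' into lines and parse them here ...
        } finally {
            file.close();
        }
    }
}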

But your problem is not with the reader. Your problem is that every read operation creates a bunch of new objects, most likely in the processing you do right after reading.

You should consider cleaning up your input processing so that it reduces the number and/or size of the objects it creates, or simply lets go of the objects sooner once they are no longer needed. Would it be possible to process your file one line or chunk at a time rather than inhaling the whole thing into memory for processing?

Another possibility would be to tinker with garbage collection. You have two mechanisms:

  • Explicitly calling the garbage collector every so often, say every 10 seconds or every 1000 lines of input or something like that. This will increase the total amount of work the GC does, but each individual collection will take less time, your memory won't swell as much, and hopefully there will be less impact on the rest of the server (a sketch of this is shown after the list).

  • Fiddling with the JVM's garbage collector options. These differ between JVMs, but java -X should give you some hints.
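As an illustration of the first option, here is a hypothetical reading loop that requests a collection every 1000 lines. Note that System.gc() is only a hint to the JVM, and whether this helps at all depends on your workload:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PeriodicGc {

    static void read(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                process(line);               // whatever per-line work you already do
                if (++count % 1000 == 0) {
                    System.gc();             // explicitly suggest a collection every 1000 lines
                }
            }
        } finally {
            reader.close();
        }
    }

    private static void process(String line) {
        // placeholder for the real per-line processing
    }
}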

Update: Looking at your update, the most promising avenue of attack:

Do you really need the whole dataset in memory at one time for processing?

+10

"I profiled the code and found that while reading the files, memory use frequently climbs into the gigabytes (from 1 GB to 3 GB) and then suddenly drops back to normal. It seems that this frequent rise and fall in memory use is what hangs my server. Of course, this was due to garbage collection."

Using BufferedReader(new FileReader(...)) won't cause that.

I suspect the problem is that you are reading the lines/rows into an array or list, processing them and then discarding the array/list. This will cause memory usage to increase and then decrease again. If that is the case, you can reduce memory usage by processing each line/row as you read it.

EDIT: We are agreed that the problem is the space used to represent the file content in memory. An alternative to a huge in-memory hashtable is to go back to the old "sort merge" approach we used when computer memory was measured in kilobytes. (I'm assuming that the processing is dominated by a step where you look up keys K to get the associated rows R.)

  • If necessary, preprocess each of the input files so that they can be sorted on the key K.

  • Use an efficient file sort utility to sort all of the input files into order on K. You want a utility that uses a classic merge-sort algorithm: it will split each file into smaller chunks that can be sorted in memory, sort the chunks, write them to temporary files, and then merge the sorted temporary files. The UNIX/Linux sort utility is a good option.

  • Read the sorted files in parallel, reading all rows that relate to each key value from all of the files, processing them, and then stepping on to the next key value (a sketch of this merge step follows the list).
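A rough sketch of that last step, assuming the files have already been sorted on the key and that the key is the first comma-separated column (both are assumptions, not details from the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SortedMergeRead {

    // One cursor per sorted input file.
    static class Cursor {
        final BufferedReader reader;
        String current;                     // current line, or null at end of file
        Cursor(String path) throws IOException {
            reader = new BufferedReader(new FileReader(path));
            current = reader.readLine();
        }
        String key() { return current.split(",", 2)[0]; }
        void advance() throws IOException { current = reader.readLine(); }
    }

    static void mergeProcess(List<String> paths) throws IOException {
        List<Cursor> cursors = new ArrayList<Cursor>();
        for (String path : paths) {
            cursors.add(new Cursor(path));
        }
        while (true) {
            // Find the smallest key among the files that still have rows.
            String minKey = null;
            for (Cursor c : cursors) {
                if (c.current != null && (minKey == null || c.key().compareTo(minKey) < 0)) {
                    minKey = c.key();
                }
            }
            if (minKey == null) {
                break;                      // all files exhausted
            }
            // Collect and process every row with this key, from every file.
            List<String> rowsForKey = new ArrayList<String>();
            for (Cursor c : cursors) {
                while (c.current != null && c.key().equals(minKey)) {
                    rowsForKey.add(c.current);
                    c.advance();
                }
            }
            processKey(minKey, rowsForKey);
        }
        for (Cursor c : cursors) {
            c.reader.close();
        }
    }

    private static void processKey(String key, List<String> rows) {
        // placeholder: only the rows for one key are in memory at a time
    }
}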

Actually, I'm a bit surprised that using BerkeleyDB didn't help. However, if profiling tells you that most of the time went into building the DB, you may be able to speed it up by sorting the input files (as above!) into ascending key order before building the DB. (When creating a large file-based index, you get better performance if the entries are added in key order.)

+5

Try using the following VM options to tune the GC (and get some GC logging):

 -verbose:gc -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
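These flags go on the java command line when the application is launched, for example (the jar name is just a placeholder, and these particular collector flags apply to older HotSpot JVMs):

 java -verbose:gc -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar your-app.jar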
+1
