Reading large files in Java

I need advice from someone who knows Java and memory issues well. I have a large file (around 1.5 GB), and I need to split it into many smaller files.

I know how to do this (using a BufferedReader), but I would like to know whether you have any memory tips, or hints on how to do it faster.

My file contains text (it is not binary), and there are about 20 characters per line.

+50
java memory-management file
Mar 01 '10 at 13:41
10 answers

Firstly, if your file contains binary data, then using a BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it is text data and you need to split it along line breaks, then a BufferedReader is fine (assuming the file contains lines of reasonable length).

As far as memory is concerned, there should be no problem if you use a decently sized buffer (I would use at least 1 MB to make sure the disk is doing mostly sequential reads and writes).

If speed turns out to be a problem, you can look at the java.nio packages, which are presumably faster than java.io.
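For instance, a sketch of that advice (the 1 MB figure and the file names are just placeholders); both BufferedReader and BufferedWriter accept an explicit buffer size in their constructors:

    int bufferSize = 1 << 20; // 1 MB, per the suggestion above
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream("bigfile.txt"), "UTF-8"), bufferSize);
    BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream("smallfile-0.txt"), "UTF-8"), bufferSize);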

+25
Mar 01 '10 at 13:52

To save memory, do not unnecessarily store or duplicate the data in memory (i.e. do not assign it to variables outside the loop). Just process the output immediately as the input comes in.

It does not really matter whether you use BufferedReader or not. It will not cost significantly more memory, as some implicitly suggest; at most it will shave a few percent off performance. The same applies to using NIO: it improves scalability, not memory use, and only becomes interesting once you have hundreds of threads working on the same file.

Just loop through the file, immediately writing each line to another file as you read it, counting lines as you go; when you reach 100, switch to the next file, and so on.

Kickoff example:

    String encoding = "UTF-8";
    int maxlines = 100;
    BufferedReader reader = null;
    BufferedWriter writer = null;

    try {
        reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
        int count = 0;
        for (String line; (line = reader.readLine()) != null;) {
            if (count++ % maxlines == 0) {
                close(writer);
                writer = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
            }
            writer.write(line);
            writer.newLine();
        }
    } finally {
        close(writer);
        close(reader);
    }
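Here, close() is assumed to be a small null-safe helper along these lines (not shown in the snippet above):

    static void close(Closeable resource) {
        if (resource != null) {
            try {
                resource.close();
            } catch (IOException ignore) {
                // nothing useful to do if the resource cannot be closed
            }
        }
    }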
+26
Mar 01 '10 at 13:44

You can use memory-mapped files through FileChannel.

For large files this is usually much faster. There are performance trade-offs that can make it slower, though, so YMMV.

Related answer: Java NIO FileChannel and FileOutputstream performance / utility
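A rough sketch of the memory-mapped approach (file name invented, error handling omitted):

    RandomAccessFile raf = new RandomAccessFile("/bigfile.txt", "r");
    FileChannel channel = raf.getChannel();
    long size = channel.size();
    // map the whole file read-only; a single MappedByteBuffer is limited to
    // Integer.MAX_VALUE bytes, so files over 2 GB would have to be mapped in chunks
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
    while (buffer.hasRemaining()) {
        byte b = buffer.get();
        // process the byte, e.g. look for '\n' to find line boundaries
    }
    channel.close();
    raf.close();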

+12
Mar 01 '10 at 13:57

This is a very good article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

In general, for excellent performance, you should:

  • Avoid accessing the drive.
  • Avoid access to the underlying operating system.
  • Avoid method calls.
  • Avoid handling bytes and characters separately.

For example, to reduce disk access, you can use a large buffer. The article describes the various approaches.
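As a tiny illustration of the first two points (file name invented): each read() on a bare FileInputStream can go all the way down to the operating system, whereas a BufferedInputStream fetches a large block at a time and serves subsequent reads from memory:

    // unbuffered: roughly one OS-level read per byte
    InputStream slow = new FileInputStream("bigfile.txt");

    // buffered: one OS-level read per 1 MB block (the buffer size is an arbitrary example)
    InputStream fast = new BufferedInputStream(new FileInputStream("bigfile.txt"), 1 << 20);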

+4
Mar 01 '10 at 13:44

Does this need to be done in Java? That is, does it need to be platform independent? If not, I would suggest using the 'split' command in *nix. If you really wanted to, you could execute this command from your Java program. Although I have not tested it, I imagine it would be faster than any Java I/O implementation you could come up with.
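If you went that route, a sketch of invoking it from Java might look like this (assuming a Unix-like system with split on the PATH; the line count and prefix are placeholders):

    // split the file into pieces of 100 lines each, named part-aa, part-ab, ...
    ProcessBuilder pb = new ProcessBuilder("split", "-l", "100", "/path/to/bigfile.txt", "part-");
    pb.inheritIO();                    // forward split's output and errors to this process
    Process process = pb.start();
    int exitCode = process.waitFor();
    if (exitCode != 0) {
        throw new IOException("split failed with exit code " + exitCode);
    }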

+3
Mar 01

You can use java.nio, which is faster than the classic input/output streams:

http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
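For example, channel-to-channel transfers can copy byte ranges between files without pulling the data through a Java-side buffer. A sketch (names and sizes invented; splitting on raw byte counts would cut lines in half, so this only shows the API):

    FileChannel in = new FileInputStream("bigfile.txt").getChannel();
    long chunkSize = 64L * 1024 * 1024; // 64 MB per piece, an arbitrary example
    long position = 0;
    int part = 0;
    while (position < in.size()) {
        FileChannel out = new FileOutputStream("piece-" + (part++) + ".bin").getChannel();
        position += in.transferTo(position, chunkSize, out); // advance by the bytes actually copied
        out.close();
    }
    in.close();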

+1
Mar 01 '10 at 13:44

Yes. I also think that using read() with arguments, such as read(char[] cbuf, int off, int len), is a better way to read such a large file (for example, read(buffer, 0, buffer.length)).

I also ran into the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary input stream, so using a BufferedInputStream is much better in a case like this.
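A bare-bones sketch of that kind of loop (buffer size and file names invented):

    Reader reader = new BufferedReader(new FileReader("bigfile.txt"));
    Writer writer = new BufferedWriter(new FileWriter("copy.txt"));
    char[] buffer = new char[8192];
    int n;
    while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
        writer.write(buffer, 0, n); // only write the characters actually read
    }
    writer.close();
    reader.close();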

+1
Oct 27 '10 at 6:55

Do not use read() without arguments. It is very slow. It is better to read into a buffer and move it to the file quickly.

Use BufferedInputStream because it supports binary reading.
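Something along these lines, as a sketch (names invented); each read() call fills a whole buffer instead of fetching one byte at a time:

    InputStream in = new BufferedInputStream(new FileInputStream("bigfile.bin"));
    OutputStream out = new BufferedOutputStream(new FileOutputStream("copy.bin"));
    byte[] buffer = new byte[64 * 1024];
    int n;
    while ((n = in.read(buffer)) != -1) { // read(byte[]) instead of the per-byte read()
        out.write(buffer, 0, n);
    }
    out.close();
    in.close();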

And that’s all.

0
Mar 01 '10 at 13:44

Unless you accidentally read in the entire input file instead of reading it line by line, your main limitation will be disk speed. You could try starting with a file containing 100 lines and writing it to 100 different files, one line in each, making the triggering mechanism work on the number of lines written to the current file. That program would be easily scalable to your situation.

0
Mar 01

    package all.is.well;

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import junit.framework.TestCase;

    /**
     * @author Naresh Bhabat
     *
     * The following implementation helps to deal with extra-large files in Java.
     * This program has been tested with a 2 GB input file. There are some points
     * where extra logic can be added in the future.
     *
     * Please note: if we want to deal with a binary input file, then instead of
     * reading lines we need to read bytes from the file object. It uses a
     * RandomAccessFile, which is almost like a streaming API.
     *
     * Notes regarding the executor framework:
     * ExecutorService executor = Executors.newFixedThreadPool(10);
     * For 10 threads:    total time required for reading and writing the text: 349.317 seconds
     * For 100 threads:   464.042 seconds
     * For 1000 threads:  466.538 seconds
     * For 10000 threads: 479.701 seconds
     */
    public class DealWithHugeRecordsinFile extends TestCase {

        static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
        static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
        static volatile RandomAccessFile fileToWrite;
        static volatile RandomAccessFile file;
        static volatile String fileContentsIter;
        static volatile int position = 0;

        public static void main(String[] args) throws IOException, InterruptedException {
            long currentTimeMillis = System.currentTimeMillis();
            try {
                fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw"); // for random writes, independent of thread obstacles
                file = new RandomAccessFile(FILEPATH, "r");               // for random reads, independent of thread obstacles
                seriouslyReadProcessAndWriteAsynch();
            } catch (IOException e) {
                e.printStackTrace();
            }
            Thread currentThread = Thread.currentThread();
            System.out.println(currentThread.getName());
            long currentTimeMillis2 = System.currentTimeMillis();
            double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
            System.out.println("Total time required for reading the text in seconds " + time_seconds);
        }

        /**
         * Reads the file line by line and hands each line off to a worker thread.
         */
        public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
            ExecutorService executor = Executors.newFixedThreadPool(10); // see the timings in the class comment
            while (true) {
                final String readLine = file.readLine();
                if (readLine == null) {
                    break;
                }
                Runnable genuineWorker = new Runnable() {
                    @Override
                    public void run() {
                        // do the hard processing here in this thread; some time is consumed
                        // and some exceptions are swallowed in the write method
                        writeToFile(FILEPATH_WRITE, readLine);
                        // System.out.println(" :" + Thread.currentThread().getName());
                    }
                };
                executor.execute(genuineWorker);
            }
            executor.shutdown();
            while (!executor.isTerminated()) {
                // busy-wait until all workers have finished
            }
            System.out.println("Finished all threads");
            file.close();
            fileToWrite.close();
        }

        /**
         * @param filePath the file to write to
         * @param data     the line to write
         */
        private static void writeToFile(String filePath, String data) {
            try {
                // fileToWrite.seek(position);
                data = "\n" + data;
                if (!data.contains("Randomization")) {
                    return;
                }
                System.out.println("Let us do something time consuming to make this thread busy" + (position++) + " :" + data);
                System.out.println("Lets consume through this loop");
                int i = 1000;
                while (i > 0) {
                    i--;
                }
                fileToWrite.write(data.getBytes());
                throw new Exception();
            } catch (Exception exception) {
                System.out.println("exception was thrown but still we are able to proceed further"
                        + "\nThis can be used for marking failure of the records");
                // exception.printStackTrace();
            }
        }
    }
0
09 Oct '16 at 5:13
