Removing a prefix from every line of a large file in place (Java)

I have a large 250 GB .txt file and I have only 50 GB of space left on my hard drive. Each line in this .txt file has a long prefix, and I want to remove this prefix to make this file smaller.

At first I wanted to read the file line by line, modify each line, and write it to another file:

    // read line out of first file
    line = line.replace(prefix, "");
    // write line into second file

The problem is that I do not have enough space for this.

So how can I remove all the prefixes from my file?

3 answers

Check RandomAccessFile: http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html

You need to track the position you are reading from and the position you are writing to. Initially, both are at the beginning of the file. Then you read N bytes (one line), shorten it, seek back N bytes and write M bytes (the shortened line). Then you seek forward (N - M) bytes to return to the position where the next line begins. Then you do it again and again. Finally, truncate the excess with setLength(long).

You can also do this in batches (e.g. read 4kb, process, write, repeat) to make it more efficient.
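
To make the read position / write position idea concrete, here is a rough sketch of the batched variant working on raw bytes. It assumes an ASCII-compatible encoding (ASCII, ISO-Latin-1 or UTF-8), "\n" as the line terminator, and a prefix that contains no newline. The names InPlacePrefixStripper, stripPrefix and chunkSize are made up for illustration and do not come from any library. Lines that do not start with the prefix are kept unchanged. Treat it as a starting point under those assumptions, and try it on a copy of a small file first.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class InPlacePrefixStripper {

    /** Removes `prefix` from the start of every line of `path`, rewriting the file in place. */
    static void stripPrefix(String path, byte[] prefix, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (prefix.length == 0) {
            return; // nothing to remove
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            byte[] in = new byte[chunkSize];
            // One chunk emits at most its own bytes plus a partial prefix held over from the last chunk.
            byte[] out = new byte[chunkSize + prefix.length];
            long readPos = 0;  // offset of the next unread byte
            long writePos = 0; // offset where the next kept byte goes; never ahead of readPos
            int matched = 0;   // prefix bytes matched on this line, or -1 once the line is decided

            while (true) {
                file.seek(readPos);
                int n = file.read(in);
                if (n <= 0) {
                    break; // end of file
                }
                readPos += n;

                int outLen = 0;
                for (int i = 0; i < n; i++) {
                    byte b = in[i];
                    if (matched >= 0) {
                        if (b == prefix[matched]) {
                            matched++; // hold the byte back, it may belong to the prefix
                            if (matched == prefix.length) {
                                matched = -1; // whole prefix seen: drop it
                            }
                            continue;
                        }
                        // Mismatch: the held bytes were ordinary content, so emit them.
                        System.arraycopy(prefix, 0, out, outLen, matched);
                        outLen += matched;
                        matched = -1;
                    }
                    out[outLen++] = b;
                    if (b == '\n') {
                        matched = 0; // next byte starts a new line
                    }
                }
                file.seek(writePos);
                file.write(out, 0, outLen);
                writePos += outLen;
            }
            if (matched > 0) {
                // The file ended in the middle of a possible prefix: keep those bytes.
                file.seek(writePos);
                file.write(prefix, 0, matched);
                writePos += matched;
            }
            file.setLength(writePos); // cut off the stale tail
        }
    }
}
```

A call could look like InPlacePrefixStripper.stripPrefix("your-file.txt", prefixBytes, 1 << 20), where the file name and the 1 MB chunk size are placeholders. Because every byte written corresponds to a byte already read, the write position can never overtake the read position.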

The process is identical in all languages. Some simplify things by hiding the back-and-forth seeks behind an API.

Of course, you must be absolutely sure that your program works flawlessly, as there is no way to undo this process.

In addition, RandomAccessFile is a bit limited because it cannot tell you which character position the file is currently at; it only works in bytes. Therefore, you need to convert between the "decoded strings" and the "encoded bytes" along the way. If your file is in UTF-8, a given character in a string can take up one or more bytes in the file, so you cannot just seek by string.length(). You should use string.getBytes(encoding).length instead and account for possible line-break conversions (Windows uses two characters for a line break, Unix uses only one). But if you have ASCII, ISO-Latin-1 or a similarly trivial character encoding and know which line-break characters the file uses, the problem should be pretty simple.
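
A small illustration of the difference between the character count and the byte count, in case it helps. The class and method names (EncodedLength, utf8Length) are invented for this example, and the sample string is arbitrary; the non-ASCII letters are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLength {

    /** Number of bytes the string occupies in a UTF-8 encoded file. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String line = "na\u00EFve caf\u00E9"; // two letters outside ASCII
        System.out.println(line.length());    // chars in the decoded string: 10
        System.out.println(utf8Length(line)); // bytes on disk in UTF-8: 12
        // A Windows line break adds two bytes, a Unix one adds only one.
        System.out.println(utf8Length("\r\n") - utf8Length("\n")); // 1
    }
}
```

Seeking by line.length() here would land two bytes short, in the middle of the line.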

Now that I am editing my answer to cover all the corner cases, I think it would be better to read the file with a BufferedReader, specifying the character encoding, and to open a RandomAccessFile on the same file for writing, provided your OS supports opening the file twice. This way you get full Unicode support from BufferedReader, and you won't need to keep track of read and write positions. The writing has to go through RandomAccessFile, because opening a Writer on the file may simply truncate it (although I haven't tried it).

Something like this. It works on trivial examples, but it has no error checking, and I give absolutely no guarantees. Try it on a smaller file first.

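    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.RandomAccessFile;

    public class StripPrefix {
        public static void main(String[] args) throws IOException {
            File f = new File(args[0]);
            // Read through a BufferedReader so decoding is handled for us.
            // Use correct encoding here.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            // Write through a RandomAccessFile so opening does not truncate the file.
            RandomAccessFile writer = new RandomAccessFile(f, "rw");
            String line = null;
            long totalWritten = 0;
            while ((line = reader.readLine()) != null) {
                line = line.trim() + "\n"; // Remove your prefix here.
                byte[] b = line.getBytes("UTF-8");
                writer.write(b);
                totalWritten += b.length;
            }
            reader.close();
            // Cut off the stale bytes left after the last rewritten line.
            writer.setLength(totalWritten);
            writer.close();
        }
    }
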

You can use RandomAccessFile. It allows you to overwrite parts of a file. And since its Javadoc mentions no copying or caching mechanism, this should work without additional disk space.

This way you can overwrite unnecessary parts with spaces.
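
As a rough sketch of what overwriting with spaces could look like: the helper below blanks out a byte range in place. SpaceOverwriter and blankOut are hypothetical names, and the offsets are assumed to be byte offsets that you have already worked out for each prefix. Note that overwriting this way leaves the file length unchanged; the prefix bytes simply become spaces.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

final class SpaceOverwriter {

    /** Overwrites `length` bytes starting at byte `offset` with ASCII spaces. */
    static void blankOut(String path, long offset, int length) throws IOException {
        byte[] spaces = new byte[length];
        Arrays.fill(spaces, (byte) ' ');
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(offset);
            // Replaces the existing bytes; nothing is inserted or shifted.
            // Writing past the current end of the file would extend it.
            file.write(spaces);
        }
    }
}
```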


Since this does not need to be done in Java, I would recommend Python for this:

Save the following in replace.py in the same folder with your text file:

    import fileinput
    import sys

    for line in fileinput.input("your-file.txt", inplace=True):
        # Each line keeps its trailing newline, so write it back without adding another.
        sys.stdout.write(line.replace("oldstring", "newstring"))

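Replace the two strings with your own and run python replace.py. Note that fileinput with inplace=True works by moving the original to a backup file and writing the output to a new file, so it needs free space for the whole output while it runs.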
