How to handle huge text files containing EOF / Ctrl-Z characters using Python on Windows?

I have some large comma-separated text files (the largest is about 15 GB) that I need to process with a Python script. The problem is that the files sporadically contain DOS EOF characters (Ctrl-Z) in the middle of them. (Don't ask me why; I did not create them.) The other problem is that the files are on a computer running Windows.

On Windows, when my script encounters one of these characters, it assumes that it is at the end of the file and stops processing. For various reasons, I am not allowed to copy files to another machine. But I still need to process them.

Here are my ideas:

  • Read the file in binary mode, throwing out bytes equal to chr(26). This would work, but it would take approximately forever.
  • Use something like sed to strip the EOF characters. Unfortunately, as far as I can tell, sed for Windows has the same problem and quits when it sees EOF.
  • Use a Notepad-type program and do a search and replace. But it turns out that Notepad-type programs do not cope well with 15-gigabyte files.

My IDEAL solution would be some way to just read the file as text and just ignore the Ctrl-Z characters. Is there any reasonable way to do this?

1 answer

It is easy to use Python to remove the DOS EOF characters; e.g.,

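    import sys

    def delete_eof(fin, fout):
        """Copy fin to fout, dropping every DOS EOF (Ctrl-Z) byte."""
        BUFSIZE = 2**15
        EOFCHAR = b"\x1a"  # chr(26), as a bytes literal so it also runs on Python 3
        data = fin.read(BUFSIZE)
        while data:
            # translate(None, ...) deletes the listed bytes and keeps the rest
            fout.write(data.translate(None, EOFCHAR))
            data = fin.read(BUFSIZE)

    ipath = sys.argv[1]
    opath = ipath + ".new"
    # Binary mode, so Windows does not treat Ctrl-Z as end of file
    with open(ipath, "rb") as fin, open(opath, "wb") as fout:
        delete_eof(fin, fout)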

This takes the file path as the first argument and copies the file, minus any chr(26) bytes, to the same path with .new appended. Adjust to taste.
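If writing a second 15 GB copy is undesirable, the same binary-mode idea can probably be applied while reading, so the cleaned lines go straight to the CSV parser. The following is only a sketch for Python 3: the function name iter_clean_lines and the latin-1 default encoding are my assumptions, not something taken from the question, so substitute the real encoding of your files.

```python
import csv

EOF_BYTE = b"\x1a"  # DOS EOF / Ctrl-Z


def iter_clean_lines(path, encoding="latin-1"):
    """Yield decoded text lines from path with Ctrl-Z bytes removed.

    Hypothetical helper. The latin-1 default is an assumption, chosen only
    because it can decode any byte value without raising.
    """
    # Binary mode: the file is read byte for byte, and iterating it
    # splits on b"\n", so only one line is held in memory at a time.
    with open(path, "rb") as fin:
        for raw_line in fin:
            yield raw_line.replace(EOF_BYTE, b"").decode(encoding)


def iter_rows(path, encoding="latin-1"):
    """Parse the cleaned lines as comma-separated rows."""
    # csv.reader accepts any iterable of strings, not only file objects.
    return csv.reader(iter_clean_lines(path, encoding))
```

Looping over iter_rows(path) then gives one list of fields per line, without an intermediate file.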

By the way, are you sure that DOS EOF characters are your only problem? It is difficult to imagine a sane scenario in which they would appear in files that are meant to be processed as text.
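One way to check is to count every control byte in the file before deciding what to strip. This is a rough Python 3 sketch; the function name count_control_bytes and the choice to treat tab, LF and CR as harmless are my own assumptions.

```python
from collections import Counter

# Control bytes below 0x20 that are normal in text: tab, LF, CR (an assumption).
HARMLESS = (0x09, 0x0A, 0x0D)
SUSPICIOUS = bytes(b for b in range(0x20) if b not in HARMLESS)
NOT_SUSPICIOUS = bytes(b for b in range(256) if b not in SUSPICIOUS)


def count_control_bytes(path, bufsize=2**20):
    """Return a Counter mapping each suspicious byte value to its count."""
    counts = Counter()
    with open(path, "rb") as fin:
        chunk = fin.read(bufsize)
        while chunk:
            # Delete everything that is not suspicious, so only the few
            # odd bytes (if any) are left to be counted in Python.
            counts.update(chunk.translate(None, NOT_SUSPICIOUS))
            chunk = fin.read(bufsize)
    return counts
```

If the result contains keys other than 26, for example 0 for NUL bytes, the files likely hold more than stray Ctrl-Z characters, and deleting chr(26) alone may not be enough.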
