How to handle huge text files containing EOF / Ctrl-Z characters using Python on Windows?

Question

I have some large comma-separated text files (the largest is about 15 GB) that I need to process with a Python script. The problem is that the files sporadically contain DOS EOF characters (Ctrl-Z) in the middle of them. (Do not ask me why, I did not create them.) Another problem is that the files are on a computer running Windows.

On Windows, when my script encounters one of these characters, it assumes that it is at the end of the file and stops processing. For various reasons, I am not allowed to copy files to another machine. But I still need to process them.

Here are my ideas:

Read the file in binary mode, throwing out bytes equal to chr(26). This will work, but it will take just about forever.
Use something like sed to exclude EOF characters. Unfortunately, as far as I can tell, sed for Windows has the same problem and will quit when it sees EOF.
Use some Notepad program and do a search and replace. But it turns out that Notepad-type programs do not cope well with 15 GB files.

My IDEAL solution would be some way to just read the file as text and just ignore the Ctrl-Z characters. Is there any reasonable way to do this?

python windows text sed eof

Joel Dec 20 '13 at 2:29

1 answer

Tim Peters · Accepted Answer · 2013-12-20T02:53:36+0000

It is easy to use Python to remove DOS EOF characters; e.g.,

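    import sys

    def delete_eof(fin, fout):
        """Copy fin to fout, dropping every DOS EOF (Ctrl-Z, 0x1A) byte."""
        BUFSIZE = 2**15
        EOFCHAR = b"\x1a"  # chr(26); a bytes literal works on Python 2 and 3
        data = fin.read(BUFSIZE)
        while data:
            # translate(None, ...) deletes the listed bytes and keeps the rest
            fout.write(data.translate(None, EOFCHAR))
            data = fin.read(BUFSIZE)

    ipath = sys.argv[1]
    opath = ipath + ".new"
    # Binary mode on both sides, so Windows never interprets Ctrl-Z as EOF
    with open(ipath, "rb") as fin, open(opath, "wb") as fout:
        delete_eof(fin, fout)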

This takes a file path as its first argument and copies the file, minus the chr(26) bytes, to the same path with .new appended. Adjust to taste.
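
If writing a second 15 GB copy is undesirable, the same chunked approach can in principle feed lines straight to the processing code. The following is only a sketch of that variation, not part of the code above: the name iter_clean_lines, the default buffer size, and the UTF-8 encoding are all assumptions made for illustration, and it assumes an ASCII-compatible encoding in which the byte 0x1A never occurs inside a multi-byte character (true for UTF-8, not for UTF-16).

```python
import io

EOF_BYTE = b"\x1a"  # DOS EOF / Ctrl-Z

def iter_clean_lines(binary_stream, bufsize=2**15, encoding="utf-8"):
    """Yield decoded lines from a binary stream with Ctrl-Z bytes removed.

    Hypothetical helper: the name, bufsize and encoding are illustrative
    assumptions, not taken from the original answer.
    """
    pending = b""  # bytes of a line that is not complete yet
    while True:
        chunk = binary_stream.read(bufsize)
        if not chunk:
            break
        # Drop Ctrl-Z bytes, then cut the buffered data at newlines
        pending += chunk.translate(None, EOF_BYTE)
        lines = pending.split(b"\n")
        pending = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line.rstrip(b"\r").decode(encoding)
    if pending:
        # Final line without a trailing newline
        yield pending.rstrip(b"\r").decode(encoding)
```

It would be used on a file opened in binary mode, for example `with open(ipath, "rb") as fin:` followed by `for line in iter_clean_lines(fin): ...`, so that Ctrl-Z is never interpreted as end of file while reading.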

By the way, are you sure that DOS EOF characters are your only problem? It is hard to imagine a sane way in which they could end up in files that are meant to be processed as text.
