Split large files with Python

I am having some trouble trying to split large files (say, about 10 GB). The basic idea is simply to read the lines and group them, say, 40,000 lines per output file. But there are two ways of “reading” files.

1) The first is to read the WHOLE file at once and turn it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file that is too large. (I think I have asked similar questions before.) In Python, the approaches I have tried for reading the WHOLE file at once include:

    input1 = f.readlines()
    input1 = commands.getoutput('zcat ' + file).splitlines(True)
    input1 = subprocess.Popen(["cat", file], stdout=subprocess.PIPE, bufsize=1)

Well, then I can simply group 40,000 lines into a single file: list[40000:80000] or list[80000:120000]. The advantage of using a list is that we can easily point to specific lines.
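For illustration, a minimal sketch of that slicing approach, assuming the whole file really does fit in memory (the file names are just placeholders):

    # Minimal sketch of the list/slicing approach -- only viable if the
    # whole file fits in memory. File names are placeholders.
    NUM_OF_LINES = 40000
    with open('myinput.txt') as f:
        lines = f.readlines()                 # the WHOLE file as a list of lines
    for n, start in enumerate(range(0, len(lines), NUM_OF_LINES)):
        with open('output%d.txt' % n, 'w') as out:
            out.writelines(lines[start:start + NUM_OF_LINES])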

2) The second method is to read line by line, processing each line as it is read. Lines that have already been read are not kept in memory. Examples include:

    f = gzip.open(file)
    for line in f:
        blablabla...

or

 for line in fileinput.FileInput(fileName): 

I am sure that what gzip.open returns is not a list but a file object. And it seems that we can only process it line by line; so how can I do this “split” job? How can I point to specific lines of a file object?

thanks

+7
5 answers
    NUM_OF_LINES = 40000
    filename = 'myinput.txt'
    with open(filename) as fin:
        fout = open("output0.txt", "wb")
        for i, line in enumerate(fin):
            fout.write(line)
            if (i + 1) % NUM_OF_LINES == 0:
                fout.close()
                fout = open("output%d.txt" % (i / NUM_OF_LINES + 1), "wb")
        fout.close()
+11

If there is nothing special about having exactly a certain number of lines in each file, the readlines() function also accepts a size "hint" parameter that behaves like this:

If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

... so you could write your code something like this:

    # assume that an average line is about 80 chars long, and that we want about
    # 40K lines in each file.
    SIZE_HINT = 80 * 40000

    fileNumber = 0
    with open("inputFile.txt", "rt") as f:
        while True:
            buf = f.readlines(SIZE_HINT)
            if not buf:
                # we've read the entire file in, so we're done.
                break
            outFile = open("outFile%d.txt" % fileNumber, "wt")
            outFile.writelines(buf)
            outFile.close()
            fileNumber += 1
+4
    import fileinput

    chunk_size = 40000
    fout = None
    for i, line in enumerate(fileinput.FileInput(filename)):
        if i % chunk_size == 0:
            if fout:
                fout.close()
            fout = open('output%d.txt' % (i / chunk_size), 'w')
        fout.write(line)
    fout.close()
+3

For a 10 gigabyte file, the second approach is definitely the way to go. Here is a brief description of what you need to do (with a rough sketch in code after the list):

  1. Open the input file.
  2. Open the first output file.
  3. Read one line from the input file and write it to the output file.
  4. Keep a count of how many lines you have written to the current output file; as soon as it reaches 40,000, close the output file and open the next one.
  5. Repeat steps 3-4 until you reach the end of the input file.
  6. Close both files.
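A rough sketch of those steps in code might look like this (the input and output file names are made up for illustration):

    CHUNK = 40000
    file_number = 0
    line_count = 0
    fout = open('output%d.txt' % file_number, 'w')   # step 2: first output file
    with open('input.txt') as fin:                   # step 1: the input file
        for line in fin:                             # steps 3 and 5: one line at a time
            fout.write(line)
            line_count += 1
            if line_count == CHUNK:                  # step 4: roll over after 40,000 lines
                fout.close()
                file_number += 1
                line_count = 0
                fout = open('output%d.txt' % file_number, 'w')
    fout.close()                                     # step 6: close the last output file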
+2

Obviously, whatever you do with the file, you will need to iterate over its contents in some way - whether you do that manually or let part of the Python API do it for you (e.g. readlines()) does not matter. In big-O terms, this means you will spend O(n) time (n being the size of the file).

But reading the file into memory also requires O(n) space. Although we sometimes do need to read a 10 gigabyte file into memory, your particular problem does not require it. We can simply iterate over the file object directly. Of course, the file object itself requires some space, but there is no reason to hold the contents of the file in two different forms at once.
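If you do need to "point to" a specific range of lines, itertools.islice works directly on a file object without building a list; a hypothetical sketch (the file name and process() are placeholders):

    import itertools

    # Grab lines 40000 through 79999 (0-based) without loading the rest into memory.
    with open('input.txt') as f:
        for line in itertools.islice(f, 40000, 80000):
            process(line)   # placeholder for whatever you do with each line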

So I would go with the second approach.

0
