Just ran a quick test with an 8 million line file (uptime lines) to count the lines and split the file in half. Basically, one pass to get the number of lines, a second pass to write each half.
On my system, the first pass took about 2-3 seconds. The total time to complete the run and write the split file(s) was under 21 seconds.
I did not run the lambda functions from the OP's message. The code I used is below:
```python
#!/usr/bin/env python

# First pass: count the lines.
infile = open("input", "r")
linecount = 0
for line in infile:
    linecount = linecount + 1
splitpoint = linecount // 2
infile.close()

# Second pass: write each half to its own file.
infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")
print(linecount, splitpoint)
linecount = 0
for line in infile:
    linecount = linecount + 1
    if linecount <= splitpoint:
        outfile1.write(line)
    else:
        outfile2.write(line)
infile.close()
outfile1.close()
outfile2.close()
```
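The same two passes can be sketched a bit more idiomatically with context managers, which also guarantees the file handles are closed if something fails midway. The function name `split_file` and the file names are illustrative, not from the original post:

```python
def split_file(src, dst1, dst2):
    """Split src in half by line count: one pass to count,
    a second pass to copy each half to dst1/dst2."""
    with open(src) as infile:
        linecount = sum(1 for _ in infile)  # pass 1: count lines
    splitpoint = linecount // 2
    with open(src) as infile, \
         open(dst1, "w") as out1, \
         open(dst2, "w") as out2:
        for i, line in enumerate(infile, start=1):  # pass 2: copy
            (out1 if i <= splitpoint else out2).write(line)
    return linecount, splitpoint
```

For an odd line count this puts the extra line in the second file, matching the original script's integer division.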
No, it won't win any prizes for performance or code style. :) But unless something else is the bottleneck, the lambda functions cause the file to be cached in memory and trigger swapping, or the lines in the file are extremely long, I don't understand why it takes 30 minutes to read/split an 8 million line file.
EDIT:
My environment: Mac OS X; storage was a single FW800 hard drive. The file was created fresh to avoid the benefit of file-system caching.