A quick method in Python to split a large text file using the number of lines as an input variable

I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines, except the last one.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

The main problem is the speed of this function. Splitting a single file of 8 million lines into two files of 4 million lines each takes more than 30 minutes on my Windows OS with Python 2.7.

4 answers
for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)

opens the file and writes a single line, once for each line in the group. That is slow.

Instead, write once for each group.

with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))
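
For context, the original function with that one change applied might look roughly like this (a sketch that keeps the OP's grouping and path logic, not benchmarked):

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            # Open each chunk file once and write the whole group in a single call,
            # instead of reopening the file for every line.
            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))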

Just ran a quick test with an 8 million line file (lines of uptime output) to get the file length and split the file in half. Basically, one pass to get the number of lines, and a second pass to write out the split.

On my system, the first pass took about 2-3 seconds. The total time to complete both passes and write the split files was under 21 seconds.

I did not use the lambda functions from the OP's post. The code used is below:

#!/usr/bin/env python
import sys
import math

infile = open("input", "r")
linecount = 0
for line in infile:
    linecount = linecount + 1
splitpoint = linecount / 2
infile.close()

infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")

print linecount, splitpoint

linecount = 0
for line in infile:
    linecount = linecount + 1
    if linecount <= splitpoint:
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()

No, it won't win any prizes for performance or elegance. :) But unless something else is the performance bottleneck, the lambda functions are causing the file to be cached in memory and forcing swapping, or the lines in the file are extremely long, I don't see why it should take 30 minutes to read and split an 8 million line file.

EDIT:

My environment: Mac OS X, storage was one FW800 hard drive. The file was created fresh to avoid the benefits of file system caching.


You can use tempfile.NamedTemporaryFile directly as a context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False, dir=temp_dir,
                                             prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6 + 1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0

On my computer, the tempfile_split part runs in about 3.6 seconds. This is on OS X.
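
If it helps, here is a small usage sketch (my addition, assuming the tempfile_split above has already been defined) that walks the returned dict of chunk paths and cleans the files up afterwards:

import os

# chunk_files maps chunk index -> temp file path, as returned by tempfile_split
chunk_files = tempfile_split(fn, tempfile.mkdtemp())

for k in sorted(chunk_files):
    path = chunk_files[k]
    print k, path, os.path.getsize(path)

# remove the chunks once they are no longer needed
for path in chunk_files.values():
    os.remove(path)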


If you are in a Linux or Unix environment, you can cheat a little and use the split command from within Python. It does the trick for me, and it is very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
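
A hypothetical usage sketch (my addition; the path and chunk size are made up): split with -a 2 appends two-letter suffixes ('aa', 'ab', ...) to the given prefix, which here is the directory of the original file plus '/', so the chunks can be collected afterwards like this:

import glob
import os

# hypothetical input path and chunk size
split_file('/tmp/data/bigfile.txt', chunk=4000000)

# the chunks end up as two-character names ('aa', 'ab', ...) in the same directory
chunks = sorted(glob.glob(os.path.join('/tmp/data', '??')))
print chunks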
