Efficient way to split a large text file in python

This follows up on a previous question about improving the time performance of a Python function: I need to find an efficient way to split my text file.

I have the following text file (over 32 GB), not sorted:

    ...
    0 274 593869.99 6734999.96 121.83 1
    0 273 593869.51 6734999.92 121.57 1
    0 273 593869.15 6734999.89 121.57 1
    0 273 593868.79 6734999.86 121.65 1
    0 272 593868.44 6734999.84 121.65 1
    0 273 593869.00 6734999.94 124.21 1
    0 273 593868.68 6734999.92 124.32 1
    0 274 593868.39 6734999.90 124.44 1
    0 275 593866.94 6734999.71 121.37 1
    0 273 593868.73 6734999.99 127.28 1
    ...

The first and second columns are the identifier (e.g. 0 273) of the grid tile that contains the point x, y, z.

    def point_grid_id(x, y, minx, maxy, distx, disty):
        """give id (row, col)"""
        col = int((x - minx) / distx)
        row = int((maxy - y) / disty)
        return (row, col)

where (minx, maxy) is the origin (top-left corner) of my grid and distx, disty is the tile size. The tile ids are:

    tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
    # tiles_id = [(0,0), (0,1), (0,2), ..., (ny-1, nx-1)]
    n = len(tiles_id)
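To make the setup concrete, here is a minimal self-contained sketch; the grid origin, tile size and grid dimensions below are placeholder values, not my real ones:

    import numpy as np

    def point_grid_id(x, y, minx, maxy, distx, disty):
        """give id (row, col) of the tile containing point (x, y)"""
        col = int((x - minx) / distx)
        row = int((maxy - y) / disty)
        return (row, col)

    # placeholder grid: origin, tile size and number of tiles are made up
    minx, maxy = 593800.0, 6735000.0
    distx, disty = 0.25, 0.25
    nx, ny = 400, 400

    tiles_id = [j for j in np.ndindex(ny, nx)]  # [(0,0), (0,1), ..., (ny-1, nx-1)]
    n = len(tiles_id)

    # tile id of one sample point from the file, using the placeholder grid above
    x, y = 593869.99, 6734999.96
    print(point_grid_id(x, y, minx, maxy, distx, disty))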

I need to split the 32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but it means reading the file n times. For this reason I want to find an efficient way to split the file, starting from tile (0,0) (= tiles_id[0]). After that, I only need to read each split file once.

2 answers

Sorting is hardly feasible for a 32 GB file, whether you use Python or a command-line tool (sort). A database seems like overkill, but could be used. However, if you don't want to use a database, I would suggest simply splitting the source file into per-tile files using the tile identifier.

You read a line, build the file name from the tile identifier, and append the line to that file. You continue until the source file is exhausted. It will not be very fast, but at least it has O(N) complexity, unlike sorting.

And, of course, sorting the individual files and concatenating them afterwards is possible. The main bottleneck in sorting a 32 GB file should be memory, not CPU.
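A rough sketch of that optional step, assuming the tile_<row>_<col>.tmp files produced by the splitting code below, that each per-tile file fits in memory, and a hypothetical output name sorted_output.txt:

    import glob
    import re

    def tile_key(path):
        """Extract (row, col) as integers from a name like 'tile_0_273.tmp'."""
        row, col = re.match(r"tile_(\d+)_(\d+)\.tmp$", path).groups()
        return (int(row), int(col))

    # Concatenate the per-tile files in tile-id order; optionally sort the
    # lines of each file by the x coordinate (third column) first.
    with open("sorted_output.txt", "w") as out:
        for path in sorted(glob.glob("tile_*.tmp"), key=tile_key):
            with open(path) as f:
                lines = sorted(f, key=lambda l: float(l.split()[2]))
            out.writelines(lines)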

Here is the splitting code, I think:

    def temp_file_name(l):
        """Build the output file name from the first two columns (the tile id)."""
        id0, id1 = l.split()[:2]
        return "tile_%s_%s.tmp" % (id0, id1)

    def split_file(name):
        ofiles = {}  # one open file handle per tile
        try:
            with open(name) as f:
                for l in f:
                    if l.strip():  # skip blank lines
                        fn = temp_file_name(l)
                        if fn not in ofiles:
                            ofiles[fn] = open(fn, 'w')
                        ofiles[fn].write(l)
        finally:
            for of in ofiles.values():  # itervalues() in Python 2
                of.close()

    split_file('srcdata1.txt')

But if there are many tiles, more than the number of files you can have open at once, you can do this:

    def split_file(name):
        with open(name) as f:
            for l in f:
                if l.strip():
                    fn = temp_file_name(l)
                    # reopen in append mode for every line
                    with open(fn, 'a') as of:
                        of.write(l)

And the most thorough way is to close some files and remove them from the dictionary once you reach the limit on the number of open files.
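A minimal sketch of that, assuming the same temp_file_name() helper as above; split_file_capped and the max_open value are illustrative, not a tuned implementation:

    def split_file_capped(name, max_open=100):
        """Like split_file(), but never keeps more than max_open files open."""
        ofiles = {}
        try:
            with open(name) as f:
                for l in f:
                    if l.strip():
                        fn = temp_file_name(l)
                        of = ofiles.get(fn)
                        if of is None:
                            if len(ofiles) >= max_open:
                                # close and evict an arbitrary open file
                                _, victim = ofiles.popitem()
                                victim.close()
                            # append mode, so an evicted file can be reopened safely
                            of = ofiles[fn] = open(fn, 'a')
                        of.write(l)
        finally:
            for of in ofiles.values():
                of.close()

    # split_file_capped('srcdata1.txt')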


A quick Google search led me to this recipe on ActiveState Code. It does not give any performance comparison, but it seems to do the job.

In short, it does something similar to what @Ellioh suggested, but as a ready-made recipe, so you may not have to reinvent the wheel.

