How to shuffle a text file on disk in Python

I work with a text file of about 12 * 10^6 lines, which is stored on my hard drive. The file structure is:

    data|data|data|...|data
    data|data|data|...|data
    data|data|data|...|data
    ...
    data|data|data|...|data

There is no header, and no identifier to uniquely identify the lines.

Since I want to use it for machine learning purposes, I need to make sure that there is no order in the text file that can affect stochastic learning.

I usually load such files into memory and shuffle them there before writing them back to disk. Unfortunately, this time that is not possible due to the file size, so I need to do the shuffle directly on disk (and assume I have no problems with disk space). Any idea how to efficiently (with the lowest possible complexity, i.e. writing to disk) manage such a task with Python?

0
python shuffle text-files bigdata
4 answers

All but one of these ideas use O(N) memory, but if you use an array.array or numpy.ndarray we are talking about N*4 bytes, which is significantly smaller than the whole file. (I'll just use a plain list; if you need help converting to a more compact type, I can show that too.)
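
For instance, a minimal sketch of such a compact shuffled index list with numpy (assuming numpy is available; the line count here is just the ~12M from the question) might look like this:

    import numpy as np

    # A shuffled index list for ~12 million lines held as 4-byte integers
    # (about 48 MB) instead of a list of Python int objects.
    linecount = 12 * 10**6
    indices = np.arange(linecount, dtype=np.uint32)
    np.random.shuffle(indices)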


Using a temporary database and index list:

    import contextlib
    import dbm
    import os
    import random

    with contextlib.closing(dbm.open('temp.db', 'n')) as db:
        # First pass: store every line in the temporary database, keyed by its line number.
        with open(path) as f:
            linecount = 0
            for i, line in enumerate(f):
                db[str(i)] = line
                linecount = i + 1
        # Shuffle the index list in memory (random.shuffle shuffles in place and returns None).
        shuffled = list(range(linecount))
        random.shuffle(shuffled)
        # Second pass: write the lines back out in shuffled order.
        with open(path + '.shuffled', 'w') as f:
            for i in shuffled:
                f.write(db[str(i)].decode())  # dbm returns values as bytes
    os.remove('temp.db')

That is 2N single-line disk operations and 2N single-dbm-key operations, which should amount to roughly 2N log N single-disk-operation equivalents, so the total complexity is O(N log N).


If you use a relational database like sqlite3 instead of dbm, you don’t even need an index list because you can just do this:

 SELECT * FROM Lines ORDER BY RANDOM() 

This has the same time complexity as above, and the space complexity is O(1) in theory instead of O(N). In practice, you need a DBMS that can feed you one row at a time from a 100M-row result set without storing those 100M rows on either side.
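
As a rough illustration, a minimal sqlite3 sketch of this approach (the table name, column name, and temporary file name are my own; note that SQLite may still need temporary space to perform the sort) could be:

    import os
    import sqlite3

    # Load the lines into a throwaway SQLite database.
    conn = sqlite3.connect('temp.sqlite')
    conn.execute('CREATE TABLE Lines (line TEXT)')
    with open(path) as f:
        conn.executemany('INSERT INTO Lines (line) VALUES (?)', ((line,) for line in f))
    conn.commit()

    # Stream the rows back out in random order; the cursor is iterated lazily.
    with open(path + '.shuffled', 'w') as out:
        for (line,) in conn.execute('SELECT line FROM Lines ORDER BY RANDOM()'):
            out.write(line)

    conn.close()
    os.remove('temp.sqlite')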


Another option, without using a temporary database, is O(N**2) in theory, but in practice it may be faster if you happen to have enough memory for the line cache to be helpful:

    import linecache
    import random

    with open(path) as f:
        linecount = sum(1 for _ in f)
    # linecache numbers lines starting from 1; shuffle the 1-based line numbers in memory.
    shuffled = list(range(1, linecount + 1))
    random.shuffle(shuffled)
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(linecache.getline(path, i))

Finally, at the cost of doubling the size of the index list, we can avoid temporary disk storage entirely. In practice, though, this may be much slower, because it does a lot more random-access reading, which drives are not so good at.

    import random

    with open(path, 'rb') as f:
        # Record the byte offset at which each line starts (binary mode makes seek/read by offset reliable).
        linestarts = [0]
        for line in f:
            linestarts.append(linestarts[-1] + len(line))
        # Pair consecutive offsets into (start, stop) byte ranges, one per line, and shuffle them.
        lineranges = list(zip(linestarts, linestarts[1:]))
        random.shuffle(lineranges)
        with open(path + '.shuffled', 'wb') as f1:
            for start, stop in lineranges:
                f.seek(start)
                f1.write(f.read(stop - start))
+6

This suggestion is based on my comment above. It relies on the compressed lines still fitting into memory; if they do not, other solutions will be required.

    import zlib
    from random import shuffle

    def heavy_shuffle(filename_in, filename_out):
        # Keep every line in memory, but compressed, to reduce the memory footprint.
        with open(filename_in, 'rb') as f:
            zlines = [zlib.compress(line, 9) for line in f]
        shuffle(zlines)
        with open(filename_out, 'wb') as f:
            for zline in zlines:
                f.write(zlib.decompress(zline))  # each line already ends with its newline

In my experience, zlib is fast while bz2 offers better compression, so you may want to compare both.

Also, if you can get away with chunking, say, n lines together, then this is likely to raise the compression ratio.
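
As a sketch of what that chunking could look like (the function name and the chunk size n are my own, and note that the result is then no longer a perfectly uniform line-level shuffle, since lines are only reshuffled across and within chunks):

    import zlib
    from random import shuffle

    def chunked_shuffle(filename_in, filename_out, n=1000):
        # Compress groups of n lines together to improve the compression ratio.
        chunks = []
        with open(filename_in, 'rb') as f:
            lines = []
            for line in f:
                lines.append(line)
                if len(lines) == n:
                    chunks.append(zlib.compress(b''.join(lines), 9))
                    lines = []
            if lines:  # leftover lines that do not fill a whole chunk
                chunks.append(zlib.compress(b''.join(lines), 9))

        # Shuffle the chunk order, then shuffle the lines inside each chunk as it is written out.
        shuffle(chunks)
        with open(filename_out, 'wb') as f:
            for chunk in chunks:
                lines = zlib.decompress(chunk).splitlines(keepends=True)
                shuffle(lines)
                f.writelines(lines)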


I was curious whether useful compression was likely, so here is an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:

The best-case scenario (for example, many lines have the same numbers):

    In [38]: data = '0.000|'*200

    In [39]: len(data)
    Out[39]: 1200

    In [40]: zdata = zlib.compress(data, 9)

    In [41]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
    zlib compression ratio:  0.98

    In [42]: bz2data = bz2.compress(data, 9)

    In [43]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
    bz2 compression ratio:  0.959166666667

As expected, the best case is really good: a >95% compression ratio.

In the worst case (randomized data):

    In [44]: randdata = '|'.join(['{:.3f}'.format(x) for x in np.random.randn(200)])

    In [45]: zdata = zlib.compress(randdata, 9)

    In [46]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
    zlib compression ratio:  0.5525

    In [47]: bz2data = bz2.compress(randdata, 9)

    In [48]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
    bz2 compression ratio:  0.5975

Surprisingly, the worst case is not that bad: a ~60% compression ratio, but it could be problematic if you only have 8 GB of memory (60% of 15 GB is 9 GB).

+2

This problem can be seen as a problem of efficiently managing memory pages in order to reduce swap-file I/O. Let your buffer buf be a list of contiguous chunks of the file that you want written to the output file, and let a chunk be a list of a fixed number of whole lines.

Now, generate a random sequence and remap the returned values to the appropriate chunk numbers and line offsets within those chunks.

This operation leaves you with a sequence of numbers in [1..num of chunks], which can be described as a sequence of accesses to chunks of memory contained in pages numbered [1..num of chunks]. For the online variant (as in a real OS) there is no optimal strategy for this problem, but since you know the actual sequence of page references, an optimal solution can be found here.

What is the gain of this approach? The pages that are used most often are reread from the hard drive the least, which means less I/O to read the data. Also, given that your chunk size is big enough to minimize paging relative to the memory footprint, a lot of the time consecutive lines of the output file will be taken from the same chunk stored in memory (or from another one that has not yet been swapped out) rather than reread from disk.

It may not be the easiest solution (though the optimal page-swapping algorithm is easy to write), but it could be a fun exercise, couldn't it?
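
To make the idea concrete, here is a minimal sketch under my own assumptions (the function name, chunk size, cache size, and the choice of Belady's offline-optimal eviction rule are all mine, not something given in the question):

    import random
    from collections import deque

    def shuffle_via_chunk_cache(path, out_path, lines_per_chunk=10000, cache_size=64):
        # Treat fixed-size chunks of whole lines as "pages", generate the random
        # access sequence up front, and serve it from a small chunk cache using
        # Belady's optimal replacement, which is only possible because the whole
        # reference sequence is known in advance.

        # Pass 1: record the byte offset at which each chunk starts and count the lines.
        chunk_offsets = []
        total_lines = 0
        with open(path, 'rb') as f:
            offset = 0
            for i, line in enumerate(f):
                if i % lines_per_chunk == 0:
                    chunk_offsets.append(offset)
                offset += len(line)
                total_lines = i + 1

        # Random access sequence, remapped to (chunk number, line offset within chunk).
        order = list(range(total_lines))
        random.shuffle(order)
        refs = [(i // lines_per_chunk, i % lines_per_chunk) for i in order]

        # Future reference positions of every chunk, used by the eviction rule.
        future = {}
        for pos, (chunk, _) in enumerate(refs):
            future.setdefault(chunk, deque()).append(pos)

        cache = {}  # chunk number -> list of that chunk's lines
        with open(path, 'rb') as f, open(out_path, 'wb') as out:
            for chunk, line_no in refs:
                future[chunk].popleft()  # this reference is now in the past
                if chunk not in cache:
                    if len(cache) >= cache_size:
                        # Evict the cached chunk whose next use is farthest in the future.
                        victim = max(cache, key=lambda c: future[c][0] if future[c] else float('inf'))
                        del cache[victim]
                    f.seek(chunk_offsets[chunk])
                    cache[chunk] = [f.readline() for _ in range(lines_per_chunk)]
                out.write(cache[chunk][line_no])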

0

Assuming that disk space is not a problem for you, I would create multiple files to hold the data.

    import os
    import random

    PMSize = 100  # A smaller value means bigger buckets, i.e. more primary memory used per bucket

    # One temporary bucket file per slot; 'w+' lets us read them back later.
    shufflers = [open('file' + str(x), 'w+') for x in range(PMSize)]

    # Scatter the input lines randomly across the bucket files.
    with open('filename') as infile:
        for line in infile:
            i = random.randint(0, len(shufflers) - 1)
            shufflers[i].write(line)

    # Concatenate the buckets back, shuffling each (small) bucket in memory first
    # so lines within a bucket do not keep their original relative order.
    with open('filename', 'w') as outfile:
        for bucket in shufflers:
            bucket.seek(0)
            lines = bucket.readlines()
            random.shuffle(lines)
            outfile.writelines(lines)
            bucket.close()

    # Clean up the temporary bucket files.
    for x in range(PMSize):
        os.remove('file' + str(x))

Your memory usage will be controlled by PMSize. The time complexity will be around O(N + PMSize).

0
