Shuffle a large list of items without loading into memory

I have a file with ~2 billion lines of text (~200 GB). I want to create a new file containing the same lines, but in randomly shuffled order. I cannot hold all the data in memory. Is there a good way to do this in Python / on the command line that takes a reasonable amount of time (a couple of days)?

I thought I could touch 50 empty files, stream through the 2-billion-line file, randomly distributing each line into one of the 50 files, and then cat the 50 files back together. Would there be any serious systematic bias in this method?
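A rough sketch of the split step I have in mind (the file and bucket names are just placeholders):

import random

buckets = [open('bucket_%02d.txt' % i, 'w') for i in range(50)]
with open('big_input.txt') as f:
    for line in f:
        # send each line to a bucket chosen uniformly at random
        random.choice(buckets).write(line)
for b in buckets:
    b.close()
# afterwards: cat bucket_*.txt > shuffled.txt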

+8
python shuffle
4 answers

If you can reserve 16 GB of memory for this program, I wrote a tool called sample that shuffles the lines of a file by reading in their byte offsets, shuffling the offsets, and then printing output by seeking through the file at the shuffled offsets. It uses 8 bytes per 64-bit offset, thus 16 GB for a two-billion-line input.

It won't be fast, but on a machine with enough memory, sample will shuffle files that are large enough to make GNU shuf fail. In addition, it uses mmap routines to try to minimize the I/O expense of the second pass through your file. It also has a few other options; see --help for more details.

By default, this program samples without replacement and shuffles by single lines. If you want to shuffle with replacement, or if your input is in FASTA, FASTQ or another multi-line format, you can add options to adjust how the sampling is done. (Or you can apply the alternative approach I link to in a Perl gist below, but sample handles these cases.)

If your FASTA sequences are on every two lines, that is, they alternate between a sequence header on one line and the sequence data on the next, you can still shuffle with sample, and with half the memory, since you only shuffle half the number of offsets. See the --lines-per-offset option; you would specify 2, for instance, to shuffle pairs of lines.

In the case of FASTQ files, records are split every four lines. You can specify --lines-per-offset=4 to shuffle a FASTQ file with a quarter of the memory needed to shuffle a single-line file.

In addition, I have a gist here, written in Perl, that will sample sequences without replacement from a FASTA file regardless of how many lines each sequence spans. Note that this is not exactly the same as shuffling the whole file, but you can use it as a starting point, since it collects the offsets. Instead of sampling some of the offsets, you would delete line 47, which sorts the shuffled indices, and then use file seek operations to read through the file using the shuffled-index list directly.

Again, this will not be fast, because you are going through a very large file out of order, but storing offsets is much cheaper than storing whole lines, and adding mmap routines could help a bit with what is essentially a series of random-access operations. If you are working with FASTA, you will have even fewer offsets to store, so your memory usage (excluding any relatively minor container and program overhead) should be at most 8 GB, and likely less, depending on its structure.
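For illustration, here is a rough Python sketch of the same byte-offset idea (this is not the sample tool itself; the function name and the use of array('q') to keep each offset at 8 bytes are my own choices):

from array import array
import random

def shuffle_by_offsets(in_path, out_path):
    # Pass 1: record the byte offset at which every line starts (8 bytes each).
    offsets = array('q')
    pos = 0
    with open(in_path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Shuffle the offsets, then seek to each one in turn and copy that line out.
    random.shuffle(offsets)
    with open(in_path, 'rb') as f, open(out_path, 'wb') as out:
        for off in offsets:
            f.seek(off)
            out.write(f.readline())

The second loop is a long series of random seeks, which is exactly the slow part described above.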

+5

What about:

import mmap
from random import shuffle


def find_lines(data):
    # Yield the offset of each newline character in the memory-mapped data.
    end = data.find(b'\n')
    while end != -1:
        yield end
        end = data.find(b'\n', end + 1)


def shuffle_file(in_file, out_file):
    with open(in_file, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Collect the (start, end) byte offsets of every line, then shuffle them.
        start = 0
        lines = []
        for end in find_lines(data):
            lines.append((start, end))
            start = end + 1
        shuffle(lines)
        # Write the lines back out in shuffled order by slicing the mmap.
        with open(out_file, 'wb') as out:
            for start, end in lines:
                out.write(data[start:end + 1])


if __name__ == "__main__":
    shuffle_file('data', 'result')

This solution only has to store the offsets of all the lines in the file, which is 2 words per line, plus container overhead.

+5

You can check out the HugeFileProcessor tool. It is similar to @Alex-Reynolds's sample, but should be significantly faster, as there would be no seeking.

Here are the details on how the shuffling is implemented. It requires specifying batchSize, the number of lines to keep in RAM when writing the output. The more the better (unless you are out of RAM), because the total shuffling time will be (number of lines in sourceFile) / batchSize * (time to read sourceFile completely). Please note that the program shuffles the whole file, not per batch.
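To get a feel for that formula with the numbers from this question and the benchmark below (illustrative only):

import math

lines_count = 2_000_000_000   # ~2 billion lines, as in the question
batch_size = 3_500_000        # the batchSize used in the benchmark below
passes = math.ceil(lines_count / batch_size)
print(passes)                 # 572 full sequential reads of the source file

So for a file this large the total time is dominated by how many sequential passes you can afford, which is why a larger batchSize helps so much.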

The algorithm is as follows.

  • Count the lines in the source file. This is done by simply reading the whole file line by line. (See some comparisons here.) It also measures how long a single full read of the file takes, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.

  • Now that we know the total number of lines, linesCount, we can create an index array of size linesCount and shuffle it using Fisher-Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk or anything like that.

  • Now the actual code. We need to get the lines from sourceFile in the order we just computed, but we cannot read the whole file into memory. So we simply split the task (see the Python sketch after this list).

    • We go through sourceFile, reading every line but keeping in memory only those lines that fall into the first batchSize entries of orderArray. When we have collected all of them, we write them to outFile in the required order, and that is batchSize / linesCount of the work done.
    • Then we repeat the whole process again and again, taking the next chunk of orderArray and reading sourceFile from start to end for each chunk. Eventually the whole orderArray is processed and we are done.
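A minimal Python sketch of this multipass idea (this is not the tool's own code; multipass_shuffle and its parameter names are mine, and in pure Python the order list for two billion lines would itself be quite large):

import random

def multipass_shuffle(source_path, out_path, batch_size):
    # Pass 0: count the lines (this also times one full sequential read).
    with open(source_path) as f:
        lines_count = sum(1 for _ in f)

    # Global shuffle order over the whole file (Fisher-Yates via random.shuffle).
    order = list(range(lines_count))
    random.shuffle(order)

    with open(out_path, 'w') as out:
        # One full sequential read of the source per batch of the order array.
        for batch_start in range(0, lines_count, batch_size):
            batch = order[batch_start:batch_start + batch_size]
            wanted = {line_no: slot for slot, line_no in enumerate(batch)}
            buffered = [None] * len(batch)
            with open(source_path) as f:
                for line_no, line in enumerate(f):
                    slot = wanted.get(line_no)
                    if slot is not None:
                        buffered[slot] = line
            out.writelines(buffered)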

Why does it work?

Because all we do is read the source file from start to end. There is no seeking forward or backward, and that is what HDDs like. The file gets read in chunks according to the internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.

Some numbers

On my machine (Core i5, 16 GB RAM, Win8.1, Toshiba DT01ACA200 2TB NTFS HDD) I was able to shuffle a 132 GB file (84,000,000 lines) in about 5 hours using a batchSize of 3,500,000. With a batchSize of 2,000,000 it took about 8 hours. The read speed was about 118,000 lines per second.

+2

I think the simplest approach in your case is a recursive split-shuffle-merge. You define two numbers: the number of files N you split each file into (typically between 32 and 256), and the size M at which you can shuffle a file directly in memory (typically around 128 MB). Then the pseudocode is:

def big_shuffle(file):
    if size_of(file) < M:
        memory_shuffle(file)
    else:
        create N files
        for line in file:
            write_randomly_to_one_of_the_N_files
        for sub_file in N_files:
            big_shuffle(sub_file)
        merge_the_N_files_one_line_each

Since each of the sub-files is shuffled, you should not have any bias.

It will be much slower than Alex Reynolds' solution (because of all the disk I/O), but its only limit is disk space.
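A runnable Python version of the pseudocode above might look like this (the values of N and M, the temp-file handling, and the choice to simply concatenate the shuffled sub-files instead of interleaving them line by line are my own; concatenation should be just as unbiased, since every line already went to a randomly chosen sub-file):

import os
import random
import tempfile

N = 64                      # fan-out per split (the answer suggests 32 to 256)
M = 128 * 1024 * 1024       # ~128 MB: small enough to shuffle in memory

def big_shuffle(path):
    """Shuffle the lines of the file at `path` in place."""
    if os.path.getsize(path) < M:
        with open(path) as f:
            lines = [l if l.endswith('\n') else l + '\n' for l in f]
        random.shuffle(lines)
        with open(path, 'w') as f:
            f.writelines(lines)
        return
    # Split: send each line to one of N temporary sub-files chosen at random.
    subs = [tempfile.NamedTemporaryFile('w', delete=False) for _ in range(N)]
    with open(path) as f:
        for line in f:
            if not line.endswith('\n'):
                line += '\n'    # guard against a missing final newline
            random.choice(subs).write(line)
    for s in subs:
        s.close()
    # Shuffle each sub-file recursively, then write them back over the original.
    with open(path, 'w') as out:
        for s in subs:
            big_shuffle(s.name)
            with open(s.name) as f:
                out.writelines(f)
            os.unlink(s.name)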

+1
