External shuffling: shuffling large amounts of data from memory

Question

External shuffling: shuffling large amounts of data from memory

I am looking for a way to shuffle a large amount of data that does not fit into memory (about 40 GB).

I have about 30 million records of variable length, which are stored in one large file. I know the start and end positions of each entry in this file. I need to shuffle this data that does not fit in RAM.

The only solution I thought was to shuffle an array containing numbers from 1 to N , where N is the number of records with Fisher-Yates , and then copy the records to a new file according to this order. Unfortunately, this solution involves many search operations and therefore will be very slow.

Is there a better solution for shuffling large amounts of data with uniform distribution?

+4

java algorithm bigdata

Edouard Jan 31 '13 at 14:25

source share

4 answers

OldCurmudgeon · Answer 1 · 2013-01-31T14:41:05+0000

First you get a shuffle problem from the face. Do this by inventing a hash algorithm for your records that produces random results, and then do the usual external sorting in a hash.

Now you turned your shuffle into sort , your problems turned into finding an effective external sorting algorithm that fits your pocket and memory limits. Now it will be as easy as google .

Patricia Shanahan · Answer 2 · 2013-01-31T14:42:14+0000

I suggest sticking with your general approach, but inverting the map before making the actual copy. This way you read sequentially and make scattered notes, not the other way around.

Reading should be done on request before the program can continue. A record can be left in the buffer, increasing the likelihood of accumulating more than one record on the same disk block before recording.

Spatarel · Answer 3 · 2013-02-02T01:01:40+0000

Premise

From what I understand, using the Fisher-Yates algorithm and the data that you have about the positions of the records, you should be able to get (and calculate) the list:

 struct Entry { long long sourceStartIndex; long long sourceEndIndex; long long destinationStartIndex; long long destinationEndIndex; }

Problem

From now on, the naive solution is to search for each record in the source file, read it, and then search for the new record position in the target file and write it.

The problem with this approach is that it uses too many queries.

Decision

The best way to do this is to reduce the number of requests using two huge buffers for each of the files.

I recommend a small buffer for the source file (say, 64 MB) and a large buffer for the target file (as large as the user can afford - say, 2 GB).

Initially, the destination buffer will be mapped to the first 2 GB of the destination file. At this point, read the entire source file in pieces of 64 MB in the source buffer. When you read it, copy the corresponding entries to the destination buffer. When you get to the end of the file, the output buffer should contain all the necessary data. Write it to the destination file.

Then map the output buffer to the next 2 GB of the destination file and repeat the procedure. Continue until you write the entire output file.

Attention

Since the records are arbitrary sizes, it is very likely that you will have suffixes and prefixes for the records at the beginning and end of the buffers, so you need to make sure that you copied the data correctly!

Estimated Cost of Time

The execution time depends mainly on the size of the source file, the RAM available for the application, and the speed of reading on the hard disk. Assuming that the file is 40 GB in size, 2 GB of RAM and the read speed on the hard disk is 200 MB / s, the program will need to read 800 GB of data (40 GB * (40 GB / 2 GB)). Assuming the hard drive is not very fragmented, the time taken to search will be negligible. This means that reading will take one hour! But if, fortunately, the user has 8 GB of RAM available for your application, the time can only be reduced to 15-20 minutes.

I hope this will be enough for you, because I do not see another faster way.

TheWhiteRabbit · Answer 4 · 2013-01-31T14:27:30+0000

Logically separate database records (e.g. alphabetically)
Creating Indexes Based on Your Created Partitions
create a DAO to increase index sensitivity

External shuffling: shuffling large amounts of data from memory

Premise

Problem

Decision

Attention

Estimated Cost of Time

More articles: