Sort a file with a huge amount of data, taking into account memory limitations

Points:

  • We process thousands of flat files per day, simultaneously.
  • Memory limitation is a serious problem.
  • We use a separate thread to process each file.
  • We do not sort by columns. Each row (record) in the file is treated as a single sort key.

What we cannot do:

  • We cannot use Unix/Linux sort commands.
  • We cannot use any database system, no matter how lightweight it is.

Now, we can't just load everything into a collection and use its sorting mechanism. It would eat up all the memory and the program would get a heap error.

In this situation, how would you sort the entries/lines of a file?

+24
java sorting file
Jan 18 '10 at 16:22
12 answers

It looks like you are looking for external sorting.

Basically, you first sort small chunks of data, write them back to disk, and then iterate over them to merge them.
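
A minimal Java sketch of that first phase, assuming each whole line is the sort key and that a few thousand lines fit in memory at once; the chunk size, class name, and temp-file naming are illustrative choices, not anything prescribed by this answer:

 import java.io.*;
 import java.nio.file.*;
 import java.util.*;

 public class ChunkSorter {

     // Reads the input in chunks of linesPerChunk lines, sorts each chunk in
     // memory, and writes it to its own temporary file.
     public static List<File> sortIntoChunks(File input, int linesPerChunk) throws IOException {
         List<File> chunks = new ArrayList<>();
         try (BufferedReader in = new BufferedReader(new FileReader(input))) {
             List<String> buffer = new ArrayList<>(linesPerChunk);
             String line;
             while ((line = in.readLine()) != null) {
                 buffer.add(line);
                 if (buffer.size() == linesPerChunk) {
                     chunks.add(writeSortedChunk(buffer));
                     buffer.clear();
                 }
             }
             if (!buffer.isEmpty()) {
                 chunks.add(writeSortedChunk(buffer));
             }
         }
         return chunks;   // each file is sorted on its own; they still need merging
     }

     private static File writeSortedChunk(List<String> lines) throws IOException {
         Collections.sort(lines);                 // in-memory sort of one chunk
         File chunk = Files.createTempFile("chunk-", ".txt").toFile();
         try (BufferedWriter out = new BufferedWriter(new FileWriter(chunk))) {
             for (String l : lines) {
                 out.write(l);
                 out.newLine();
             }
         }
         return chunk;
     }
 }

The sorted chunk files then have to be merged back together, which is what the other answers below spell out in more detail.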

+38
Jan 18 '10 at 16:27

You can read the file in smaller parts, sort them, and write them to temporary files. Then you read two of those again, merge them into a bigger temporary file, and so on. If only one is left, you have your sorted file. Basically this is the merge sort algorithm performed on external files. It scales well with arbitrarily large files but causes some extra file I/O.

Edit: If you have some knowledge about the likely variance of the lines in your files, you can use a more efficient algorithm (distribution sort). Simplified: you read the source file once and write each line to a temporary file that takes only lines starting with the same first character (or a certain range of first characters). Then you iterate over all the (now small) temporary files in ascending order, sort each one in memory, and append it directly to the output file. If a temporary file turns out to be too big to sort in memory, you can repeat the same process for it based on the 2nd character of its lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
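
A minimal sketch of the pairwise merge step from the first paragraph, assuming two already-sorted temporary files and whole lines as keys; temp-file bookkeeping (naming, deleting) is left out:

 import java.io.*;

 public class TwoWayMerge {

     // Merges two sorted text files into one sorted output file, line by line.
     public static void merge(File a, File b, File out) throws IOException {
         try (BufferedReader ra = new BufferedReader(new FileReader(a));
              BufferedReader rb = new BufferedReader(new FileReader(b));
              BufferedWriter w = new BufferedWriter(new FileWriter(out))) {
             String la = ra.readLine();
             String lb = rb.readLine();
             // Always emit the smaller of the two current lines and advance that reader.
             while (la != null && lb != null) {
                 if (la.compareTo(lb) <= 0) {
                     w.write(la); w.newLine();
                     la = ra.readLine();
                 } else {
                     w.write(lb); w.newLine();
                     lb = rb.readLine();
                 }
             }
             // Copy whatever is left in the file that is not yet exhausted.
             while (la != null) {
                 w.write(la); w.newLine();
                 la = ra.readLine();
             }
             while (lb != null) {
                 w.write(lb); w.newLine();
                 lb = rb.readLine();
             }
         }
     }
 }

Repeating this pairwise merge until a single file remains is exactly the external merge sort described above.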

+9
Jan 18

Despite your constraints, I would use the embedded SQLite3 database. Like you, I work weekly with 10-15 million flat-file lines; it imports and produces sorted data very quickly, and all you need is a small free executable (sqlite3.exe). For example: after downloading the .exe, at the command line you can do this:

 C:> sqlite3.exe dbLines.db
 sqlite> create table tabLines(line varchar(5000));
 sqlite> create index idx1 on tabLines(line);
 sqlite> .separator '\r\n'
 sqlite> .import 'FileToImport' tabLines

then

 sqlite> select * from tabLines order by line;

or save to a file:

 sqlite> .output out.txt
 sqlite> select * from tabLines order by line;
 sqlite> .output stdout
+9
Jan 18

I would spin up an EC2 cluster and run Hadoop's MergeSort.

Edit: not sure how much detail you want, or about what. EC2 is Amazon's Elastic Compute Cloud; it lets you rent virtual servers by the hour at low cost. Here is their site.

Hadoop is an open-source MapReduce platform designed to process large data sets in parallel. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e., a divide-and-conquer strategy). Here is its website.

As other posters have mentioned, external sorting is also a good strategy. I think the way I would decide between the two comes down to the size of the data and the speed requirements. A single machine is likely to be limited to processing one file at a time (since you will be using up the available memory). So look into something like EC2 only if you need to process files faster than that.

+7
Jan 18 '10 at 16:25

As already mentioned, you can process this in steps. I would like to explain it in my own words (I differ on point 3):

  • Read the file sequentially, processing N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files you want).

  • Sort the N records in memory, write them to a temp file, and loop until all T temp files are written.

  • Open all T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the destination file, and advance only in that file (a sketch of this merge step follows the numeric example below).




Benefits:

  • Memory consumption is as low as you want.
  • You only do twice the disk accesses of an everything-in-memory policy. Not bad! :-)



Example with numbers:

  • The original file has 1,000,000 records.
  • Choose to have 100 temp files, so read and sort 10,000 records at a time, and drop each batch into its own temporary file.
  • Open the 100 temporary files at the same time, reading the first record of each into memory.
  • Compare the first records, write the smallest, and advance that temporary file.
  • Loop on the previous step one million times.
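
A minimal Java sketch of that merge phase (steps 3-5 above), assuming each temp file from the earlier phase is already sorted; a PriorityQueue stands in for "compare the first records and pick the smallest", and the class and method names are illustrative:

 import java.io.*;
 import java.util.*;

 public class KWayMerge {

     // Couples a reader with the line it is currently offering to the merge.
     private static class ChunkReader implements Closeable {
         final BufferedReader reader;
         String current;

         ChunkReader(File f) throws IOException {
             reader = new BufferedReader(new FileReader(f));
             current = reader.readLine();
         }

         void advance() throws IOException {
             current = reader.readLine();
         }

         public void close() throws IOException {
             reader.close();
         }
     }

     public static void merge(List<File> sortedChunks, File destination) throws IOException {
         PriorityQueue<ChunkReader> heap =
                 new PriorityQueue<>(Comparator.comparing((ChunkReader c) -> c.current));

         for (File chunk : sortedChunks) {
             ChunkReader cr = new ChunkReader(chunk);
             if (cr.current != null) {
                 heap.add(cr);          // only non-empty chunks take part in the merge
             } else {
                 cr.close();
             }
         }

         try (BufferedWriter out = new BufferedWriter(new FileWriter(destination))) {
             while (!heap.isEmpty()) {
                 ChunkReader smallest = heap.poll();   // temp file with the smallest current line
                 out.write(smallest.current);
                 out.newLine();
                 smallest.advance();
                 if (smallest.current != null) {
                     heap.add(smallest);               // re-insert with its next line
                 } else {
                     smallest.close();                 // this temp file is exhausted
                 }
             }
         }
     }
 }

The heap keeps each "find the smallest of the 100 records" step at O(log 100) instead of a linear scan, which matters once the number of temp files grows.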



EDITED

You mentioned a multithreaded application, so I wonder ...

As this whole discussion shows, using less memory means less performance, and in this case the factor is dramatic. So I would also suggest using only one thread, processing only one sort at a time, rather than running it as a multi-threaded application.

If you run ten threads, each with a tenth of the available memory, your performance will be miserable, much worse than a tenth of the original time. If you use only one thread, queue the other 9 requests, and process them one after the other, your overall throughput will be much better; you will finish the ten tasks much faster.
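
A small sketch of that queueing idea, using a single-threaded executor so files are sorted strictly one after another and each sort gets the full memory budget; sortFile() is a placeholder for whichever sort routine you end up using:

 import java.io.File;
 import java.util.concurrent.*;

 public class SequentialSortService {

     // One worker thread: submitted files queue up and are processed one at a time.
     private final ExecutorService worker = Executors.newSingleThreadExecutor();

     public Future<?> submit(File file) {
         return worker.submit(() -> sortFile(file));
     }

     private void sortFile(File file) {
         // ... external sort of a single file goes here ...
     }

     public void shutdown() {
         worker.shutdown();
     }
 }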




After reading this answer: Sort a file with a huge amount of data, taking into account memory limitations, I suggest you consider the distribution sort described there. It could be a huge win in your context.

The improvement over my suggestion is that you do not need to open all the temporary files at the same time; you only ever open one of them. It saves your day! :-)

+5
Jan 18

If your constraint is only not to use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
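
A hedged sketch of what that could look like with embedded Derby, assuming derby.jar is on the classpath; the database, table, and column names are made up for the example:

 import java.io.*;
 import java.sql.*;

 public class DerbySort {

     public static void sort(File input, File output) throws Exception {
         // ";create=true" creates the embedded database on first use.
         try (Connection con = DriverManager.getConnection("jdbc:derby:sortdb;create=true")) {
             try (Statement st = con.createStatement()) {
                 st.executeUpdate("CREATE TABLE tab_lines(line_text VARCHAR(5000))");
             }

             // Load every line of the flat file, flushing batches so memory stays flat.
             con.setAutoCommit(false);
             try (PreparedStatement ins =
                          con.prepareStatement("INSERT INTO tab_lines(line_text) VALUES (?)");
                  BufferedReader in = new BufferedReader(new FileReader(input))) {
                 String line;
                 int count = 0;
                 while ((line = in.readLine()) != null) {
                     ins.setString(1, line);
                     ins.addBatch();
                     if (++count % 10_000 == 0) {
                         ins.executeBatch();
                     }
                 }
                 ins.executeBatch();
                 con.commit();
             }

             // Let the database do the sorting and stream the result back out.
             try (Statement st = con.createStatement();
                  ResultSet rs = st.executeQuery("SELECT line_text FROM tab_lines ORDER BY line_text");
                  BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
                 while (rs.next()) {
                     out.write(rs.getString(1));
                     out.newLine();
                 }
             }
         }
     }
 }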

+2
Jan 18

You can use the following divide-and-conquer strategy:

Create a function H() that can assign a number to each record in the input file. For a record r2 that should be sorted after a record r1, it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that each fit into memory, so you can sort them. Once you have done that, you can simply concatenate the sorted files to get one large sorted file.

Suppose you have this input file, where each line represents an entry

 Alan Smith
 Jon Doe
 Bill Murray
 Johnny Cash

Let's just build H() so that it uses the first letter of the record, so you can get up to 26 files, but in this example you just get 3:

 <file1>
 Alan Smith

 <file2>
 Bill Murray

 <file10>
 Jon Doe
 Johnny Cash

Now you can sort each individual file, which means swapping "Jon Doe" and "Johnny Cash" in <file10>. Then, if you just concatenate the 3 files, you have a sorted version of the input.

Note that you divide first and only conquer (sort) later. However, you must perform the partitioning in such a way that the resulting parts you need to sort do not overlap, which makes combining the results much simpler.

The method by which you implement the partition function H () is largely dependent on the nature of your input. After you figure out this part, the rest should be easy.
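
A minimal sketch of a first-letter H() like the one in the example, bucketing records into non-overlapping files that can then be sorted independently and concatenated in bucket order; the bucket-file naming and the uppercasing of the first character are assumptions made for the sketch:

 import java.io.*;
 import java.util.*;

 public class FirstCharPartitioner {

     // Splits the input into one bucket file per (uppercased) first character.
     // The returned TreeMap iterates in key order, i.e. in concatenation order.
     public static Map<Character, File> partition(File input, File workDir) throws IOException {
         Map<Character, File> buckets = new TreeMap<>();
         Map<Character, BufferedWriter> writers = new HashMap<>();
         try (BufferedReader in = new BufferedReader(new FileReader(input))) {
             String line;
             while ((line = in.readLine()) != null) {
                 char key = line.isEmpty() ? '\0' : Character.toUpperCase(line.charAt(0));
                 BufferedWriter w = writers.get(key);
                 if (w == null) {
                     File bucket = new File(workDir, "bucket-" + (int) key + ".txt");
                     buckets.put(key, bucket);
                     w = new BufferedWriter(new FileWriter(bucket));
                     writers.put(key, w);
                 }
                 w.write(line);
                 w.newLine();
             }
         } finally {
             for (BufferedWriter w : writers.values()) {
                 w.close();
             }
         }
         return buckets;   // sort each bucket in memory, then append them in key order
     }
 }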

+2
Jan 18

I know you mentioned not using a database, no matter how light it is... so maybe this is not an option. But what about an in-memory HSQLDB: feed the data to it, sort it with a query, then purge it. Just a thought.

0
Jan 18

You can use an SQLite db file: load the data into the db and let it do the sorting and return the results for you. Advantages: no need to worry about writing the best sorting algorithm. Disadvantage: you need disk space, and processing is slower. https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files

0
Feb 14 '13 at 11:09

Here is a way to do it without heavy use of sorting on the Java side and without using a database. Assumptions: you have 1 TB of space, and the files contain (or begin with) a unique number, but are unsorted.

Split the file into N parts.

Read these N files one by one and create one file for each line / number

Name each such file after the corresponding number. While naming, keep a counter updated so you know the smallest number.

You now have a root folder full of files that can be sorted by name, or you can pause your program to give yourself time to run an OS command that sorts the files by name. You can also do this programmatically.

Now that you have a folder of files sorted by name, use the counter to go through each file one by one, append its numbers to your OUTPUT file, and close it.

When you are done, you will have a large file with sorted numbers.
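
A hedged sketch of one way to read this answer, assuming each line starts with a unique number; the zero-padded file names (so that a plain name-sorted listing is already in numeric order) and the directory layout are assumptions, not something the answer specifies:

 import java.io.*;
 import java.nio.file.*;
 import java.util.*;

 public class FilePerRecordSort {

     public static void sort(File input, File workDir, File output) throws IOException {
         // Phase 1: one file per record, named after its (zero-padded) leading number.
         try (BufferedReader in = new BufferedReader(new FileReader(input))) {
             String line;
             while ((line = in.readLine()) != null) {
                 long key = Long.parseLong(line.split("\\s+")[0]);   // assumes a leading number
                 Path part = workDir.toPath().resolve(String.format("%020d", key));
                 Files.write(part, Collections.singletonList(line));
             }
         }

         // Phase 2: a name-sorted listing of the folder is already the sorted order.
         File[] parts = workDir.listFiles();
         Arrays.sort(parts, Comparator.comparing(File::getName));
         try (BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
             for (File part : parts) {
                 for (String l : Files.readAllLines(part.toPath())) {
                     out.write(l);
                     out.newLine();
                 }
             }
         }
     }
 }

Note that this creates one file per record, so file-system overhead is the price paid for avoiding an explicit sort.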

0
May 18 '15 at 2:48

You can do this with only two temporary files - source and destination - and with as little memory as you want. In the first step the source is your original file; in the last step the destination is the result file.

At each iteration:

  • read from the source file into a sliding buffer a chunk of data whose size is half the size of the buffer;
  • sort the entire buffer;
  • write the first half of the buffer to the destination file;
  • move the second half of the buffer to the beginning and repeat.

Keep a boolean flag that records whether you had to move any records in the current pass. If the flag remains false, your file is sorted. If it is set, repeat the process using the destination file as the new source.

Maximum number of iterations: (file size) / (buffer size) * 2
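
A hedged sketch of that procedure, treating whole lines as keys and stopping once a pass reads input that is already in order; as the answer implies, the original source file is reused as scratch space on alternate passes, and the buffer size is an arbitrary choice:

 import java.io.*;
 import java.nio.file.*;
 import java.util.*;

 public class SlidingBufferSort {

     public static void sort(File source, File destination, int bufferSize) throws IOException {
         File src = source;
         File dst = destination;
         boolean moved = true;
         while (moved) {
             moved = onePass(src, dst, bufferSize);
             File tmp = src; src = dst; dst = tmp;   // output of this pass feeds the next one
         }
         // After the loop, src holds the sorted data (it was the last file written).
         if (!src.equals(destination)) {
             Files.copy(src.toPath(), destination.toPath(), StandardCopyOption.REPLACE_EXISTING);
         }
     }

     // Returns true if any record was out of order, i.e. another pass is needed.
     private static boolean onePass(File in, File out, int bufferSize) throws IOException {
         boolean moved = false;
         List<String> buffer = new ArrayList<>(bufferSize);
         try (BufferedReader reader = new BufferedReader(new FileReader(in));
              BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
             String line;
             String previous = null;
             while ((line = reader.readLine()) != null) {
                 if (previous != null && line.compareTo(previous) < 0) {
                     moved = true;                  // the input of this pass is still unsorted
                 }
                 previous = line;
                 buffer.add(line);
                 if (buffer.size() == bufferSize) {
                     Collections.sort(buffer);
                     // Write out the first half, slide the second half to the front.
                     for (String s : buffer.subList(0, bufferSize / 2)) {
                         writer.write(s);
                         writer.newLine();
                     }
                     buffer = new ArrayList<>(buffer.subList(bufferSize / 2, bufferSize));
                 }
             }
             Collections.sort(buffer);              // flush whatever is left at the end
             for (String s : buffer) {
                 writer.write(s);
                 writer.newLine();
             }
         }
         return moved;
     }
 }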

0
Apr 27 '17 at 15:51

If you can move forward/backward in the file (seek) and rewrite parts of it, then you should use bubble sort.

You only need to scan the lines of the file, keeping just 2 lines in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no more swaps.

-3
Jan 18 '10 at 17:15


