Sort 50,000,000 numbers

Suppose we need to sort 50,000,000 numbers. suppose the numbers are stored in a file. What is the most efficient algorithm for solving this problem? Parallel sorting algorithm ...

How to do it? Perhaps a useful link)

I can not use the standard algorithm

Therefore, I ask you about methods and algorithms :)

Okay .. I read about parallel pooling ... But this is not clear to me.

, first version

+6
java sorting algorithm parallel-processing
source share
7 answers

On top of my head, merge sort seems to be the best option when it comes to parallelization and distribution , as it uses a split and capture approach. For more google information for "parallel merge sort" and "distributed merge sort".

For single-machine, multiple cores , see. Correctly multi-threaded quicksort or mergesort algo in Java? . If you can use Java 7 fork / join, see: “ Java 7: more concurrency ” and “ Parallelism with Fork / Join in Java 7 ”.

To distribute it across many machines , see Hadoop , it has a distributed merge sort implementation: see MergeSort and MergeSorter . Also interesting: Hadoop sorts petabytes at 16.25 hours and terabytes in 62 seconds

+8
source share

50 million are not particularly large. I would just read them in my memory. Sort them and write. It only takes a few seconds. How fast do you need it? How much do you need it?

On my old lab, it took 28 seconds. If I had more processors, it could be a little faster, but most of the time is spent reading and writing a file (15 seconds), which will not be faster.

One important factor is the size of your cache. The comparison itself is very cheap if the data is in the cache. Since the L3 cache is shared, a single thread is all you need to make full use of it.

public static void main(String...args) throws IOException { generateFile(); long start = System.currentTimeMillis(); int[] nums = readFile("numbers.bin"); Arrays.sort(nums); writeFile("numbers2.bin", nums); long time = System.currentTimeMillis() - start; System.out.println("Took "+time+" secs to sort "+nums.length+" numbers."); } private static void generateFile() throws IOException { Random rand = new Random(); int[] ints = new int[50*1000*1000]; for(int i= 0;i<ints.length;i++) ints[i] = rand.nextInt(); writeFile("numbers.bin", ints); } private static int[] readFile(String filename) throws IOException { DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(filename), 64*1024)); int len = dis.readInt(); int[] ints = new int[len]; for(int i=0;i<len;i++) ints[i] = dis.readInt(); return ints; } private static void writeFile(String name, int[] numbers) throws IOException { DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(name), 64*1024)); dos.writeInt(numbers.length); for (int number : numbers) dos.writeInt(number); dos.close(); } 
+19
source share

For sorting than many items, your best shot is a Merge Sort . These are usually the algorithms used by databases. Although not as fast as Quick Sort , it uses intermediate storage, so you don't need a huge amount of memory to perform sorting.

Also, as pointed out in comments by sje397 and Scott, Merge Sort is very parallel.

+4
source share

It depends a lot on the problem area. For example, if all numbers are positive ints, the best way would be to create an array from 0-MAX_INT, and then just count how many times each number occurs when reading the file, and then print each int using a non-zero count, however, many times he happened. This is O (n) sorting. There is an official name there, but I forgot what it is.

By the way, I was asked this question in an interview with Google. Of the problematic issues, I came up with this solution, and it looks like this was the answer they were looking for. (I refused work because I did not want to move.)

+3
source share

There are not many of them. If they are extended by 10 bytes, for example, it will be an array of 500 MB, it can almost stay on my phone!;) Therefore, I would say that you need to go to Quicksort, if only that.

+2
source share

Do not be afraid of large quantities. in fact, 50,000,000 rooms are not that big. therefore, if the numbers were integers, then each number is 4 bytes in size, so the total memory required for this array is 50,000,000 * 4/1024/1024 = 190.7 megabytes, which is relatively small. Once you get the math, you can move on to QuickSort, which works in O (nLogn). note that the built-in sorting method in .net arrays uses QuickSort, im not sure if this is the case in java either.

sorting 250,000,000 integers on my machine took about 2 minutes, so go :)

+2
source share

50e6 numbers are currently very small, don't make things more complicated than they should be ...

bash$ sort < file > sorted.file

0
source share

All Articles