Sort a large collection when showing progress

Question


What is the best way to sort a collection while updating a progress bar? I currently have this code:

    for (int i = 0; i < items.size(); i++) {
        progressBar.setValue(i);
        // Uses Collections.binarySearch:
        CollectionUtils.insertInOrder(sortedItems, items.get(i));
    }

This shows progress, but the progress indicator slows down as the number of items in sortedItems increases. Does anyone have a better approach? Ideally, I would like to use an interface similar to Collections.sort(), so that I can try out different sorting algorithms.

Any help would be great!

As a little background, this code pulls a lot of documents (1-10 million) from Lucene and runs a custom comparator over them. Sorting them by writing data to disk would be too slow to be practical. Most of the cost is in reading an element from disk and then running the comparator over the elements. My computer has a lot of memory, so there are no problems with swapping to disk, etc.

In the end, I went with Stephen's solution, as it was very clean and allowed me to easily add a multi-threaded sorting algorithm.

+6

java sorting progress-bar

Luke quinane Oct 18 '10 at 1:47


7 answers

Can you use an indeterminate progress bar? It still gives the user some feedback that something is happening. Your code would look like this:

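    progressBar.setIndeterminate(true);
    ArrayList sorted = new ArrayList(items);
    Collections.sort(sorted);
    // Switch back to a normal bar and make the text visible.
    progressBar.setIndeterminate(false);
    progressBar.setStringPainted(true);
    progressBar.setString("Hey you're done!");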

I think you will also get much better performance from the built-in sort than from the binary-search insertion sort you are doing now.

+1

Greg case Oct 18 '10 at 3:41


Why not implement your own merge sort (which is what Collections.sort does) and update the progress bar at key points in the algorithm (say, after each merge that covers more than 5% of the array)?
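A rough sketch of that idea, assuming a bottom-up merge sort on an array. The class name ProgressMergeSort and the callback parameter are made up for illustration and are not part of the JDK. Work is counted in merged elements, and the callback fires roughly every 5% of the total. In a Swing application the callback would typically hand the value to the progress bar via SwingUtilities.invokeLater.

```java
import java.util.Comparator;
import java.util.Random;
import java.util.function.DoubleConsumer;

// Hypothetical helper (not a JDK class): bottom-up merge sort that reports
// the fraction of merge work completed. Assumes at most 2^30 elements.
class ProgressMergeSort {

    static <T> void sort(T[] a, Comparator<? super T> cmp, DoubleConsumer onProgress) {
        int n = a.length;
        if (n < 2) {
            onProgress.accept(1.0);
            return;
        }
        T[] src = a;
        T[] dst = a.clone();
        // Number of passes over the data: ceil(log2(n)).
        int totalPasses = 32 - Integer.numberOfLeadingZeros(n - 1);
        long totalWork = (long) n * totalPasses;   // elements merged in total
        long step = Math.max(1, totalWork / 20);   // report roughly every 5%
        long done = 0;
        long nextReport = step;

        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = Math.min(lo + width, n);
                int hi = Math.min(lo + 2 * width, n);
                merge(src, dst, lo, mid, hi, cmp);
                done += hi - lo;
                if (done >= nextReport && done < totalWork) {
                    onProgress.accept((double) done / totalWork);
                    nextReport = done + step;
                }
            }
            // The freshly merged data becomes the source of the next pass.
            T[] tmp = src;
            src = dst;
            dst = tmp;
        }
        if (src != a) {
            System.arraycopy(src, 0, a, 0, n);
        }
        onProgress.accept(1.0);
    }

    private static <T> void merge(T[] src, T[] dst, int lo, int mid, int hi,
                                  Comparator<? super T> cmp) {
        int i = lo;
        int j = mid;
        for (int k = lo; k < hi; k++) {
            // Take from the left run on ties so the sort stays stable.
            if (i < mid && (j >= hi || cmp.compare(src[i], src[j]) <= 0)) {
                dst[k] = src[i++];
            } else {
                dst[k] = src[j++];
            }
        }
    }

    public static void main(String[] args) {
        Integer[] data = new Integer[1_000_000];
        Random random = new Random(1);
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextInt();
        }
        sort(data, Comparator.<Integer>naturalOrder(),
             fraction -> System.out.printf("%.0f%%%n", fraction * 100));
    }
}
```

Every pass touches all n elements, so progress measured this way should advance at a fairly even rate, unlike counting insertions into a growing list.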

+1

Simonc Oct 18 '10 at 5:16


If you are simply comparing the sort time, print the time before and after the sort.

Predicting how long a sort will take in the wild can be difficult. For some sorts, it depends on the order of the input. I would use i/(double) items.size() to get the ratio of work done and call it a good day. You could update the bar every items.size()/100 iterations. There is no reason to hammer the poor progress bar with useless updates.

0

Tony ennis Oct 18 '10 at 1:59


The problem is the sorting mechanism itself: as sortedItems grows, insertInOrder will by definition take longer, since each call is most likely an O(lg n) + O(n) operation (a binary search to find the position of the next element, followed by the insertion itself). Inevitably, as your collection grows, inserting the next item in the right place takes longer.

The only way to get a progress bar that advances linearly in time is to use some approximation similar to the inverse of the lg function, since sorting the first 1000 items can take about as long as sorting the last 10 (which is, of course, a generalization).

0

nearlymonolith Oct 18 '10 at 2:10


Maybe I missed something, because no one else has mentioned it, but it sounds like the runtime type of your source List object does not implement RandomAccess, and therefore your call to Collections.binarySearch runs in O(n) time. That would slow things down quite a bit, and very noticeably so as you double the number of items to sort.

Also, if you use, for example, a LinkedList for sortedItems, then the insertion is also O(n).

If so, then it makes sense that when you go from 1 million to 2 million items, your expected time would roughly double as well.

To diagnose which of the 2 List objects is problematic:

If the progress bar is slow from the start, it is items; try using a different container, something tree-like or hash-based.
If the progress bar gets slower and slower as it approaches 100%, it is sortedItems; same advice as above.

Note that it may be both List objects causing the slowdown. Also, this has nothing to do with the progress bar. The problem you describe is an algorithmic one in the sorting, not in updating the progress bar.
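A quick way to check the first suspicion is to test the lists at runtime. The sketch below is only an illustration; the class and method names are made up. It tests for the RandomAccess marker interface and copies into an ArrayList when the list lacks it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.RandomAccess;

// Hypothetical helper for diagnosing slow indexed access on a List.
class RandomAccessCheck {

    // Returns the list itself if it supports fast indexed access,
    // otherwise a copy in an ArrayList (which does).
    static <T> List<T> ensureRandomAccess(List<T> list) {
        return (list instanceof RandomAccess) ? list : new ArrayList<>(list);
    }

    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>(Arrays.asList(3, 1, 2));
        List<Integer> array = new ArrayList<>(linked);

        System.out.println(linked instanceof RandomAccess); // false
        System.out.println(array instanceof RandomAccess);  // true

        // The LinkedList is copied, the ArrayList is returned as-is.
        System.out.println(ensureRandomAccess(linked) == linked); // false
        System.out.println(ensureRandomAccess(array) == array);   // true
    }
}
```

Note that a random-access list only fixes the search; inserting into the middle of an ArrayList still shifts the elements after the insertion point.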

0

Phil Oct 18 '10 at 4:26


One simple approach to the progress bar is this.

You can fix the number of calls that update the progress, regardless of the number of items, by using mod. For example,

    public void run(int total) {
        // Guard against total < 10, where total / 10 would be 0 and the
        // modulo below would throw an ArithmeticException.
        int updateInterval = Math.max(1, total / 10);
        System.out.println("interval = " + updateInterval);
        for (int i = 0; i < total; i++) {
            if (i % updateInterval == 0) {
                printProgress((float) i / total * 100f);
            }
            // do task here
        }
    }

    private void printProgress(float value) {
        System.out.println(value + "%");
    }

This will update the progress bar about 10 times (or 9? Check the boundary conditions), whether the size is 10 or 10 million.

This is just an example, adjust the values accordingly.

0

Adrian m Oct 18 '10 at 9:44


Stephen c · Accepted Answer · 2010-10-18T03:38:19+0000

You want to be careful here. You have decided to use an algorithm that incrementally builds a sorted data structure so that (I assume) you can display a progress bar. However, in doing so you may have chosen a sorting method that is significantly slower than the optimal one. (Both sorts will be O(N log N), but there is more to performance than big-O behavior ...)

If you are concerned that this might be a problem, compare the time it takes to sort a typical collection using TreeMap versus Collections.sort. The latter works by copying the input collection into an array, sorting the array and copying it back. (It works best if the input collection is an ArrayList. If you do not need the result as a mutable collection, you can avoid the final copy back by using Collection.toArray, Arrays.sort and Arrays.asList.)
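The toArray / Arrays.sort / Arrays.asList route could look roughly like the sketch below. The class and method names are invented for illustration. The returned list is a fixed-size view backed by the sorted array, so elements cannot be added or removed, and the input collection is left untouched.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: sorts a copy of the input without copying the
// result back into a new collection.
class ArraySortSketch {

    static <T> List<T> sortedCopy(Collection<? extends T> input,
                                  Comparator<? super T> cmp) {
        // toArray() returns a fresh Object[] that the collection does not keep
        // a reference to. The cast is unchecked but safe here because the
        // array never escapes as a T[].
        @SuppressWarnings("unchecked")
        T[] array = (T[]) input.toArray();
        Arrays.sort(array, cmp);
        // Fixed-size List view over the sorted array; no further copy.
        return Arrays.asList(array);
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("pear", "apple", "fig");
        List<String> sorted = sortedCopy(items, Comparator.<String>naturalOrder());
        System.out.println(sorted); // [apple, fig, pear]
        System.out.println(items);  // [pear, apple, fig]
    }
}
```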

An alternative idea would be to use a Comparator object that tracks the number of times it has been called, and use it to track the progress of the sort. You can use the fact that the comparator is usually called about N*log(N) times, although you may need to calibrate it against the actual algorithm used ¹.

By the way, counting calls to the comparator will give you a better indication of progress than you get by counting the inserts. The rate of progress will not appear to slow down as you approach the end of the sort.

(You will have different threads reading and writing the counter, so you need to consider synchronization. Declaring the counter as volatile will work, at the cost of additional memory traffic. You could also just ignore the problem if you are happy for the progress bar to sometimes show stale values, depending on your platform, etc.)
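A minimal sketch of such a counting comparator follows. The class names are made up, and the N*log2(N) estimate is an assumption that would need calibrating against the real sort. It uses an AtomicLong rather than the volatile field mentioned above, which is a swapped-in choice that makes the cross-thread reads safe without further thought.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper (not a JDK class) that counts compare() calls.
class CountingComparator<T> implements Comparator<T> {
    private final Comparator<? super T> delegate;
    private final AtomicLong calls = new AtomicLong();
    private final long expectedCalls;

    CountingComparator(Comparator<? super T> delegate, int n) {
        this.delegate = delegate;
        // Assumption: roughly n * log2(n) comparisons; calibrate as needed.
        double log2n = Math.log(Math.max(n, 2)) / Math.log(2);
        this.expectedCalls = Math.max(1, (long) (n * log2n));
    }

    @Override
    public int compare(T a, T b) {
        calls.incrementAndGet();
        return delegate.compare(a, b);
    }

    // May be called from another thread, e.g. a timer that updates the bar.
    double fractionDone() {
        return Math.min(1.0, (double) calls.get() / expectedCalls);
    }

    long calls() {
        return calls.get();
    }
}

class CountingComparatorDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> items = new ArrayList<>();
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            items.add(random.nextInt());
        }
        CountingComparator<Integer> cmp =
            new CountingComparator<>(Comparator.<Integer>naturalOrder(), items.size());

        // Sort on a worker thread and poll the counter from this one.
        Thread sorter = new Thread(() -> Collections.sort(items, cmp));
        sorter.start();
        while (sorter.isAlive()) {
            System.out.printf("%.0f%%%n", cmp.fractionDone() * 100);
            sorter.join(100);
        }
        System.out.println("done after " + cmp.calls() + " comparisons");
    }
}
```

Because the estimate can overshoot or undershoot, fractionDone() is capped at 1.0, and the caller should still set the bar to 100% explicitly once the sort returns.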

¹ The problem with this: there are some algorithms in which the number of comparisons can vary greatly depending on the initial order of the data. For such an algorithm, there is no way to calibrate the counter so that it works in "non-average" cases.
