Approximate sorting of an array / vector, with predictable runtime

Background:

I need to process several hundred thousand events (producing results) within a tight time budget. The clock is literally ticking, and when the timer fires, that quantum is over and whatever has been finished by that point is what counts.

Whatever is not ready by then is either discarded (depending on its importance metric) or carried over to the next time slice (with its importance bumped, i.e., a constant is added to the importance metric). Ideally the CPU is much faster than necessary and the whole batch is finished long before the end of the time slice. Unfortunately the world is rarely perfect, and "hundreds of thousands" turn into "tens of millions" before you know it.

Events are appended to the back of the queue (which is really a vector) as they arrive and are consumed from the front during the following quantum (so the program always processes the previous quantum's input).

However, not all events are equally important. When the available time is not enough, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events would be copied over to the next quantum's queue anyway, but doing so adds extra load, so it is not an ideal solution).

The obvious thing to use would, of course, be a priority queue / heap. Unfortunately, heapifying 100k elements is not exactly a free operation (nor a parallel one), after which the objects sit in non-obvious, not necessarily cache-friendly memory locations, and pulling elements out of a priority queue parallelizes poorly.
What I would really like is a vector that is sorted, or at least "somewhat approximately sorted", which I can then consume sequentially. That would trivially let me spawn, say, 12 threads (or any other number, one per core), each of which grabs, say, 1/64 of the range (or some other chunk size) at a time, slowly working from front to back and finally discarding / postponing the rest, which would be the low-importance events that can be dropped.

Just sorting the entire range with std::sort would be the simplest and most straightforward solution. However, the time spent sorting the items eats into the time available for actually processing them within the fixed budget, sorting time is essentially single-core time (and parallel sorts do not scale that well either), and producing a perfect ordering (which is not actually needed) risks hitting the worst-case complexity, whereas an approximate sort should ideally run close to its best case and have a very predictable cost.

TL;DR

So, I'm looking for a way to sort an array / vector only approximately, but quickly and with predictable (or guaranteed) runtime.

The sort key is a small integer, typically between 10 and 1000. When an event is postponed to the next quantum, its key is increased ("priority boost") by a smallish amount, e.g. 100 or 200.

In another question, someone suggested doing an approximate sort with shell sort. Judging from various sorting-demo applets, it seems that, at least for the typical random shuffles used there, shell sort can indeed produce an "approximately sorted" result that does not look too bad after 3-4 passes over the data (and it at least reads the data more or less sequentially). Unfortunately, choosing gap values that work well seems to be something of a black art, and runtime estimates seem to involve a fair amount of crystal-ball gazing too.
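For concreteness, a truncated shell sort of the kind I have in mind might look roughly like this (the Event type and the gap values are placeholders I made up, and the comparison assumes that a larger key means more important):

    #include <cstddef>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    // Run only a few gapped insertion-sort passes instead of a full shell sort.
    // The result is not fully sorted, but high-priority events drift toward the
    // front and low-priority ones toward the back, at a fixed, predictable cost.
    void rough_shell_sort(std::vector<Event>& v)
    {
        for (std::size_t gap : {701u, 57u, 10u})      // arbitrary gaps, deliberately never reaching 1
        {
            if (gap >= v.size()) continue;
            for (std::size_t i = gap; i < v.size(); ++i)
            {
                Event tmp = v[i];
                std::size_t j = i;
                for (; j >= gap && v[j - gap].priority < tmp.priority; j -= gap)
                    v[j] = v[j - gap];                // shift less important events back
                v[j] = tmp;
            }
        }
    }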

A comb sort with a relatively large shrink factor (say, 2 or 3?) seems tempting, since it walks through memory strictly sequentially (at both of its read positions) and can move elements over large distances quickly. Again, judging from the sorting-demo applets, 3-4 passes already seem to give a fairly reasonable "rough sort".
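And the comb-sort variant as I picture it, again only a sketch: a fixed number of passes with shrink factor 2, stopping long before the gap reaches 1 (same made-up Event type and "larger key = more important" assumption as above):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    // A few comb-sort passes: two read positions walk the data strictly
    // sequentially, `gap` apart, and each pass halves the gap. Stopping after a
    // handful of passes gives a rough ordering at a predictable cost.
    void rough_comb_sort(std::vector<Event>& v, int passes = 4)
    {
        std::size_t gap = v.size() / 2;
        for (int p = 0; p < passes && gap > 0; ++p, gap /= 2)
            for (std::size_t i = 0; i + gap < v.size(); ++i)
                if (v[i].priority < v[i + gap].priority)
                    std::swap(v[i], v[i + gap]);      // pull the more important event forward
    }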

An MSD radix sort also comes to mind, although I am not sure how it would perform on the typical 16/32-bit integers here, where most of the most significant bits are zero. Could one perhaps do an initial pass to find the highest set bit over the whole set, and then 2-3 actual sorting passes?
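A sketch of what I mean, assuming 32-bit keys: one pass to find the highest bit actually used by any key, followed by a single counting-sort pass over the top few bits below it, which already yields a coarse ordering (same assumptions as above; the bucket width is picked arbitrarily):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    // One MSD-style pass: bucket the events by the top BITS bits actually in use,
    // emitting buckets from the highest down so important events land at the front.
    std::vector<Event> rough_msd_pass(const std::vector<Event>& in)
    {
        constexpr int BITS = 6;                                   // 64 buckets
        constexpr std::uint32_t MASK = (1u << BITS) - 1;

        std::uint32_t all = 0;                                    // OR of all keys
        for (const Event& e : in) all |= static_cast<std::uint32_t>(e.priority);
        int top = 31;
        while (top > 0 && !(all & (1u << top))) --top;            // highest set bit
        int shift = (top + 1 > BITS) ? top + 1 - BITS : 0;

        std::vector<std::size_t> count(1u << BITS, 0);
        for (const Event& e : in)
            ++count[(static_cast<std::uint32_t>(e.priority) >> shift) & MASK];

        std::vector<std::size_t> offset(1u << BITS, 0);           // highest bucket first
        std::size_t running = 0;
        for (int b = (1 << BITS) - 1; b >= 0; --b) { offset[b] = running; running += count[b]; }

        std::vector<Event> out(in.size());
        for (const Event& e : in)
            out[offset[(static_cast<std::uint32_t>(e.priority) >> shift) & MASK]++] = e;
        return out;
    }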

Is there a better algorithm or a known working approach with one of the algorithms described above?

+6
6 answers

What comes to mind is to iterate over the vector and, whenever an event is less important, not process it but set it aside. Once the entire vector has been swept, look at the events that were set aside. Of course you can use several buckets with different priorities. And store only references there; you do not want to shuffle megabytes of data around. (Posted as an answer now, at Damon's request.)
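A minimal sketch of that idea, assuming a made-up Event type, a caller-supplied process() callable, and that a larger key means more important:

    #include <cstddef>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    template <class Process>
    void sweep(std::vector<Event>& events, int threshold, Process process)
    {
        std::vector<std::size_t> deferred;            // indices only, no event data is moved
        deferred.reserve(events.size());

        for (std::size_t i = 0; i < events.size(); ++i)
            if (events[i].priority >= threshold)
                process(events[i]);                   // important: handle immediately
            else
                deferred.push_back(i);                // less important: set aside

        for (std::size_t i : deferred)                // after the sweep, if time remains
            process(events[i]);
    }

Several index buckets, one per priority band, would work the same way.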

+3

Use a separate vector for each priority. Then you do not need to sort them.
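A sketch of that, with the number of priority levels and the key-to-level mapping made up for illustration:

    #include <algorithm>
    #include <array>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    constexpr int kLevels = 8;                        // assumed number of priority bands

    // Map the raw key (roughly 10..1000 here, plus boosts) onto a bucket index.
    int level_of(const Event& e) { return std::min(e.priority * kLevels / 1001, kLevels - 1); }

    struct BucketQueue {
        std::array<std::vector<Event>, kLevels> buckets;

        void push(const Event& e) { buckets[level_of(e)].push_back(e); }

        // Walk the buckets from most to least important; nothing is ever sorted.
        template <class Process>
        void drain(Process process) {
            for (int b = kLevels - 1; b >= 0; --b)
                for (Event& e : buckets[b])
                    process(e);
        }
    };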

+3

This sounds like a good example of a case where near-sorting algorithms can be useful.

About a decade ago, Chazelle developed a nice data structure that works somewhat like a heap. The main difference is the time complexity: it offers constant amortized time for all the important operations, e.g. insert, delete, find the smallest element, and so on.

The trick of this data structure is that it breaks the O(n log n) complexity barrier by tolerating a certain amount of error in the sort order.

To me, that sounds pretty much like what you need. The data structure is called a soft heap and is explained on Wikipedia:

https://en.wikipedia.org/wiki/Soft_heap

There are other algorithms that trade a certain amount of error for speed. You will find them if you google for near-sorting algorithms.

If you try this approach, please give some feedback on how it works out in practice; I would really like to hear how the algorithm behaves in the real world.

+2

It sounds like you want to use std::partition : move the part that interests you to the front and the rest to the back. Its complexity is O(n), and it is cache-friendly, so it will probably be much faster than sorting.
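A sketch of that approach, assuming a priority cutoff that separates "must process" from "may drop or postpone" (the Event type and the cutoff are placeholders):

    #include <algorithm>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    void split_quantum(std::vector<Event>& events, int cutoff)
    {
        // Single O(N), cache-friendly pass: everything at least as important as
        // `cutoff` ends up at the front, in arbitrary order within each half.
        auto mid = std::partition(events.begin(), events.end(),
                                  [cutoff](const Event& e) { return e.priority >= cutoff; });

        // hand [events.begin(), mid) to the workers;
        // postpone or drop [mid, events.end())
        (void)mid;  // silence the unused-variable warning in this sketch
    }

Compared with the nth_element suggestion in another answer, partition takes a value cutoff rather than an element count.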

+1

If you have a bounded "bandwidth" in event processing (say, 128K events per time quantum), you can use std::nth_element to select the 128K most promising events (minus some percentage lost to the selection itself), assuming you have an operator< that compares priorities, in O(N) time. You then process those in parallel, and when you are done, you reprioritize the rest (again in O(N) time).

    std::vector<Event> events;
    std::size_t const guaranteed_bandwidth = 1u << 17;   // 128K events can always be processed

    if (events.size() <= guaranteed_bandwidth) {
        // let all N workers loose on the [begin(events), end(events)) range
    } else {
        // fraction of the bandwidth left after paying for the selection itself (assumed value)
        double const loss_from_nth_element = 0.95;
        auto const nth = static_cast<std::ptrdiff_t>(guaranteed_bandwidth * loss_from_nth_element);

        // assumes operator< is defined so that the most promising events compare smallest
        std::nth_element(begin(events), begin(events) + nth, end(events));

        // let all N workers loose on the [begin(events), begin(events) + nth) range
        // reprioritize the [begin(events) + nth, end(events)) range and append it to
        // the event queue for the next time quantum
    }

This guarantees that if you hit your bandwidth threshold, you process the most valuable elements first. You can even speed up nth_element with a poor man's parallelization (for example, let each of the N workers compute the M * 128K / N best elements of its own chunk in parallel, for some small M, then concatenate the results and run another nth_element over those M * 128K elements).
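A sketch of that poor man's parallelization (the Event type, M, and the chunking are illustrative assumptions; per-chunk selection can in principle miss a few of the true global top K, which is the accepted trade-off):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Event {
        int priority;
        // orders "more important" first, as assumed above
        bool operator<(const Event& o) const { return priority > o.priority; }
    };

    // Each of N workers runs nth_element on its own chunk to pull its local
    // M*K/N best candidates to the front; a final nth_element over the gathered
    // candidates then selects (approximately) the K best overall.
    std::vector<Event> top_k(std::vector<Event>& events, std::size_t K,
                             std::size_t N, std::size_t M = 2)
    {
        if (events.size() <= K) return events;        // everything fits in the budget

        std::size_t const chunk = events.size() / N;
        std::size_t const local = std::min(M * K / N, chunk);

        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < N; ++w)
            workers.emplace_back([&, w] {
                auto first = events.begin() + w * chunk;
                auto last  = (w + 1 == N) ? events.end() : first + chunk;
                std::nth_element(first, first + local, last);
            });
        for (auto& t : workers) t.join();

        std::vector<Event> candidates;
        candidates.reserve(N * local);
        for (std::size_t w = 0; w < N; ++w) {
            auto first = events.begin() + w * chunk;
            candidates.insert(candidates.end(), first, first + local);
        }
        std::nth_element(candidates.begin(),
                         candidates.begin() + std::min(K, candidates.size()),
                         candidates.end());
        candidates.resize(std::min(K, candidates.size()));
        return candidates;
    }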

The only weakness is that if your system is really overloaded (billions of events, possibly due to some DoS attack), it may take more than the whole quantum just to run nth_element (even the quasi-parallelized version), and you end up processing nothing at all. But if processing a single event is much more expensive (say, several thousand cycles) than a priority comparison (say, a dozen cycles), this should not happen under regular loads.

NOTE: for performance reasons it is of course better to sort pointers / indices into the main event vector; this is not shown for brevity.

+1

If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing each worker does is run your fast approximate sort of choice on its own part of the array. Then each one can process its part in order: highest-priority items first, which is also very cache-friendly. This way you never try to sort the whole array, or even approximately sort the whole array, and what little sorting there is, is completely parallelized. Sorting 10 pieces separately is much cheaper than sorting the whole thing.

This works best if the priorities of the items to be processed are randomly distributed. If there is some ordering to them, you can end up with one thread flooded with high-priority items while another is starved of them.
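A sketch of that layout, with plain std::sort standing in for whatever approximate per-chunk sort is chosen, and assuming process() is safe to call from multiple threads (both are placeholders):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Event { int priority; /* payload omitted */ };

    template <class Process>
    void run_quantum(std::vector<Event>& events, std::size_t n_workers, Process process)
    {
        std::size_t const chunk = (events.size() + n_workers - 1) / n_workers;
        std::vector<std::thread> workers;

        for (std::size_t w = 0; w < n_workers; ++w)
            workers.emplace_back([&, w] {
                std::size_t const lo = std::min(w * chunk, events.size());
                std::size_t const hi = std::min(lo + chunk, events.size());
                // Sort only this worker's slice, most important first; swap in an
                // approximate sort here if even this turns out to be too expensive.
                std::sort(events.begin() + lo, events.begin() + hi,
                          [](const Event& a, const Event& b) { return a.priority > b.priority; });
                for (std::size_t i = lo; i < hi; ++i)
                    process(events[i]);               // this slice's highest priorities first
            });

        for (auto& t : workers) t.join();
    }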

0
