Background:
I need to process several hundred thousand events (producing results) under a tight time limit. The clock is literally ticking: when the timer fires, whatever has been processed by that point gets delivered.
Whatever is not done by then is either discarded (depending on an importance metric) or carried over into the next time slice (with increased importance, i.e. a constant is added to the importance metric). Ideally, the processor is much faster than necessary and the entire batch is finished long before the end of the time slice. Unfortunately, the world is rarely perfect, and "hundreds of thousands" turn into "tens of millions" before you know it.
Events are appended to the back of a queue (which is really a vector) as they arrive and are processed from the front during the following quantum (so the program always works on the previous quantum's input).
However, not all events are equally important. When the available time is not enough, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied into the next quantum's queue anyway, but doing so adds extra load, so it is not an ideal solution).
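To make the setup above concrete, here is a minimal sketch of the double-buffered quantum queue as I understand it. All names (`Event`, `QuantumQueue`, `next_quantum`) are mine, purely for illustration:

```cpp
#include <utility>
#include <vector>

struct Event {
    int importance;  // small integer key, typically 10..1000
    // ... payload ...
};

struct QuantumQueue {
    std::vector<Event> incoming;  // events arriving during the current quantum
    std::vector<Event> current;   // the previous quantum's input, processed now

    // Called when the timer fires: whatever is left unprocessed in `current`
    // is either dropped or carried over with a priority boost.
    void next_quantum(int boost, int drop_below) {
        for (Event& e : current) {
            if (e.importance >= drop_below) {
                e.importance += boost;    // "priority boost" for being deferred
                incoming.push_back(e);
            }
        }
        current.clear();
        std::swap(current, incoming);     // last quantum's input becomes work
    }
};
```

The swap keeps both vectors' capacity, so in the steady state no allocation happens per quantum.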
The obvious thing to use would, of course, be a priority queue / heap. Unfortunately, heapifying 100k elements is not exactly a free operation (nor a parallel one), I then end up with objects at non-obvious and not necessarily cache-friendly memory locations, and pulling elements out of a priority queue parallelizes poorly.
What I would really like is a vector that is sorted, or at least "somewhat approximately sorted", which I can then consume strictly sequentially. That would trivially let me create, say, 12 threads (or any other number, one per core), each of which grabs, say, 1/64th of the range (or some other chunk size) at a time, slowly working from front to back and eventually discarding / deferring the rest - which by then would be the small events that can be dropped.
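The chunk-grabbing scheme could be sketched as below: worker threads claim fixed-size chunks of the (roughly sorted) vector through an atomic cursor. This is a hypothetical illustration; the thread count and chunk size are placeholders, not tuned values:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each worker repeatedly claims the next chunk via fetch_add and processes it
// front to back; since the vector is approximately sorted by importance,
// earlier chunks contain the more important events.
void process_all(std::vector<int>& events, unsigned num_threads,
                 std::size_t chunk, const std::function<void(int)>& handle) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunk);
                if (begin >= events.size()) return;
                std::size_t end = std::min(begin + chunk, events.size());
                for (std::size_t i = begin; i < end; ++i)
                    handle(events[i]);
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

To stop at the end of the time budget, the loop body would additionally check a deadline flag before claiming the next chunk; everything past the cursor at that point is the "discard / defer" remainder.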
Simply sorting the entire range with std::sort would be the easiest and most straightforward solution. However, the time spent sorting reduces the time left for actually processing the items within the fixed budget, and sorting time is basically single-core time (and parallel sorts don't scale that well either). Besides, producing a perfect ordering (which isn't really needed) may run into the worst-case complexity, whereas an approximate sort should ideally hit its optimal case and have a very predictable cost.
TL;DR
So, I'm looking for a way to sort an array / vector only approximately, but quickly and with a predictable (or guaranteed) runtime.
The sort key will be a small integer, typically between 10 and 1000. Being deferred to the next quantum may increase it ("priority boost") by a small amount, e.g. 100 or 200.
In another question, someone suggested producing an approximate ordering with Shell sort. On various sorting-demo applets it seems that, at least for the typical random shuffles used there, Shell sort can indeed produce an "approximately sorted" result that doesn't look too bad after 3-4 passes over the data (and the access pattern is at least strictly strided). Unfortunately, choosing gap values that work well seems to be a black art, and runtime estimates seem to involve a fair amount of crystal-ball gazing as well.
Comb sort with a relatively large shrink factor (e.g. 2 or 3?) seems tempting, since it visits memory strictly sequentially (in two streams) and can move elements over long distances quickly. Again, judging by the sorting-demo applets, it seems that 3-4 passes already give a fairly reasonable "rough sort".
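A pass-limited comb sort could be sketched like this (the shrink factor and pass count are the open questions from above, so they are parameters here):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A fixed number of comb-sort passes: each pass streams through memory once,
// comparing a[i] with a[i + gap] and swapping out-of-order pairs, then the
// gap shrinks by `shrink`. Few passes => approximate order, bounded cost.
void approx_comb_sort(std::vector<int>& a, std::size_t shrink, int passes) {
    if (a.empty() || shrink < 2) return;
    std::size_t gap = a.size() / shrink;
    if (gap == 0) gap = 1;
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i + gap < a.size(); ++i)
            if (a[i] > a[i + gap])
                std::swap(a[i], a[i + gap]);
        gap /= shrink;
        if (gap == 0) gap = 1;  // further passes degrade to adjacent swaps
    }
}
```

With 3-4 passes the cost is exactly 3-4 linear scans regardless of input order, which is the predictability the question asks for.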
MSD radix sort comes to mind, although I'm not sure how it would perform on the typical 16/32-bit integers here, in which most of the most significant bits are zero. Could one perhaps do an initial pass to find the highest set bit in the whole set, followed by 2-3 actual sorting passes?
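One way that idea might look as a single counting pass, sketched under the assumption that keys are unsigned: scan once to find the highest set bit, then bucket by the top k significant bits. Buckets come out in order while elements within a bucket stay unsorted, which is precisely an "approximate sort" with fixed cost:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One MSD counting pass over the top k significant bits, after skipping the
// leading zero bits shared by all keys. Stable within each bucket.
std::vector<std::uint32_t>
approx_msd_pass(const std::vector<std::uint32_t>& a, int k) {
    if (a.empty()) return {};
    std::uint32_t all = 0;
    for (std::uint32_t v : a) all |= v;          // scan: find highest set bit
    int top = 31;
    while (top > 0 && !(all >> top)) --top;
    int shift = (top + 1 >= k) ? top + 1 - k : 0;  // keep only the top k bits

    std::vector<std::size_t> count(std::size_t(1) << k, 0);
    for (std::uint32_t v : a) ++count[v >> shift];
    std::size_t pos = 0;                          // prefix sums: bucket starts
    for (std::size_t& c : count) { std::size_t n = c; c = pos; pos += n; }

    std::vector<std::uint32_t> out(a.size());
    for (std::uint32_t v : a) out[count[v >> shift]++] = v;
    return out;
}
```

With keys in the 10..1000 range the first scan would find bit 9 or 10 as the highest, so a single 8-bit pass (k = 8) would already separate the keys quite finely.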
Is there a better algorithm, or a known working approach using one of the algorithms described above?