A bit late, but the fastest way I found is to sort the insert . Reducing the complexity and divergence of shaders is key. Bitonic and bubble work well for small numbers. When you get up at around 100, switch to sorting.
Since you know the number of things to sort (9), sorting the net is the best choice. You can use this handy tool to generate it ...
There are 27 comparators in this network, grouped into 11 parallel operations. [[0,1],[2,3],[4,5],[7,8]] [[0,2],[1,3],[6,8]] [[1,2],[6,7],[5,8]] [[4,7],[3,8]] [[4,6],[5,7]] [[5,6],[2,7]] [[0,5],[1,6],[3,7]] [[0,4],[1,5],[3,6]] [[1,4],[2,5]] [[2,4],[3,5]] [[3,4]]
A convenient way to use this declaration is to declare a comparison and replacement macro ...
Combining both approaches, sorting the network to quickly sort small blocks of data, and then merging the sort if you have many blocks works very well, as described in Quick Sort for Exact OIT complex scenes (disclaimer: I'm the author). Scrolling loops (essentially creating a sorting network) can be especially useful, which allows you to sort the registers. Dynamically indexed arrays are placed in local memory slowly. To prevent the compiler from doing this, you can manually declare vec4 array0, array1 ... Macros can concatenate text, which is useful here #define CMP(a, b) (array##a < array##b) . The example is pretty ugly but quick here .