Proposed approach
Let's bring some NumPy magic to the table! We will use np.maximum.accumulate.
Explanation
To see how maximum.accumulate can help us, suppose the groups are laid out one after another, that is, the groupby column is already sorted.
Consider this sample groupby column:
groupby column : [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
And this sample value column:
value column : [3, 1, 4, 1, 3, 3, 1, 5, 2, 4]
Applying maximum.accumulate directly to value will not give us the desired result, since the accumulation must restart within each group. One trick to achieve this is to offset each group from the one before it, so that every value in a later group is larger than every value in an earlier group.
There are several ways to do this offsetting. One simple way is to shift each group by an amount that exceeds the previous group's shift by max value + 1. For the sample, the max value is 5, so this step is 6. We add 0 to the first group, 6 to the second, 12 to the third, and so on. The modified value column becomes:
value column : [3, 1, 4, 7, 9, 15, 13, 17, 14, 16]
Now we can use maximum.accumulate, and the accumulations will stay confined within each group:
value cummaxed: [3, 3, 4, 7, 9, 15, 15, 17, 17, 17]
To return to the original values, subtract the offsets that were added earlier.
value cummaxed: [3, 3, 4, 1, 3, 3, 3, 5, 5, 5]
This is our desired result!
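The walkthrough above can be reproduced step by step. This is only a minimal sketch on the sample data; the variable names (step, shifts, shifted) are ours and chosen for illustration, and it assumes the groupby column is already sorted and holds the integer labels 0, 1, 2, ...

```python
import numpy as np

groupby = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
value = np.array([3, 1, 4, 1, 3, 3, 1, 5, 2, 4])

# Each group is shifted by (max value + 1) more than the previous one.
step = value.max() + 1            # 6 for this sample
shifts = groupby * step           # 0 for group 0, 6 for group 1, 12 for group 2

# After shifting, every value of a later group exceeds every value of an earlier one.
shifted = value + shifts

# The running maximum can therefore never carry over from one group to the next.
cummaxed_shifted = np.maximum.accumulate(shifted)

# Subtract the shifts to get back to the original scale.
result = cummaxed_shifted - shifts
print(result.tolist())
```

The intermediate arrays shifted and cummaxed_shifted correspond to the two intermediate listings shown above.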
At the start, we assumed that the groups were laid out sequentially. To get the data into that format, we use np.argsort(groupby,kind='mergesort') to obtain sorting indices that preserve the original order among equal group labels (a stable sort), and then use these indices to index into both the groupby and value columns.
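As a small illustration of the sorting step, here is a sketch on made-up data (the arrays below are not from the question). The last three lines, which undo the sort with an inverse permutation, are our assumption about how the result would be mapped back to the original row order.

```python
import numpy as np

groupby = np.array([2, 0, 1, 0, 2, 1])
value = np.array([5, 3, 1, 4, 2, 6])

# Stable sort: rows with the same group label keep their relative order.
sidx = np.argsort(groupby, kind='mergesort')
sorted_groupby = groupby[sidx]    # groups are now contiguous
sorted_value = value[sidx]        # values reordered the same way

# Inverse permutation: position of each original row in the sorted order.
inverse = np.empty(sidx.size, dtype=int)
inverse[sidx] = np.arange(sidx.size)
restored = sorted_value[inverse]  # back in the original row order
```

The stable sort matters here because a cumulative maximum depends on the order of rows inside each group.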
To handle negative elements as well, we just need to base the offset on max() - min() rather than on max() alone.
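To see why the range matters, here is a hedged sketch on made-up data, assuming the negative numbers sit in the value column. With a step of max() + 1 the groups are not separated and the running maximum leaks across the group boundary; with max() - min() + 1 it does not.

```python
import numpy as np

groupby = np.array([0, 0, 1, 1])
value = np.array([-1, -5, -4, -2])

# Step based on max() alone: here it is 0, so the groups are not separated.
naive_step = value.max() + 1
naive_shifts = groupby * naive_step
naive = np.maximum.accumulate(value + naive_shifts) - naive_shifts

# Step based on the full range of values: groups can no longer overlap.
range_step = value.max() - value.min() + 1
range_shifts = groupby * range_step
fixed = np.maximum.accumulate(value + range_shifts) - range_shifts

print(naive.tolist())   # group 1 wrongly inherits the maximum of group 0
print(fixed.tolist())   # per-group cumulative maximum
```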
Thus, the implementation will look something like this:
def argsort_unique(idx):
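Only the signature of argsort_unique appears above. As a rough sketch of how the steps described so far could fit together (a reconstruction under assumptions, not the original listing): the body of argsort_unique is assumed to compute an inverse permutation, group_cummax is a hypothetical name, and the relabelling of groups to 0, 1, 2, ... is our addition so that float or negative group labels also work. It assumes non-empty 1D inputs.

```python
import numpy as np

def argsort_unique(idx):
    # Assumed behavior: inverse of the permutation idx.
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def group_cummax(groupby, value):
    # Hypothetical helper; inputs are non-empty 1D arrays of equal length.
    groupby = np.asarray(groupby)
    value = np.asarray(value)

    # Stable sort so that groups become contiguous and keep their row order.
    sidx = np.argsort(groupby, kind='mergesort')
    sorted_groupby = groupby[sidx]
    sorted_value = value[sidx]

    # Relabel groups as 0, 1, 2, ... (our addition, handles float/negative labels).
    starts = np.concatenate(([True], sorted_groupby[1:] != sorted_groupby[:-1]))
    group_ids = np.cumsum(starts) - 1

    # Shift each group by the value range + 1 so that groups cannot overlap.
    step = sorted_value.max() - sorted_value.min() + 1
    shifts = group_ids * step

    # Running maximum on the shifted values, then undo the shift and the sort.
    cummaxed = np.maximum.accumulate(sorted_value + shifts) - shifts
    return cummaxed[argsort_unique(sidx)]

print(group_cummax([2, 0, 1, 0, 2, 1], [5, 3, 1, 4, 2, 6]).tolist())
```

One caveat with this sketch: for float values with many groups, adding large shifts may cost some floating-point precision, so the results are best compared against a straightforward per-group loop on real data.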
Verification and runtime test
Verification
1) Groupby as ints:
In [58]:
2) Groupby as floats:
In [10]:
Timings:
1) Groupby as ints (same setup as used for the timings in the question):
In [24]: LENGTH = 100000
    ...: g = np.random.randint(0,LENGTH//2,(LENGTH))/10.0
    ...: v = np.random.rand(LENGTH)
    ...:

In [25]: %timeit numpy(g, v)
2) Groupby as floats:
In [29]: LENGTH = 100000
    ...: g = np.random.randint(0,LENGTH//2,(LENGTH))/10.0
    ...: v = np.random.rand(LENGTH)
    ...:

In [30]: %timeit pir1(g, v)