I can not understand the output of numpart argty

I am trying to use arpgpartition from numpy, but it seems that something went wrong and I cannot figure it out. Here's what happens:

These are the first 5 elements of a sorted array of norms

 np.sort(norms)[:5] array([ 53.64759445, 54.91434479, 60.11617279, 64.09630585, 64.75318909], dtype=float32) 

But when I use indices_sorted = np.argpartition(norms, 5)[:5]

 norms[indices_sorted] array([ 60.11617279, 64.09630585, 53.64759445, 54.91434479, 64.75318909], dtype=float32) 

When do I think I should get the same result as a sorted array?

It works fine when I use 3 as the parameter indices_sorted = np.argpartition(norms, 3)[:3]

 norms[indices_sorted] array([ 53.64759445, 54.91434479, 60.11617279], dtype=float32) 

This doesn't make much sense to me, hoping someone can offer some insight?

EDIT: rephrase this question as an argument that preserves the order of k partitioned elements makes sense.

+15
python arrays numpy
source share
2 answers

We need to use a list of indexes that should be stored in sorted order, and not feed kth param as a scalar. So, to keep the sorted character by the first 5 elements instead of np.argpartition(a,5)[:5] , just do -

 np.argpartition(a,range(5))[:5] 

Here is an example to make everything clear -

 In [84]: a = np.random.rand(10) In [85]: a Out[85]: array([ 0.85017222, 0.19406266, 0.7879974 , 0.40444978, 0.46057793, 0.51428578, 0.03419694, 0.47708 , 0.73924536, 0.14437159]) In [86]: a[np.argpartition(a,5)[:5]] Out[86]: array([ 0.19406266, 0.14437159, 0.03419694, 0.40444978, 0.46057793]) In [87]: a[np.argpartition(a,range(5))[:5]] Out[87]: array([ 0.03419694, 0.14437159, 0.19406266, 0.40444978, 0.46057793]) 

Note that argpartition makes sense in terms of performance if we want to get sorted indices for a small subset of elements, say, k number of elements, which is a small part of the total number of elements.

Letโ€™s use a larger dataset and try to get sorted indices for all elements to make the above point clear -

 In [51]: a = np.random.rand(10000)*100 In [52]: %timeit np.argpartition(a,range(a.size-1))[:5] 10 loops, best of 3: 105 ms per loop In [53]: %timeit a.argsort() 1000 loops, best of 3: 893 ยตs per loop 

Thus, to sort all elements, np.argpartition not suitable.

Now, let's say I want to get sorted indexes only for the first 5 elements with this large data set, and also keep order for them -

 In [68]: a = np.random.rand(10000)*100 In [69]: np.argpartition(a,range(5))[:5] Out[69]: array([1647, 942, 2167, 1371, 2571]) In [70]: a.argsort()[:5] Out[70]: array([1647, 942, 2167, 1371, 2571]) In [71]: %timeit np.argpartition(a,range(5))[:5] 10000 loops, best of 3: 112 ยตs per loop In [72]: %timeit a.argsort()[:5] 1000 loops, best of 3: 888 ยตs per loop 

Very helpful here!

+16
source share

Given the problem of incorrectly sorting a subset (top k, top means the first in sort order), there are two built-in solutions: argsort and argpartition cf. @Divakar answer.

However, if performance is a factor, then it may (depending on the size of the data and a subset of interest) be worth resisting the "bait of one line" by argsort another line and applying argsort to the output of argpartition :

 >>> def top_k_sort(a, k): ... return np.argsort(a)[:k] ... >>> def top_k_argp(a, k): ... return np.argpartition(a, range(k))[:k] ... >>> def top_k_hybrid(a, k): ... b = np.argpartition(a, k)[:k] ... return b[np.argsort(a[b])] >>> k = 100 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k}) 8.348663672804832 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k}) 9.869098862167448 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k}) 1.2305558240041137 

argsort is O (n log n), argpartition with a range argument is O (nk) (?), and argpartition + argsort is O (n + k log k)

Therefore, in an interesting mode n >> k >> 1, it is expected that the hybrid method will be the fastest

UPDATE: ND version:

 import numpy as np from timeit import timeit def top_k_sort(A,k,axis=-1): return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] def top_k_partition(A,k,axis=-1): return A.argpartition(range(k),axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] def top_k_hybrid(A,k,axis=-1): B = A.argpartition(k,axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] return np.take_along_axis(B,np.take_along_axis(A,B,axis).argsort(axis),axis) A = np.random.random((100,10000)) k = 100 from timeit import timeit for f in globals().copy(): if f.startswith("top_"): print(f, timeit(f"{f}(A,k)",globals=globals(),number=10)*100) 

Run Example:

 top_k_sort 63.72379460372031 top_k_partition 99.30561298970133 top_k_hybrid 10.714635509066284 
+5
source share

All Articles