You can index a CSR matrix directly with a list of row indices. First we create a matrix and look at it:
>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>' with 5 stored elements in Compressed Sparse Row format>
>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]
Of course, we can take a look at just the first row:
>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>' with 1 stored elements in Compressed Sparse Row format>
>>> print(m[0].toarray())
[[0 0 1 0]]
But we can also look at the first and third rows at once, using the list [0,2] as an index:
>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>' with 3 stored elements in Compressed Sparse Row format>
>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
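Since the random indices used further down come back as a NumPy array rather than a list, here is a small sketch showing that an integer array selects rows in the same way. The variable names `rows` and `sub` are only illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])

# A NumPy integer array selects rows the same way a plain list does.
rows = np.array([0, 2])
sub = m[rows]

print(sub.shape)      # (2, 4)
print(sub.toarray())  # rows 0 and 2 of m, in that order
```

The result is still a sparse matrix, so nothing is densified until you call toarray().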
Now you can generate N random indices without repetition (that is, sampling without replacement) with numpy's choice:
import numpy as np
i = np.random.choice(m.shape[0], N, replace=False)  # N distinct row indices
Then you can grab the rows at these indices from the original matrix m:
sub_m = m[i]
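Putting the two steps together, a minimal sketch might look like the following. The helper name `sample_rows` and its `seed` parameter are my own invention for illustration, and it assumes a NumPy version recent enough to provide `np.random.default_rng`; `np.random.choice` as shown above works just as well:

```python
import numpy as np
from scipy.sparse import csr_matrix


def sample_rows(matrix, n, seed=None):
    """Return n distinct random rows of a sparse matrix and their indices.

    Hypothetical helper, not part of SciPy or NumPy.
    """
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice.
    idx = rng.choice(matrix.shape[0], size=n, replace=False)
    return matrix[idx], idx


m = csr_matrix([[0, 0, 1, 0], [4, 3, 0, 0], [3, 0, 0, 8]])
sub_m, i = sample_rows(m, 2, seed=0)
print(i)
print(sub_m.toarray())
```

Returning the indices alongside the submatrix makes it easy to pick the matching entries out of any parallel data later. Note that n must not exceed the number of rows, otherwise choice raises a ValueError.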
To grab the matching entries from a list of category lists, first convert it to an array, which you can then index with i:
sub_c = np.asarray(categories)[i]
If you want to have a list of lists, just use:
sub_c.tolist()
or, if you really have (or want) a tuple of tuples, I think you need to convert it manually:
tuple(map(tuple, sub_c))
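One caveat: if the category lists have different lengths, np.asarray cannot build a regular 2-D array from them and, depending on the NumPy version, either produces an object array or raises an error. In that case a plain list comprehension is a simple alternative. The `categories` data and the fixed `i` below are made up for illustration:

```python
import numpy as np

# Hypothetical data: one list of category labels per matrix row.
categories = [["a", "b"], ["c"], ["d", "e", "f"]]
i = np.array([2, 0])  # stand-in for the randomly sampled row indices

# Index the Python list directly, so no array conversion is needed.
sub_c = [categories[j] for j in i]
print(sub_c)  # [['d', 'e', 'f'], ['a', 'b']]

# Same conversion as above if a tuple of tuples is required.
as_tuples = tuple(map(tuple, sub_c))
print(as_tuples)  # (('d', 'e', 'f'), ('a', 'b'))
```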