Numpy: is it possible to use numpy and ndarray to replace the loop in this piece of code?

I am looking for a more reasonable and better solution.

I want to apply different scaling factors to a number field based on the contents of the label. Hopefully the following code can show what I'm trying to achieve:

    PS = [('A', 'LABEL1', 20), ('B', 'LABEL2', 15), ('C', 'LABEL3', 120), ('D', 'LABEL1', 3)]
    FACTOR = [('LABEL1', 0.1), ('LABEL2', 0.5), ('LABEL3', 10)]
    d_factor = dict(FACTOR)

    for p in PS:
        # Scale the numeric field by the factor that belongs to the row's label.
        newp = (p[0], p[1], p[2] * d_factor[p[1]])
        print(newp)

This is a very trivial operation, but I need to execute it on a data set of at least one million rows.

So of course, the faster the better.

Factors will be known in advance, and their number will be no more than 20-30.

  • Is there any matrix or linalg trick we can use?

  • Can ndarray take a text value in a cell?

3 answers

If you want to mix data types, you will want structured arrays.

If you need the index of matching values in a lookup array, you want searchsorted.

Your example looks like this:

    >>> import numpy as np
    >>> PS = np.array([
    ...     ('A', 'LABEL1', 20),
    ...     ('B', 'LABEL2', 15),
    ...     ('C', 'LABEL3', 120),
    ...     ('D', 'LABEL1', 3)], dtype='S1,S6,i4')
    >>> FACTOR = np.array([
    ...     ('LABEL1', 0.1),
    ...     ('LABEL2', 0.5),
    ...     ('LABEL3', 10)], dtype='S6,f4')

Your structured arrays:

    >>> PS
    array([('A', 'LABEL1', 20), ('B', 'LABEL2', 15), ('C', 'LABEL3', 120),
           ('D', 'LABEL1', 3)],
          dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
    >>> FACTOR
    array([('LABEL1', 0.10000000149011612), ('LABEL2', 0.5), ('LABEL3', 10.0)],
          dtype=[('f0', '|S6'), ('f1', '<f4')])

And you can access individual fields like this (or you can give them names, see docs):

    >>> FACTOR['f0']
    array(['LABEL1', 'LABEL2', 'LABEL3'],
          dtype='|S6')
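For the "give them names" option, here is a minimal sketch. The field names `name`, `label`, `value` and `factor` are my own illustrative choices, not anything from the question. It assumes a NumPy version that accepts a list of `(name, type)` pairs as a dtype.

```python
import numpy as np

# Field names below are illustrative assumptions, not part of the original post.
ps_dtype = np.dtype([("name", "S1"), ("label", "S6"), ("value", "i4")])
factor_dtype = np.dtype([("label", "S6"), ("factor", "f4")])

PS = np.array(
    [("A", "LABEL1", 20), ("B", "LABEL2", 15), ("C", "LABEL3", 120), ("D", "LABEL1", 3)],
    dtype=ps_dtype,
)
FACTOR = np.array(
    [("LABEL1", 0.1), ("LABEL2", 0.5), ("LABEL3", 10)],
    dtype=factor_dtype,
)

labels = PS["label"]   # same data as PS['f1'] in the unnamed version
values = PS["value"]   # same data as PS['f2'] in the unnamed version
```

Named fields make the later lookup lines easier to read. Note that under Python 3 the `S6` fields come back as bytes rather than str.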

To look up the factor for each row of PS (FACTOR must be sorted by label):

    >>> idx = np.searchsorted(FACTOR['f0'], PS['f1'])
    >>> idx
    array([0, 1, 2, 0])
    >>> FACTOR['f1'][idx]
    array([  0.1,   0.5,  10. ,   0.1], dtype=float32)
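Since searchsorted does a binary search, it only gives correct indices on a sorted lookup array, and it returns an insertion point even for labels that are not in the table at all. A rough sketch of how you might guard against both cases follows. The unsorted table and the error handling are my assumptions for illustration.

```python
import numpy as np

# Deliberately unsorted factor table (assumed input, for illustration only).
FACTOR = np.array([("LABEL3", 10), ("LABEL1", 0.1), ("LABEL2", 0.5)], dtype="S6,f4")
labels = np.array(["LABEL1", "LABEL2", "LABEL3", "LABEL1"], dtype="S6")

# Sort the table by label so that the binary search is valid.
FACTOR = FACTOR[np.argsort(FACTOR["f0"])]

idx = np.searchsorted(FACTOR["f0"], labels)

# A label larger than every key yields idx == len(FACTOR), so clip before indexing,
# then check that each row really matched the key it points at.
idx_clipped = np.minimum(idx, len(FACTOR) - 1)
found = FACTOR["f0"][idx_clipped] == labels
if not found.all():
    missing = np.unique(labels[~found]).tolist()
    raise KeyError("labels missing from FACTOR: {}".format(missing))

factors = FACTOR["f1"][idx]
```

With only 20-30 factors the sort and the check cost next to nothing compared with the million-row lookup.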

Now just create a new array and multiply it:

    >>> newp = PS.copy()
    >>> newp['f2'] *= FACTOR['f1'][idx]
    >>> newp
    array([('A', 'LABEL1', 2), ('B', 'LABEL2', 7), ('C', 'LABEL3', 1200),
           ('D', 'LABEL1', 0)],
          dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
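Note that in the output above the scaled values are truncated to integers (7 instead of 7.5, 0 instead of 0.3) because field f2 is declared as i4. I believe more recent NumPy versions may refuse this in-place float-to-int multiply altogether. A sketch that keeps the fractional part, assuming you are free to store the value field as f8:

```python
import numpy as np

PS = np.array(
    [("A", "LABEL1", 20), ("B", "LABEL2", 15), ("C", "LABEL3", 120), ("D", "LABEL1", 3)],
    dtype="S1,S6,f8",  # f8 instead of i4: an assumption, not the dtype used above
)
FACTOR = np.array(
    [("LABEL1", 0.1), ("LABEL2", 0.5), ("LABEL3", 10)],
    dtype="S6,f8",
)

# FACTOR is already sorted by label, as searchsorted requires.
idx = np.searchsorted(FACTOR["f0"], PS["f1"])

newp = PS.copy()                  # leave the original rows untouched
newp["f2"] *= FACTOR["f1"][idx]   # float * float, so nothing is truncated
```

If the column has to stay an integer, you would need to decide on a rounding rule explicitly rather than rely on the implicit cast.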

If you compare a numpy array with a value, you get a boolean mask marking the matching positions. You can use that mask to apply an operation to all matching elements at once. This is probably not the fastest approach, but it is simple and straightforward. If PS needs to have the structure you are showing, you can define your own dtype and have an Nx3 array.

    import numpy as np

    col1 = np.array(['a', 'b', 'c', 'd'])
    col2 = np.array(['1', '2', '3', '1'])
    col3 = np.array([20., 15., 120., 3.])

    factors = {'1': 0.1, '2': 0.5, '3': 10}

    # One pass per label: the boolean mask col2 == label selects the rows to scale.
    for label, fac in factors.items():
        col3[col2 == label] *= fac

    print(col3)
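If you would rather not loop over the labels at all, one possible variant (my own sketch, not part of the answer above) is to let `np.unique` with `return_inverse=True` turn each label into an integer code and then index a small per-label factor array with it:

```python
import numpy as np

col2 = np.array(["1", "2", "3", "1"])
col3 = np.array([20.0, 15.0, 120.0, 3.0])
factors = {"1": 0.1, "2": 0.5, "3": 10}

# uniq holds the distinct labels in sorted order;
# inverse gives, for every row, the position of its label inside uniq.
uniq, inverse = np.unique(col2, return_inverse=True)

# One factor per distinct label, in the same order as uniq.
per_label = np.array([factors[label] for label in uniq.tolist()])

# Fancy indexing expands the per-label factors to one factor per row.
scaled = col3 * per_label[inverse]
```

The only Python-level loop left runs over the distinct labels (20-30 at most in your case), not over the rows.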

I don't think numpy can help you with this. BTW, it is ndarray, not nparray...

Perhaps you could do this with a generator. See http://www.dabeaz.com/generators/index.html
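As a rough idea of what the generator approach might look like (the function name `scale_rows` is made up for this sketch), the rows are scaled lazily, one at a time, so the full result never has to sit in memory:

```python
def scale_rows(rows, factors):
    """Lazily yield (name, label, scaled_value) for each input row.

    `scale_rows` is a hypothetical helper name used only for this sketch.
    """
    for name, label, value in rows:
        yield (name, label, value * factors[label])


PS = [("A", "LABEL1", 20), ("B", "LABEL2", 15), ("C", "LABEL3", 120), ("D", "LABEL1", 3)]
FACTOR = [("LABEL1", 0.1), ("LABEL2", 0.5), ("LABEL3", 10)]
d_factor = dict(FACTOR)

# Nothing is computed until the generator is consumed.
for row in scale_rows(PS, d_factor):
    print(row)
```

This does not remove the per-row Python overhead, it only avoids building an intermediate list, so it helps with memory more than with speed.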
