It seems likely that dbaupp's answer is correct. But just for the sake of diversity, here is another solution that saves memory. This will work even for operations that do not have a built-in numpy equivalent.
>>> import numpy
>>> values = numpy.array([(x % 2) for x in range(12)], dtype=bool).reshape((4, 3))
>>> weights = numpy.array(range(1, 4))
>>> weights_stretched = numpy.lib.stride_tricks.as_strided(weights, (4, 3), (0, 8))
numpy.lib.stride_tricks.as_strided is a wonderful little function! It lets you specify shape and strides values that allow a small array to mimic a much larger one. Observe: there aren't really four rows here, it just looks that way:
>>> weights_stretched[0][0] = 4
>>> weights_stretched
array([[4, 2, 3],
       [4, 2, 3],
       [4, 2, 3],
       [4, 2, 3]])
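As an aside that is not part of the original answer: the (0, 8) above is the pair of (row stride, column stride) in bytes. A row stride of 0 makes every row alias the same memory, and 8 is the itemsize of a 64-bit integer. Here is a small sketch, assuming that default dtype; reading the column stride off the array itself avoids hard-coding the 8:

>>> weights.strides    # (8,) here, assuming the default 64-bit integer dtype
(8,)
>>> stretched = numpy.lib.stride_tricks.as_strided(
...     weights, shape=(4, 3), strides=(0, weights.strides[0]))
>>> (stretched == weights_stretched).all()    # same simulated 4x3 array, no copy made
True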
So instead of passing a huge array to MaskedArray, you can pass a smaller one. (But as you have already noticed, numpy masking works the opposite way from what you might expect: True masks a value rather than revealing it, so you have to keep your values inverted.) And as you can see, MaskedArray does not copy the data; it just reflects whatever is in weights_stretched:
>>> masked = numpy.ma.MaskedArray(weights_stretched, numpy.logical_not(values))
>>> weights_stretched[0][0] = 1
>>> masked
masked_array(data =
 [[-- 2 --]
 [1 -- 3]
 [-- 2 --]
 [1 -- 3]],
             mask =
 [[ True False  True]
 [False  True False]
 [ True False  True]
 [False  True False]],
       fill_value = 999999)
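(If the inverted mask convention seems surprising, here is a tiny illustration that is not from the original answer: True hides an element, False keeps it.)

>>> # Illustrative only: the masked 10 is excluded, so only 20 and 30 are summed.
>>> demo = numpy.ma.MaskedArray([10, 20, 30], mask=[True, False, False])
>>> demo.sum()
50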
Now we can just pass the masked array to sum:
>>> numpy.sum(masked, axis=1)
masked_array(data = [2 4 2 4],
             mask = [False False False False],
       fill_value = 999999)
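The timings below use two functions, dot1 and dot2, whose definitions are not included in the answer. A plausible reconstruction, assuming dot1 wraps numpy.dot and dot2 applies the stride_tricks/MaskedArray approach just shown (the function bodies, the random test data, and the re-use of the names values and weights are all my assumptions), might look like this:

>>> # Hypothetical benchmark setup: none of this appears in the original answer.
>>> values = numpy.random.rand(1000000, 30) > 0.5   # 1,000,000 x 30 boolean array
>>> weights = numpy.arange(1, 31)                   # one integer weight per column
>>> def dot1(values, weights):
...     # the built-in route: an ordinary matrix-vector product
...     return numpy.dot(values, weights)
...
>>> def dot2(values, weights):
...     # the stride_tricks + MaskedArray route shown above
...     stretched = numpy.lib.stride_tricks.as_strided(
...         weights, shape=values.shape, strides=(0, weights.strides[0]))
...     masked = numpy.ma.MaskedArray(stretched, numpy.logical_not(values))
...     return masked.sum(axis=1)
...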
I benchmarked numpy.dot (dot1) against the approach above (dot2) on a 1,000,000 x 30 array. These are the results on a relatively modern MacBook Pro:
>>> %timeit dot1(values, weights)
1 loops, best of 3: 194 ms per loop
>>> %timeit dot2(values, weights)
1 loops, best of 3: 459 ms per loop
As you can see, the built-in numpy solution is faster. But stride_tricks is worth knowing about, so I'm leaving it here.