Python numpy mask

Well, after some searching, I can't find an SO question that directly solves this. I looked into masked arrays, and although they seem cool, I'm not sure they are what I need.

Consider two numpy arrays:

zone_data is a 2D numpy array with patches of elements that share the same value. These are my "zones".

value_data is a 2D array (exactly the same shape as zone_data) with arbitrary values.

I am looking for a numpy array of the same shape as zone_data / value_data that holds the average value of each zone in place of the zone number.

An example, in ASCII-art form:

zone_data (4 different zones):

    1, 1, 2, 2
    1, 1, 2, 2
    3, 3, 4, 4
    3, 3, 4, 4

value_data:

    1, 2, 3, 6
    3, 0, 2, 5
    1, 1, 1, 0
    2, 4, 2, 1

My desired result, call it result_data:

    1.5, 1.5, 4.0, 4.0
    1.5, 1.5, 4.0, 4.0
    2.0, 2.0, 1.0, 1.0
    2.0, 2.0, 1.0, 1.0

Here is the code I have. It works and gives me the correct result.

    result_data = np.zeros(zone_data.shape)
    for i in np.unique(zone_data):
        result_data[zone_data == i] = np.mean(value_data[zone_data == i])

My arrays are large, and this piece of code takes several seconds. I think I have a knowledge gap, and I haven't found anything useful. The looping should be delegated to the library or something ... aarg!

Please help me make this FAST! Gods of Python, I seek your wisdom!

EDIT - adding a benchmark script

    import time

    import numpy as np

    zones = np.random.randint(1000, size=(2000, 1000))
    values = np.random.rand(2000, 1000)

    print('start method 1:')
    start_time = time.time()
    result_data = np.zeros(zones.shape)
    for i in np.unique(zones):
        result_data[zones == i] = np.mean(values[zones == i])
    print('done method 1 in %.2f seconds' % (time.time() - start_time))

    print()
    print('start method 2:')
    start_time = time.time()
    # your method here!
    print('done method 2 in %.2f seconds' % (time.time() - start_time))

My output:

    start method 1:
    done method 1 in 4.34 seconds

    start method 2:
    done method 2 in 0.00 seconds
2 answers

You can use np.bincount:

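    count = np.bincount(zones.flat)                     # number of elements per zone
    tot = np.bincount(zones.flat, weights=values.flat)  # sum of values per zone
    avg = tot / count                                   # mean per zone, indexed by zone number
    result_data2 = avg[zones]                           # look up each element's zone mean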

which gives me

    start method 1:
    done method 1 in 3.13 seconds

    start method 2:
    done method 2 in 0.01 seconds
    >>> np.allclose(result_data, result_data2)
    True
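np.bincount only accepts non-negative integer labels, and it allocates one bin for every integer up to the largest label. If the zone numbers could be negative or very sparse, one possible workaround is to remap them to compact indices first with np.unique(..., return_inverse=True). The sketch below is only an illustration of that idea: the function name zone_means is made up, and the data is the 4x4 example from the question, with the last row of zones taken as 3, 3, 4, 4 (which is what the expected result implies).

```python
import numpy as np


def zone_means(zones, values):
    """Return an array shaped like `zones` holding each zone's mean value.

    `zone_means` is a hypothetical helper name, not a NumPy function.
    """
    # Map arbitrary labels to compact indices 0..k-1, so negative or
    # sparse zone numbers are fine. Working on the flattened array keeps
    # `inverse` one-dimensional.
    labels, inverse = np.unique(zones.ravel(), return_inverse=True)
    counts = np.bincount(inverse)                          # elements per zone
    totals = np.bincount(inverse, weights=values.ravel())  # sum of values per zone
    means = totals / counts                                # one mean per zone
    # Look up each element's zone mean and restore the original shape.
    return means[inverse].reshape(zones.shape)


zone_data = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]])
value_data = np.array([[1, 2, 3, 6],
                       [3, 0, 2, 5],
                       [1, 1, 1, 0],
                       [2, 4, 2, 1]])

result = zone_means(zone_data, value_data)
print(result)
```

The remapping costs an extra sort inside np.unique, so when the labels are already small non-negative integers, as in the benchmark script, calling np.bincount directly on the zones is likely the cheaper choice.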

I thought I had seen this somewhere in scipy, but I can no longer find it. Have you looked there?

In any case, you can get a first improvement by changing your loop:

    result_data = np.empty(zones.shape)  # minor speed gain
    for label in np.unique(zones):
        mask = zones == label
        result_data[mask] = np.mean(values[mask])

This way you do not need to compute the boolean comparison twice, which slightly reduces the execution time.
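For reference, here is the single-mask loop wrapped in a function and checked against the 4x4 example from the question. This is just a sketch: zone_means_loop is a made-up name, and the last row of zones is taken as 3, 3, 4, 4, which is what the expected result implies. Note that each iteration still compares the whole array against one label, so the total work grows with the number of zones times the number of elements.

```python
import numpy as np


def zone_means_loop(zones, values):
    """Per-zone mean via one boolean mask per zone (hypothetical helper name)."""
    result = np.empty(zones.shape, dtype=float)
    for label in np.unique(zones):
        mask = zones == label               # computed once, reused twice below
        result[mask] = values[mask].mean()  # broadcast the zone mean into place
    return result


zones = np.array([[1, 1, 2, 2],
                  [1, 1, 2, 2],
                  [3, 3, 4, 4],
                  [3, 3, 4, 4]])
values = np.array([[1, 2, 3, 6],
                   [3, 0, 2, 5],
                   [1, 1, 1, 0],
                   [2, 4, 2, 1]])

loop_result = zone_means_loop(zones, values)
print(loop_result)
```

Using np.empty is safe here because every element belongs to exactly one zone, so every slot of the result is overwritten before it is read.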


Source: https://habr.com/ru/post/1211301/

