Elegant matrix shift and NaN filling?

I have a specific performance issue. I work with meteorological forecasts, which I compile into a numpy 2d array, so

  • dim0 = start time of the forecast series
  • dim1 = forecast horizon, for example 0 to 120 hours

Now I would like dim0 to have hourly intervals, but some sources only give predictions every N hours. As an example, let's say N = 3, and the time step in dim1 is M = 1 hour. Then I get something like

    12:00   11.2  12.2  14.0  15.0  11.3  12.0
    13:00    nan   nan   nan   nan   nan   nan
    14:00    nan   nan   nan   nan   nan   nan
    15:00   14.7  11.5  12.2  13.0  14.3  15.1

But, of course, there is information at 13:00 and 14:00, as it can be filled in from the forecast run at 12:00. So I would like to get something like this:

    12:00   11.2  12.2  14.0  15.0  11.3  12.0
    13:00   12.2  14.0  15.0  11.3  12.0   nan
    14:00   14.0  15.0  11.3  12.0   nan   nan
    15:00   14.7  11.5  12.2  13.0  14.3  15.1

What is the fastest way to get there if dim0 is in the order of 1e4 and dim1 is in the order of 1e2? Right now I am doing this line by line, but it is very slow:

    nRows, nCols = self.dat.shape
    if N >= M:
        assert(N % M == 0)  # must have whole numbers
    for i in range(1, nRows):
        k = np.array(np.where(np.isnan(self.dat[i, :])))
        k = k[k < nCols - N]  # do not overstep the array
        self.dat[i, k] = self.dat[i - 1, k + N]

I'm sure there must be a more elegant way to do this? Any hints would be greatly appreciated.
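For anyone who wants to experiment with the answers, the tables above boil down to a small test array (a minimal reproduction; the 13:00 and 14:00 rows are all-NaN because N = 3):

```python
import numpy as np

# Rows are hourly forecast start times (12:00 .. 15:00), columns the forecast
# horizon. Only every N = 3rd row carries a fresh forecast; the rest are NaN.
dat = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                [np.nan] * 6,
                [np.nan] * 6,
                [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])
print(np.isnan(dat).sum())  # -> 12
```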

+7
python numpy nan
4 answers

Behold, the power of boolean indexing!

    def shift_nans(arr):
        while True:
            nan_mask = np.isnan(arr)
            write_mask = nan_mask[1:, :-1]
            read_mask = nan_mask[:-1, 1:]
            write_mask &= ~read_mask
            if not np.any(write_mask):
                return arr
            arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]

I think the naming makes it self-evident what is going on. Getting the slicing right is a pain, but it seems to work:

    In [214]: shift_nans(test_data)
    Out[214]:
    array([[ 11.2,  12.2,  14. ,  15. ,  11.3,  12. ],
           [ 12.2,  14. ,  15. ,  11.3,  12. ,   nan],
           [ 14. ,  15. ,  11.3,  12. ,   nan,   nan],
           [ 14.7,  11.5,  12.2,  13. ,  14.3,  15.1],
           [ 11.5,  12.2,  13. ,  14.3,  15.1,   nan],
           [ 15.7,  16.5,  17.2,  18. ,  14. ,  12. ]])
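If the offset slices look opaque, a throwaway sketch with a tiny index grid makes the alignment visible: each cell of write_mask lines up with the cell one row above and one column to the right in read_mask.

```python
import numpy as np

# A 3x3 grid where each cell stores its own (row, col) position.
idx = np.array([[(r, c) for c in range(3)] for r in range(3)])  # shape (3, 3, 2)

target = idx[1:, :-1]  # positions that may be written (cf. write_mask)
source = idx[:-1, 1:]  # positions they are copied from (cf. read_mask)

# Every pair differs by (+1 row, -1 col) from source to target:
print(target[0, 0], source[0, 0])  # -> [1 0] [0 1]
```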

And for the timings:

    tmp1 = np.random.uniform(-10, 20, (10000, 100))
    nan_idx = np.random.randint(30, 10000 - 1, 10000)
    tmp1[nan_idx] = np.nan
    tmp = tmp1.copy()

    import timeit
    t1 = timeit.timeit(stmt='shift_nans(tmp)',
                       setup='from __main__ import tmp, shift_nans',
                       number=1)
    t2 = timeit.timeit(stmt='shift_time(tmp1)',  # Ophion's code
                       setup='from __main__ import tmp1, shift_time',
                       number=1)

    In [242]: t1, t2
    Out[242]: (0.12696346416487359, 0.3427293070417363)
+5

Slice your data with a = yourdata[:, 1:].

    def shift_time(dat):
        # Find the number of required iterations
        check = np.where(np.isnan(dat[:, 0]) == False)[0]
        maxiters = np.max(np.diff(check)) - 1

        # No sense in iterations where it just updates nans
        cols = dat.shape[1]
        if cols < maxiters:
            maxiters = cols - 1

        for iters in range(maxiters):
            # Find the nans and pull from one row up, one column right
            # (assumes the first row contains data, so row_loc - 1 never wraps)
            row_loc, col_loc = np.where(np.isnan(dat[:, :-1]))
            dat[(row_loc, col_loc)] = dat[(row_loc - 1, col_loc + 1)]

    a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                  [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                  [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                  [14.7, 11.5, 12.2, 13.0, 14.3, 15.0]])

    shift_time(a)
    print(a)

    [[ 11.2  12.2  14.   15.   11.3  12. ]
     [ 12.2  14.   15.   11.3  12.    nan]
     [ 14.   15.   11.3  12.    nan   nan]
     [ 14.7  11.5  12.2  13.   14.3  15. ]]

To use your data as-is, slice it when you call the function, or modify the function a bit to take your array directly; but this seemed like the clearest way to show it:

    shift_time(yourdata[:, 1:])  # Updates in place, no need to return anything.

Using tiago's test:

    tmp = np.random.uniform(-10, 20, (10000, 100))
    nan_idx = np.random.randint(30, 10000 - 1, 10000)
    tmp[nan_idx] = np.nan

    t = time.time()
    shift_time(tmp)
    print(time.time() - t)

    0.364198923111  # seconds

If you are really clever, you should be able to get away with a single np.where.
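For what it's worth, the whole fill can be done with no Python-level loop at all, under the assumption (true for the question's data) that a row is either entirely fresh or entirely NaN, so a NaN in column 0 marks a stale row. This is only a sketch, and shift_time_vec is a made-up name:

```python
import numpy as np

def shift_time_vec(dat):
    # For each row, find the most recent row that starts with real data,
    # then gather from that row, shifted right by the row distance.
    n_rows, n_cols = dat.shape
    has_data = ~np.isnan(dat[:, 0])
    last = np.maximum.accumulate(np.where(has_data, np.arange(n_rows), 0))
    shift = np.arange(n_rows) - last              # how many rows back we look
    src_cols = np.arange(n_cols)[None, :] + shift[:, None]
    valid = src_cols < n_cols                     # beyond the horizon -> NaN
    out = np.full_like(dat, np.nan)
    src_rows = np.broadcast_to(last[:, None], dat.shape)
    out[valid] = dat[src_rows[valid], src_cols[valid]]
    return out
```

On the question's example this reproduces the desired output, and it reads each cell only once regardless of the gap size.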

+2

This seems to do the trick:

    import numpy as np

    def shift_time(dat):
        NX, NY = dat.shape
        for i in range(NY):
            x, y = np.where(np.isnan(dat))
            xr = x - 1
            yr = y + 1
            idx = (xr >= 0) & (yr < NY)
            dat[x[idx], y[idx]] = dat[xr[idx], yr[idx]]
        return

Now with some test data:

    In [1]: test_data = np.array([[11.2, 12.2, 14. , 15. , 11.3, 12. ],
       ...:                       [ nan,  nan,  nan,  nan,  nan,  nan],
       ...:                       [ nan,  nan,  nan,  nan,  nan,  nan],
       ...:                       [14.7, 11.5, 12.2, 13. , 14.3, 15.1],
       ...:                       [ nan,  nan,  nan,  nan,  nan,  nan],
       ...:                       [15.7, 16.5, 17.2, 18. , 14. , 12. ]])

    In [2]: shift_time(test_data)

    In [3]: test_data
    Out[3]:
    array([[ 11.2,  12.2,  14. ,  15. ,  11.3,  12. ],
           [ 12.2,  14. ,  15. ,  11.3,  12. ,   nan],
           [ 14. ,  15. ,  11.3,  12. ,   nan,   nan],
           [ 14.7,  11.5,  12.2,  13. ,  14.3,  15.1],
           [ 11.5,  12.2,  13. ,  14.3,  15.1,   nan],
           [ 15.7,  16.5,  17.2,  18. ,  14. ,  12. ]])

And testing with an array (1e4, 1e2):

    In [1]: tmp = np.random.uniform(-10, 20, (10000, 100))
    In [2]: nan_idx = np.random.randint(30, 10000 - 1, 10000)
    In [3]: tmp[nan_idx] = nan
    In [4]: %time shift_time(tmp)
    CPU times: user 1.53 s, sys: 0.06 s, total: 1.59 s
    Wall time: 1.59 s
+1

Each iteration of this pad, roll, roll combination does what you are looking for:

    import numpy as np
    from numpy import nan

    # Startup array
    A = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                  [ nan,  nan,  nan,  nan,  nan,  nan],
                  [ nan,  nan,  nan,  nan,  nan,  nan],
                  [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

    def pad_nan(v, pad_width, iaxis, kwargs):
        v[:pad_width[0]] = nan
        v[-pad_width[1]:] = nan
        return v

    def roll_data(A):
        idx = np.isnan(A)
        A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
        return A

    print(A)
    print(roll_data(A))
    print(roll_data(A))

The output gives:

    [[ 11.2  12.2  14.   15.   11.3  12. ]
     [  nan   nan   nan   nan   nan   nan]
     [  nan   nan   nan   nan   nan   nan]
     [ 14.7  11.5  12.2  13.   14.3  15.1]]

    [[ 11.2  12.2  14.   15.   11.3  12. ]
     [ 12.2  14.   15.   11.3  12.    nan]
     [  nan   nan   nan   nan   nan   nan]
     [ 14.7  11.5  12.2  13.   14.3  15.1]]

    [[ 11.2  12.2  14.   15.   11.3  12. ]
     [ 12.2  14.   15.   11.3  12.    nan]
     [ 14.   15.   11.3  12.    nan   nan]
     [ 14.7  11.5  12.2  13.   14.3  15.1]]

Everything is pure numpy, so each iteration should be extremely fast. However, I'm not sure about the cost of creating the padded array and of running several iterations; if you try it, let me know the results!
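A rough way to check, sketched with timeit on the array size from the question; each call rebuilds the padded copy, so the measured per-pass cost includes that:

```python
import numpy as np
import timeit

def pad_nan(v, pad_width, iaxis, kwargs):
    # Padding function for np.pad: fill the border with NaN.
    v[:pad_width[0]] = np.nan
    v[-pad_width[1]:] = np.nan
    return v

def roll_data(A):
    # One fill step: shift the array one row down, one column left, then
    # copy the shifted values into the NaN positions.
    idx = np.isnan(A)
    A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
    return A

tmp = np.random.uniform(-10, 20, (10000, 100))
tmp[np.random.randint(30, 10000 - 1, 10000)] = np.nan

# Average seconds per pad/roll pass over the full array.
t = timeit.timeit(lambda: roll_data(tmp), number=10) / 10
print(t)
```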

0
