Scalable array size in Python

I just stumbled upon the need for an incrementally growing NumPy array in Python, and since I did not find anything existing, I implemented one. I'm just wondering whether my method is the best one, or whether you can come up with other ideas.

So, the problem is that I have a 2D array (the program handles nD arrays) whose size is not known in advance, and a variable amount of data has to be appended to the array along one direction (say, I call np.vstack often). Every time I concatenate data, I need to take the array, sort it along axis 0 and do other things, so I cannot simply build a long list of arrays and then np.vstack the whole list at once. Since memory allocation is expensive, I turned to incrementally growing arrays, where I grow the array to a size larger than what is strictly needed (I use 50% increments), so that the number of allocations is minimized.

I coded this up, and you can see it in the following code:

    import numpy as np

    class ExpandingArray:

        __DEFAULT_ALLOC_INIT_DIM = 10   # default initial dimension for all the axes if nothing is given by the user
        __DEFAULT_MAX_INCREMENT = 10    # default value in order to limit the increment of memory allocation

        __MAX_INCREMENT = []    # Max increment
        __ALLOC_DIMS = []       # Dimensions of the allocated np.array
        __DIMS = []             # Dimensions of the view with data on the allocated np.array (__DIMS <= __ALLOC_DIMS)

        __ARRAY = []            # Allocated array

        def __init__(self,initData,allocInitDim=None,dtype=np.float64,maxIncrement=None):
            self.__DIMS = np.array(initData.shape)

            self.__MAX_INCREMENT = maxIncrement
            if self.__MAX_INCREMENT == None:
                self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT

            # Compute the allocation dimensions based on user input
            if allocInitDim == None:
                allocInitDim = self.__DIMS.copy()

            while np.any( allocInitDim < self.__DIMS ) or np.any(allocInitDim == 0):
                for i in range(len(self.__DIMS)):
                    if allocInitDim[i] == 0:
                        allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                    if allocInitDim[i] < self.__DIMS[i]:
                        allocInitDim[i] += min(allocInitDim[i]/2, self.__MAX_INCREMENT)

            # Allocate memory
            self.__ALLOC_DIMS = allocInitDim
            self.__ARRAY = np.zeros(self.__ALLOC_DIMS,dtype=dtype)

            # Set initData
            sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
            self.__ARRAY[sliceIdxs] = initData

        def shape(self):
            return tuple(self.__DIMS)

        def getAllocArray(self):
            return self.__ARRAY

        def getDataArray(self):
            """
            Get the view of the array with data
            """
            sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
            return self.__ARRAY[sliceIdxs]

        def concatenate(self,X,axis=0):
            if axis > len(self.__DIMS):
                print "Error: axis number exceeds the number of dimensions"
                return

            # Check dimensions for the remaining axes
            for i in range(len(self.__DIMS)):
                if i != axis:
                    if X.shape[i] != self.shape()[i]:
                        print "Error: Dimensions of the input array are not consistent in the axis %d" % i
                        return

            # Check whether the allocated memory is enough
            needAlloc = False
            while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
                needAlloc = True
                # Increase the __ALLOC_DIMS
                self.__ALLOC_DIMS[axis] += min(self.__ALLOC_DIMS[axis]/2,self.__MAX_INCREMENT)

            # Reallocate memory and copy the old data
            if needAlloc:
                # Allocate
                newArray = np.zeros(self.__ALLOC_DIMS)
                # Copy
                sliceIdxs = [slice(self.__DIMS[i]) for i in range(len(self.__DIMS))]
                newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
                self.__ARRAY = newArray

            # Concatenate the new data
            sliceIdxs = []
            for i in range(len(self.__DIMS)):
                if i != axis:
                    sliceIdxs.append(slice(self.__DIMS[i]))
                else:
                    sliceIdxs.append(slice(self.__DIMS[i],self.__DIMS[i]+X.shape[i]))

            self.__ARRAY[sliceIdxs] = X
            self.__DIMS[axis] += X.shape[axis]
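For example, a quick usage sketch of the class (not part of the benchmark below): start from an empty (0, 2) array, append two blocks along axis 0, and compare the logical shape with the allocated shape.

    EA = ExpandingArray(np.zeros((0, 2)))
    EA.concatenate(np.random.rand(5, 2), axis=0)
    EA.concatenate(np.random.rand(3, 2), axis=0)
    print(EA.shape())                 # (8, 2): logical size of the stored data
    print(EA.getAllocArray().shape)   # (10, 2): size of the allocated buffer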

The code shows significantly better performance than vstack/hstack for several concatenations of arbitrary size.

I am wondering: is this the best way? Is there anything that does this already in numpy?

In addition, it would be nice to be able to overload the slice assignment operator of np.array, so that as soon as the user assigns something outside the actual dimensions, ExpandingArray.concatenate() gets called. How would one write such an overload?
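To make the idea concrete, here is a rough, untested sketch of the behavior I mean, written as a hypothetical wrapper class (the name AutoExpanding is just for illustration) around ExpandingArray rather than a real overload of np.array, and only handling the simplest case of a scalar row index on a 2D array:

    class AutoExpanding(object):
        def __init__(self, expandingArray):
            self._ea = expandingArray   # an ExpandingArray instance (2D)

        def __getitem__(self, idx):
            return self._ea.getDataArray()[idx]

        def __setitem__(self, idx, value):
            nRows = self._ea.shape()[0]
            if isinstance(idx, int) and idx >= nRows:
                # Assignment past the end: pad with zero rows up to idx, then assign.
                nCols = self._ea.shape()[1]
                self._ea.concatenate(np.zeros((idx - nRows + 1, nCols)), axis=0)
            self._ea.getDataArray()[idx] = value

    # a = AutoExpanding(ExpandingArray(np.zeros((0, 2))))
    # a[10] = [1.0, 2.0]   # rows 0..9 stay zero-filled, row 10 gets the value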

Code testing. Here I also include the code that I used to compare vstack against my method. I append chunks of random data with a maximum length of 100.

    import time

    N = 10000

    def performEA(N):
        EA = ExpandingArray(np.zeros((0,2)),maxIncrement=1000)
        for i in range(N):
            nNew = np.random.random_integers(low=1,high=100,size=1)
            X = np.random.rand(nNew,2)
            EA.concatenate(X,axis=0)
            # Perform operations on EA.getDataArray()
        return EA

    def performVStack(N):
        A = np.zeros((0,2))
        for i in range(N):
            nNew = np.random.random_integers(low=1,high=100,size=1)
            X = np.random.rand(nNew,2)
            A = np.vstack((A,X))
            # Perform operations on A
        return A

    start_EA = time.clock()
    EA = performEA(N)
    stop_EA = time.clock()

    start_VS = time.clock()
    VS = performVStack(N)
    stop_VS = time.clock()

    print "Elapsed Time EA: %.2f" % (stop_EA-start_EA)
    print "Elapsed Time VS: %.2f" % (stop_VS-start_VS)
2 answers

I think the most common design pattern for this kind of thing is to simply collect the small arrays in a list. Of course you could do things like dynamic resizing (if you want to do crazy things, you can also try to use the resize array method). I think the typical approach is to always double the size when you really don't know how big things will be. Of course, if you do know how big the array will grow, just allocating everything up front is the simplest option.

    def performVStack_fromlist(N):
        l = []
        for i in range(N):
            nNew = np.random.random_integers(low=1,high=100,size=1)
            X = np.random.rand(nNew,2)
            l.append(X)
        return np.vstack(l)
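If you do want the "always double the size when you run out of room" strategy instead, a minimal sketch of it (the function name performDoubling is just for illustration, and it reuses the benchmark setup from the question) could look like this:

    def performDoubling(N):
        cap = 128                       # initial capacity (arbitrary)
        A = np.empty((cap, 2))
        n = 0                           # number of rows actually filled
        for i in range(N):
            X = np.random.rand(np.random.randint(1, 101), 2)
            while n + X.shape[0] > cap:
                cap *= 2                # double the allocation and copy the old data over
                newA = np.empty((cap, 2))
                newA[:n] = A[:n]
                A = newA
            A[n:n + X.shape[0]] = X
            n += X.shape[0]
        return A[:n]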

I'm sure there are some use cases where an expanding array can be useful (for example, when the appended arrays are all very small), but this loop seems better handled with the pattern above. The optimization is mostly about how often you need to copy everything around, and with a list like this you copy everything (apart from the list itself) exactly once. So it is normally much faster.


When I ran into a similar problem, I used ndarray.resize() ( http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.resize.html#numpy.ndarray.resize ). Most of the time it will avoid reallocation + copying altogether. I can't guarantee it would prove faster (it probably would), but it's much simpler.
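To make that concrete, a minimal sketch of a resize-based variant of the question's benchmark (the name performResize is just for illustration) could look like this; it passes refcheck=False so the in-place resize is not rejected by NumPy's reference-count check:

    def performResize(N):
        A = np.zeros((10, 2))
        n = 0                           # rows currently in use
        for i in range(N):
            X = np.random.rand(np.random.randint(1, 101), 2)
            while n + X.shape[0] > A.shape[0]:
                # Grow in place; resize zero-fills the newly added rows.
                A.resize((2 * A.shape[0], 2), refcheck=False)
            A[n:n + X.shape[0]] = X
            n += X.shape[0]
        return A[:n]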

As for your second question, I think overriding slice assignment for extension purposes is not a good idea. That operator is meant for assigning to existing items/slices. If you want to change that, it's not immediately clear how you would want it to behave in some cases, for example:

    a = MyExtendableArray(np.arange(100))
    a[200] = 6          # resize to 200? pad [100:200] with what?
    a[90:110] = 7       # assign to existing items AND automagically-allocated items?
    a[::-1][200] = 6    # ...

My suggestion is that slice assignment and data appending should remain separate.

