I just stumbled upon the need for an incrementally growable NumPy array in Python, and since I could not find an existing implementation, I wrote one. I'm wondering whether my method is the best one, or whether you can come up with other ideas.
So, the problem: I have a 2D array (the program actually handles nD arrays) whose size is not known in advance, and variable-sized chunks of data need to be appended to the array along one direction (say, via repeated np.vstack calls). After every concatenation I need to take the array, sort it along axis 0, and do other work, so I cannot simply build a long list of arrays and np.vstack the list once at the end. Since memory allocation is expensive, I turned to incremental arrays: I grow the array to a size larger than what is currently needed (in 50% increments), which minimizes the number of allocations.
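The strategy can be sketched in a few lines: keep a zero-filled buffer with spare capacity, and reallocate (with roughly 50% geometric growth) only when an append would overflow it. This is a minimal illustration with a hypothetical `grow` helper, not the full class:

```python
import numpy as np

def grow(buf, used, new_rows):
    """Ensure buf has room for used + new_rows rows, growing by ~50% when needed."""
    cap = buf.shape[0]
    needed = used + new_rows
    if needed > cap:
        while cap < needed:
            cap += max(cap // 2, 1)   # 50% geometric growth
        newbuf = np.zeros((cap,) + buf.shape[1:], dtype=buf.dtype)
        newbuf[:used] = buf[:used]    # copy only the live rows
        buf = newbuf
    return buf

buf = np.zeros((4, 2))
used = 0
for _ in range(10):
    chunk = np.random.rand(3, 2)
    buf = grow(buf, used, chunk.shape[0])
    buf[used:used + chunk.shape[0]] = chunk
    used += chunk.shape[0]
# buf[:used] is the live-data view, which can be sorted in place at each step
```

Because the capacity grows geometrically, the total number of reallocations is logarithmic in the final size, so the amortized copy cost per appended row is constant.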
I implemented this, and you can see it in the following code:
    import numpy as np

    class ExpandingArray:

        __DEFAULT_ALLOC_INIT_DIM = 10   # default initial size per axis if nothing is given by the user
        __DEFAULT_MAX_INCREMENT = 10    # default cap on a single memory-allocation increment

        def __init__(self, initData, allocInitDim=None, dtype=np.float64, maxIncrement=None):
            self.__DIMS = np.array(initData.shape)   # dims of the view with data (__DIMS <= __ALLOC_DIMS)
            self.__MAX_INCREMENT = maxIncrement      # max increment
            if self.__MAX_INCREMENT is None:
                self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT
            # Compute the allocation dimensions based on user input
            if allocInitDim is None:
                allocInitDim = self.__DIMS.copy()
            while np.any(allocInitDim < self.__DIMS) or np.any(allocInitDim == 0):
                for i in range(len(self.__DIMS)):
                    if allocInitDim[i] == 0:
                        allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                    if allocInitDim[i] < self.__DIMS[i]:
                        allocInitDim[i] += max(min(allocInitDim[i] // 2, self.__MAX_INCREMENT), 1)
            # Allocate memory
            self.__ALLOC_DIMS = allocInitDim         # dimensions of the allocated np.array
            self.__ARRAY = np.zeros(self.__ALLOC_DIMS, dtype=dtype)
            # Set initData
            sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
            self.__ARRAY[sliceIdxs] = initData

        def shape(self):
            return tuple(self.__DIMS)

        def getAllocArray(self):
            return self.__ARRAY

        def getDataArray(self):
            """ Get the view of the array with data """
            sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
            return self.__ARRAY[sliceIdxs]

        def concatenate(self, X, axis=0):
            if axis >= len(self.__DIMS):
                raise ValueError("Axis number exceeds the number of dimensions")
            # Check dimensions for the remaining axes
            for i in range(len(self.__DIMS)):
                if i != axis and X.shape[i] != self.shape()[i]:
                    raise ValueError(
                        "Dimensions of the input array are not consistent on axis %d" % i)
            # Check whether the allocated memory is enough
            needAlloc = False
            while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
                needAlloc = True
                # Increase __ALLOC_DIMS (50% growth, capped at __MAX_INCREMENT)
                self.__ALLOC_DIMS[axis] += max(
                    min(self.__ALLOC_DIMS[axis] // 2, self.__MAX_INCREMENT), 1)
            # Reallocate memory and copy the old data
            if needAlloc:
                newArray = np.zeros(self.__ALLOC_DIMS, dtype=self.__ARRAY.dtype)
                sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
                newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
                self.__ARRAY = newArray
            # Concatenate the new data
            sliceIdxs = tuple(
                slice(self.__DIMS[i]) if i != axis
                else slice(self.__DIMS[i], self.__DIMS[i] + X.shape[i])
                for i in range(len(self.__DIMS)))
            self.__ARRAY[sliceIdxs] = X
            self.__DIMS[axis] += X.shape[axis]
The code performs significantly better than vstack/hstack when many concatenations of arbitrary size are performed.
I am wondering: is this the best way? Is there anything that does this already in numpy?
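One partial answer to the second question: numpy's `ndarray.resize` method (distinct from the `np.resize` function) can enlarge an array in place, zero-filling the new entries. It does not, however, keep a capacity separate from the logical size, so every enlargement still costs a reallocation and copy. A quick check:

```python
import numpy as np

a = np.ones((3, 2))            # must own its data; resize fails on views
a.resize((8, 2), refcheck=False)
print(a.shape)                 # (8, 2): rows 0..2 keep their data, rows 3..7 are zeros
```

So `ndarray.resize` covers the "grow in place" part, but the capacity/size bookkeeping that amortizes the allocations still has to live in a wrapper like the class above.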
In addition, it would be nice to be able to overload the slice-assignment operator for np.array, so that as soon as the user assigns something outside the current dimensions, ExpandingArray.concatenate() is executed. How could such an overload be done?
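I don't think a plain np.ndarray subclass can safely reallocate its own memory inside `__setitem__`, so a thin wrapper that owns the buffer is the practical route. A hypothetical sketch (`AutoGrowArray` is my own name, not a numpy API; it assumes integer or slice-with-stop indices on axis 0):

```python
import numpy as np

class AutoGrowArray:
    """Sketch: grow the underlying buffer on out-of-range row assignment."""

    def __init__(self, cols, dtype=np.float64):
        self._buf = np.zeros((8, cols), dtype=dtype)
        self._rows = 0                      # logical number of rows

    def __setitem__(self, idx, value):
        # Find the last row the assignment touches (int index or slice with a stop)
        stop = idx.stop if isinstance(idx, slice) else idx + 1
        if stop > self._buf.shape[0]:
            cap = self._buf.shape[0]
            while cap < stop:
                cap += max(cap // 2, 1)     # 50% geometric growth
            newbuf = np.zeros((cap, self._buf.shape[1]), dtype=self._buf.dtype)
            newbuf[:self._rows] = self._buf[:self._rows]
            self._buf = newbuf
        self._buf[idx] = value
        self._rows = max(self._rows, stop)

    def data(self):
        return self._buf[:self._rows]       # view of the live rows

a = AutoGrowArray(2)
a[0:3] = np.random.rand(3, 2)
a[50:60] = np.random.rand(10, 2)            # out-of-range write triggers growth
print(a.data().shape)                       # (60, 2); untouched rows stay zero-filled
```

The wrapper trades transparency for control: callers index it like an array, but any unassigned rows in between are silently zero, which may or may not be what you want.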
Code testing. Here I also place the code that I used to compare vstack against my method. Each appended chunk of data has a random length of at most 100 rows.
    import time

    N = 10000

    def performEA(N):
        EA = ExpandingArray(np.zeros((0, 2)), maxIncrement=1000)
        for i in range(N):
            nNew = np.random.randint(low=1, high=101)   # random chunk length in [1, 100]
            X = np.random.rand(nNew, 2)
            EA.concatenate(X, axis=0)
        return EA
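For reference, here is a vstack baseline to time against (`performVstack` is my own hypothetical counterpart, not code from the post above). It re-stacks the whole array on every append, so the total copying cost grows quadratically with the number of rows:

```python
import time
import numpy as np

def performVstack(N):
    """Baseline: re-stack the entire array on every append (quadratic copying)."""
    data = np.zeros((0, 2))
    for _ in range(N):
        nNew = np.random.randint(low=1, high=101)   # random chunk length in [1, 100]
        data = np.vstack((data, np.random.rand(nNew, 2)))
    return data

start = time.time()
data = performVstack(1000)
print("vstack: %.3f s, final shape %s" % (time.time() - start, data.shape))
```

Timing both functions over the same N makes the difference visible: the vstack baseline copies every existing row on each iteration, while the incremental version only copies on the occasional reallocation.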