Prevent numpy from creating a multidimensional array

NumPy is really helpful when creating arrays. If the first argument to numpy.array has a __getitem__ and a __len__ method, these are used on the basis that it might be a valid sequence.

Unfortunately, I want to create an array with dtype=object, so this is exactly the case where NumPy being "helpful" is not what I want.

Broken down to a minimal example, the class looks like this:

    import numpy as np

    class Test(object):
        def __init__(self, iterable):
            self.data = iterable

        def __getitem__(self, idx):
            return self.data[idx]

        def __len__(self):
            return len(self.data)

        def __repr__(self):
            return '{}({})'.format(self.__class__.__name__, self.data)

If the iterables have different lengths, everything is fine and I get exactly the result I want:

    >>> np.array([Test([1,2,3]), Test([3,2])], dtype=object)
    array([Test([1, 2, 3]), Test([3, 2])], dtype=object)

but NumPy creates a multidimensional array if they have the same length:

    >>> np.array([Test([1,2,3]), Test([3,2,1])], dtype=object)
    array([[1, 2, 3],
           [3, 2, 1]], dtype=object)

Unfortunately, there is only an ndmin argument, so I was wondering if there is a way to enforce a maximum number of dimensions (a hypothetical ndmax), or to otherwise prevent NumPy from interpreting custom classes as another dimension (without removing __len__ or __getitem__).

4 answers

The workaround is, of course, creating an array of the desired shape, and then copying the data:

    In [19]: lst = [Test([1, 2, 3]), Test([3, 2, 1])]
    In [20]: arr = np.empty(len(lst), dtype=object)
    In [21]: arr[:] = lst[:]
    In [22]: arr
    Out[22]: array([Test([1, 2, 3]), Test([3, 2, 1])], dtype=object)

Note that in any case, I would not be surprised if NumPy's behavior with respect to interpreting iterable objects (which is what you want to use, right?) depends on the NumPy version. And it may well be buggy. Or maybe some of those bugs are actually features. In any case, I would be wary of breakage when the NumPy version changes.

In contrast, copying into a pre-created array should be more robust.
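As a sketch, that pattern can be wrapped in a small helper. The name object_array_1d is my own, not a NumPy function, and I use an explicit element-by-element loop rather than slice assignment, which should be safe regardless of how a given NumPy version interprets the right-hand side:

    import numpy as np

    def object_array_1d(items):
        """Build a 1-d object array from `items` without letting
        np.array drill down into the elements. (Ad hoc helper.)"""
        arr = np.empty(len(items), dtype=object)
        for i, item in enumerate(items):
            arr[i] = item   # assign each element individually
        return arr

    lst = [[1, 2, 3], [3, 2, 1]]   # equal-length sublists
    arr = object_array_1d(lst)
    print(arr.shape)               # (2,) -- not (2, 3)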


This behavior has been discussed several times before (for example, in Override a dict with numpy support). np.array tries to create an array with as many dimensions as it can. The model case is nested lists: if it can iterate, and the sublists are equal in length, it "drills" down.

Here it drilled down two levels before encountering lists of different lengths:

    In [250]: np.array([[[1,2],[3]],[1,2]], dtype=object)
    Out[250]:
    array([[[1, 2], [3]],
           [1, 2]], dtype=object)
    In [251]: _.shape
    Out[251]: (2, 2)

Without a shape or ndmax parameter, it has no way of knowing whether I want the result to be (2,) or (2,2). Both would work with dtype=object.

This is compiled code, so it is not easy to see exactly which tests it applies. It does try to iterate over lists and tuples, but not over sets or dictionaries.
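That difference is easy to demonstrate: equal-length lists get turned into an extra dimension, while sets and dicts, which are not sequences, stay as single elements:

    import numpy as np

    # Lists (and tuples) of equal length are iterated into extra dimensions:
    a = np.array([[1, 2], [3, 4]])
    print(a.shape)        # (2, 2)

    # Sets are not treated as sequences, so numpy stops at one dimension:
    b = np.array([{1, 2}, {3, 4}], dtype=object)
    print(b.shape)        # (2,)

    # The same holds for dicts:
    c = np.array([{'a': 1}, {'b': 2}], dtype=object)
    print(c.shape)        # (2,)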

The surest way to create an object array with a given size is to start with np.empty and fill it:

    In [266]: A = np.empty((2,3), object)
    In [267]: A.fill([[1, 'one']])
    In [276]: A[:] = {1, 2}
    In [277]: A[:] = [1, 2]   # broadcast error

Another way is to start with at least one different element (for example, a None) and then replace it.

A more primitive array creator is np.ndarray:

    In [280]: np.ndarray((2,3), dtype=object)
    Out[280]:
    array([[None, None, None],
           [None, None, None]], dtype=object)

But this is basically the same as np.empty (unless I give it a buffer).
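For illustration, the buffer argument is what sets np.ndarray apart: instead of allocating new storage, it can create a view onto existing memory. A small sketch with a float dtype (object buffers are not typical):

    import numpy as np

    # With a buffer, np.ndarray views existing data rather than allocating:
    buf = np.arange(6, dtype=np.float64)
    view = np.ndarray((2, 3), dtype=np.float64, buffer=buf)
    print(view.shape)                    # (2, 3)
    print(np.shares_memory(view, buf))   # True

    view[0, 0] = 99.0
    print(buf[0])                        # 99.0 -- the buffer sees the change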

These are workarounds, but they are not expensive (time-wise).

================== (edit)

https://github.com/numpy/numpy/issues/5933 , "Enh: Object array creation function", is an enhancement request. Also relevant is https://github.com/numpy/numpy/issues/5303 , "the error message for accidentally irregular arrays is confusing".

The developer sentiment seems to favor a separate function for creating dtype=object arrays, with more control over the initial dimensions and the depth of iteration. They might even strengthen the error checking to keep np.array from creating irregular arrays.

Such a function could detect the shape of a regular nested iterable down to a specified depth, and build an object-dtype array to be filled:

    def objarray(alist, depth=1):
        shape = []
        l = alist
        for _ in range(depth):
            shape.append(len(l))
            l = l[0]
        arr = np.empty(shape, dtype=object)
        arr[:] = alist
        return arr

With various depths:

    In [528]: alist = [[Test([1,2,3])], [Test([3,2,1])]]
    In [529]: objarray(alist, 1)
    Out[529]: array([[Test([1, 2, 3])], [Test([3, 2, 1])]], dtype=object)
    In [530]: objarray(alist, 2)
    Out[530]: array([[Test([1, 2, 3])], [Test([3, 2, 1])]], dtype=object)
    In [531]: objarray(alist, 3)
    Out[531]:
    array([[[1, 2, 3]],
           [[3, 2, 1]]], dtype=object)
    In [532]: objarray(alist, 4)
    ...
    TypeError: object of type 'int' has no len()

This workaround may not be the most efficient, but I like its clarity:

    test_list = [Test([1,2,3]), Test([3,2,1])]
    test_list.append(None)
    test_array = np.array(test_list, dtype=object)[:-1]

Explanation: You take your list, append None, and then convert it to a numpy array; the unequal lengths prevent numpy from building a multidimensional array. Finally, you simply drop the last entry to get the structure you want.


Pandas workaround

This may not be what the OP is looking for. But, just in case, if someone is looking for a way to prevent the creation of multidimensional numpy arrays, this might be useful.


Pass your list to pd.Series and then get the elements as an array using .values .

    import pandas as pd

    pd.Series([Test([1,2,3]), Test([3,2,1])]).values
    # array([Test([1, 2, 3]), Test([3, 2, 1])], dtype=object)

Or if you are dealing with numpy arrays:

    np.array([np.random.randn(2,2), np.random.randn(2,2)]).shape
    # (2, 2, 2)

Using pd.Series :

    pd.Series([np.random.randn(2,2), np.random.randn(2,2)]).values.shape
    # (2,)
