Creating a Pandas DataFrame with a numpy array containing several types

Question

I want to create a pandas DataFrame whose default values are all zero, with one column of integers and the others of floats. I can create a numpy array with the correct types (see the values variable below). However, when I pass it to the DataFrame constructor, I get only NaN values (see df below). I have also included the untyped version, which returns an array of floats (see df2).

    import pandas as pd
    import numpy as np

    values = np.zeros((2,3), dtype='int32,float32')
    index = ['x', 'y']
    columns = ['a','b','c']

    df = pd.DataFrame(data=values, index=index, columns=columns)
    df.values.dtype

    values2 = np.zeros((2,3))
    df2 = pd.DataFrame(data=values2, index=index, columns=columns)
    df2.values.dtype

Any suggestions on how to construct the DataFrame?

python numpy pandas

bfcondon Feb 08 '14 at 14:12

1 answer

unutbu · Accepted Answer · 2014-02-08 14:25

Here are a few options you could choose:

    import numpy as np
    import pandas as pd

    index = ['x', 'y']
    columns = ['a','b','c']

    # Option 1: Set the column names in the structured array dtype
    dtype = [('a','int32'), ('b','float32'), ('c','float32')]
    values = np.zeros(2, dtype=dtype)
    df = pd.DataFrame(values, index=index)

    # Option 2: Alter the structured array column names after it has been created
    values = np.zeros(2, dtype='int32, float32, float32')
    values.dtype.names = columns
    df2 = pd.DataFrame(values, index=index, columns=columns)

    # Option 3: Alter the DataFrame column names after it has been created
    values = np.zeros(2, dtype='int32, float32, float32')
    df3 = pd.DataFrame(values, index=index)
    df3.columns = columns

    # Option 4: Use a dict of arrays, each of the right dtype:
    df4 = pd.DataFrame(
        {'a': np.zeros(2, dtype='int32'),
         'b': np.zeros(2, dtype='float32'),
         'c': np.zeros(2, dtype='float32')}, index=index, columns=columns)

    # Option 5: Concatenate DataFrames of the simple dtypes:
    df5 = pd.concat([
        pd.DataFrame(np.zeros((2,), dtype='int32'), index=index, columns=['a']),
        pd.DataFrame(np.zeros((2,2), dtype='float32'), index=index, columns=['b','c'])],
        axis=1)

    # Option 6: Alter the dtypes after the DataFrame has been formed. (This is not very efficient)
    values2 = np.zeros((2, 3))
    df6 = pd.DataFrame(values2, index=index, columns=columns)
    for col, dtype in zip(df6.columns, 'int32 float32 float32'.split()):
        df6[col] = df6[col].astype(dtype)

Each of the above options produces the same result:

       a  b  c
    x  0  0  0
    y  0  0  0

with dtypes:

    a      int32
    b    float32
    c    float32
    dtype: object
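As a quick way to check a result like the one above, here is a minimal sketch (reusing Option 1; the variable names are only illustrative). It looks at `df.dtypes` rather than `df.values.dtype`, since `.values` has to fit every column into one ndarray and so reports a single common dtype instead of the per-column types.

```python
import numpy as np
import pandas as pd

# Build the frame as in Option 1: field names double as column names.
dtype = [('a', 'int32'), ('b', 'float32'), ('c', 'float32')]
df = pd.DataFrame(np.zeros(2, dtype=dtype), index=['x', 'y'])

# Per-column dtypes: this is where the mixed types are visible.
print(df.dtypes)

# .values packs all columns into one array, so the int32 and float32
# columns are upcast to one shared floating-point dtype.
print(df.values.dtype)
```

So when a frame mixes integer and float columns, `df.dtypes` is the attribute to inspect.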

Why pd.DataFrame(values, index=index, columns=columns) creates a DataFrame filled with NaN:

values is a structured array with field names f0, f1, f2:

    In [171]: values
    Out[172]:
    array([(0, 0.0, 0.0), (0, 0.0, 0.0)],
          dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<f4')])

If you pass columns=['a', 'b', 'c'] to pd.DataFrame, pandas looks for fields with those names in the structured array values. Because the array only has the fields f0, f1 and f2, none of the requested names is found, and pandas fills those columns with NaN to represent the missing values.
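The lookup described above can be seen directly with a small sketch. The names `missing` and `matched` are only illustrative, and the behaviour is what I would expect from current pandas versions rather than something guaranteed for every release.

```python
import numpy as np
import pandas as pd

# Structured array with NumPy's default field names.
values = np.zeros(2, dtype='int32, float32, float32')
print(values.dtype.names)  # ('f0', 'f1', 'f2')

# Requested names do not exist as fields: the columns come back as NaN.
missing = pd.DataFrame(values, index=['x', 'y'], columns=['a', 'b', 'c'])
print(missing)

# Requested names match the fields: the zeros and their dtypes carry over.
matched = pd.DataFrame(values, index=['x', 'y'], columns=['f0', 'f1', 'f2'])
print(matched.dtypes)
```

In other words, with a structured array the columns argument selects existing fields by name; it does not rename them.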
