Can conversion from pandas DataFrame to raw numpy array improve ML performance?

A pandas DataFrame used to restrict integer columns to a fixed data type (int64). NumPy arrays do not have this limitation; we can use np.int8, for example (different float sizes are available as well). (Note: this restriction no longer exists in pandas.)
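(For illustration, a quick check that both libraries accept reduced dtypes; the values here are arbitrary:)

import numpy as np
import pandas as pd

a = np.array([1, 2, 3], dtype=np.int8)    # 1 byte per element instead of 8
s = pd.Series([1, 2, 3], dtype=np.int8)   # pandas columns can use the same dtype

print(a.dtype, s.dtype)  # int8 int8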

Will scikit-learn performance improve on large datasets if we first convert the DataFrame to a raw NumPy array with reduced data types (e.g., from np.float64 to np.float16)? And if so, is an improvement only possible when memory is the bottleneck?

It seems that high-precision floats are often unnecessary for ML and mostly add to computational size and complexity.

If more context is needed: I am considering tree ensembles such as RandomForestRegressor for large datasets (4-16 GB, tens of millions of records with roughly 10-50 features). However, I am most interested in the general case.

1 answer

The documentation for RandomForestRegressor states that the input samples will be converted to dtype=np.float32 internally.
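(A minimal sketch of that upcast, using sklearn.utils.check_array, the validation helper scikit-learn estimators run on their inputs; the array here is just dummy data:)

import numpy as np
from sklearn.utils import check_array

X16 = np.random.rand(1000, 10).astype(np.float16)

# Requesting dtype=np.float32 mirrors what RandomForestRegressor does internally.
X_internal = check_array(X16, dtype=np.float32)

print(X16.dtype)         # float16
print(X_internal.dtype)  # float32 -- the data is copied and upcast before fitting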


So converting the Pandas DataFrame to a raw NumPy array with a reduced dtype will not speed up the model by itself; scikit-learn makes its own float32 copy of the data internally anyway.

That said, you can specify numpy dtypes for Pandas columns directly, which can save a lot of memory in your script. For example, when reading a .csv file:

import numpy as np
import pandas as pd

# load only the columns you need and give each one a reduced dtype up front
df = pd.read_csv(filename, usecols=[0, 4, 5, 10],
                 dtype={0: np.uint8,
                        4: np.uint32,
                        5: np.uint16,
                        10: np.float16})

If you want to change the dtype of a single column in an existing DataFrame, use Series.astype():

s = pd.Series(...)
s = s.astype(np.float16)   # downcast a standalone Series

df = pd.DataFrame(...)
df['col1'] = df['col1'].astype(np.float16)   # downcast a single DataFrame column

If you want to convert several columns of a DataFrame at once, use DataFrame.astype():

df = pd.DataFrame(...)
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
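
(A rough way to see the memory effect, assuming a toy DataFrame with two float64 columns:)

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.rand(1_000_000),
                   'col2': np.random.rand(1_000_000)})   # float64 by default

before = df.memory_usage(deep=True).sum()
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
after = df.memory_usage(deep=True).sum()

print(before, after)  # the float16 version should be roughly a quarter the size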
