Can conversion from pandas DataFrame to raw numpy array improve ML performance?

A pandas DataFrame used to restrict integer columns to a fixed data type (int64). NumPy arrays do not have this limitation; we can use np.int8, for example (different float sizes are available as well). (Note: this restriction no longer exists in pandas.)
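(For illustration, a quick check that both libraries accept reduced dtypes; the values here are arbitrary:)

import numpy as np
import pandas as pd

a = np.array([1, 2, 3], dtype=np.int8)    # 1 byte per element instead of 8
s = pd.Series([1, 2, 3], dtype=np.int8)   # pandas columns can use the same dtype

print(a.dtype, s.dtype)  # int8 int8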

Will scikit-learn performance improve on large datasets if we first convert the DataFrame to a raw NumPy array with reduced data types (e.g., from np.float64 to np.float16)? And if so, is an improvement only possible when memory is the bottleneck?

It seems that high-precision floats are often unnecessary for ML and mostly add to computational size and complexity.

If more context is needed: I am considering tree ensembles such as RandomForestRegressor for large datasets (4-16 GB, tens of millions of records with roughly 10-50 features). However, I am most interested in the general case.

1 answer

The documentation for RandomForestRegressor states that the input samples will be converted to dtype=np.float32 internally.
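(A minimal sketch of that upcast, using sklearn.utils.check_array, the validation helper scikit-learn estimators run on their inputs; the array here is just dummy data:)

import numpy as np
from sklearn.utils import check_array

X16 = np.random.rand(1000, 10).astype(np.float16)

# Requesting dtype=np.float32 mirrors what RandomForestRegressor does internally.
X_internal = check_array(X16, dtype=np.float32)

print(X16.dtype)         # float16
print(X_internal.dtype)  # float32 -- the data is copied and upcast before fitting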


So converting the Pandas DataFrame to a raw NumPy array with a reduced dtype will not speed up the model by itself; scikit-learn makes its own float32 copy of the data internally anyway.

That said, you can specify numpy dtypes for Pandas columns directly, which can save a lot of memory in your script. For example, when reading a .csv file:

import numpy as np
import pandas as pd

# load only the columns you need and give each one a reduced dtype up front
df = pd.read_csv(filename, usecols=[0, 4, 5, 10],
                 dtype={0: np.uint8,
                        4: np.uint32,
                        5: np.uint16,
                        10: np.float16})

If you want to change the dtype of a single column in an existing DataFrame, use Series.astype():

s = pd.Series(...)
s = s.astype(np.float16)   # downcast a standalone Series

df = pd.DataFrame(...)
df['col1'] = df['col1'].astype(np.float16)   # downcast a single DataFrame column

If you want to convert several columns of a DataFrame at once, use DataFrame.astype():

df = pd.DataFrame(...)
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
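
(A rough way to see the memory effect, assuming a toy DataFrame with two float64 columns:)

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.rand(1_000_000),
                   'col2': np.random.rand(1_000_000)})   # float64 by default

before = df.memory_usage(deep=True).sum()
df[['col1', 'col2']] = df[['col1', 'col2']].astype(np.float16)
after = df.memory_usage(deep=True).sum()

print(before, after)  # the float16 version should be roughly a quarter the size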
