Dask array from DataFrame

Is there a way to easily convert a DataFrame of numeric values to an array, similar to .values on a pandas DataFrame? I cannot find a way to do this with the API provided, but I would expect this to be a common operation.

+8
dask
2 answers

Edit: yes, now this is trivial

You can use the .values property:

 x = df.values 
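
For example, a minimal sketch (the example frame here is made up for illustration); .values on a dask DataFrame gives a lazy dask array, though the chunk sizes along the rows are unknown until computed:

 import pandas as pd
 import dask.dataframe as dd

 # Hypothetical example: build a small dask DataFrame from pandas.
 pdf = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5.0, 6.0, 7.0, 8.0]})
 df = dd.from_pandas(pdf, npartitions=2)

 x = df.values          # lazy dask array
 print(x.compute())     # materializes the underlying numpy array

If you need known chunk sizes (for example, to slice by position), recent dask versions also offer df.to_dask_array(lengths=True), which computes the partition lengths up front.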

Older, now incorrect answer

There is currently no trivial way to do this. This is because dask.array must know the length of all of its chunks, and dask.dataframe does not know these lengths. This cannot be a completely lazy operation.

This can be done using dask.delayed as follows:

 import dask.array as da
 from dask import compute

 def to_dask_array(df):
     partitions = df.to_delayed()
     shapes = [part.values.shape for part in partitions]
     dtype = partitions[0].dtype

     results = compute(dtype, *shapes)  # trigger computation to find shape
     dtype, shapes = results[0], results[1:]

     chunks = [da.from_delayed(part.values, shape, dtype)
               for part, shape in zip(partitions, shapes)]
     return da.concatenate(chunks, axis=0)
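
A quick usage sketch with a single numeric column (the example frame and column name are hypothetical; a Series is passed so that partitions[0].dtype is defined):

 import pandas as pd
 import dask.dataframe as dd

 # Hypothetical example frame; to_dask_array is the function defined above.
 pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [5.0, 6.0, 7.0, 8.0]})
 ddf = dd.from_pandas(pdf, npartitions=2)

 arr = to_dask_array(ddf['a'])
 print(arr.chunks)      # concrete chunk sizes, e.g. ((2, 2),)
 print(arr.compute())   # array([1., 2., 3., 4.])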
+8

I think there could be another, shorter way.

 import dask.array as da
 import dask.dataframe as dd

 ruta = '...'  # path to your CSV file
 df = dd.read_csv(ruta)
 x = df['column you want to transform in array']

 def transf(x):
     xd = x.to_delayed()
     full = [da.from_delayed(i, i.compute().shape, i.compute().dtype)
             for i in xd]
     return da.concatenate(full)

 x_array = transf(x)

Also, if you want to convert a dask DataFrame with N columns, so that each row of the resulting array is a tuple of that row's values:

 Array((x1, x2, x3), (y1, y2, y3), ...)

you must change

from

 i.compute().dtype 

to

 i.compute().dtypes 

thanks
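
A minimal sketch of the multi-column case (the example frame and names are hypothetical); this variant takes .values of each delayed partition, so every block is already 2-D, and assumes one common float dtype for the whole frame:

 import pandas as pd
 import dask.array as da
 import dask.dataframe as dd

 # Hypothetical two-column numeric frame.
 pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})
 ddf = dd.from_pandas(pdf, npartitions=2)

 def transf_frame(ddf):
     parts = ddf.to_delayed()
     # Each partition's .values is a 2-D (rows x columns) block.
     blocks = [da.from_delayed(p.values, p.values.compute().shape, float)
               for p in parts]
     return da.concatenate(blocks, axis=0)

 arr = transf_frame(ddf)
 print(arr.compute())   # [[1. 4.] [2. 5.] [3. 6.]]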

+1
