Dask DataFrame Groupby Partitions

I have some fairly large csv files (~10 GB) and I would like to use dask for analysis. However, the results of my groupby change depending on the number of partitions I set when reading the data into a dask object. My understanding was that dask takes advantage of partitions for out-of-core processing benefits, but that it would still return the correct groupby output. That doesn't seem to be the case, and I'm struggling to work out which alternative settings are necessary. Below is a small example:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': np.arange(100),
                   'B': np.random.randn(100),
                   'C': np.random.randn(100),
                   'Grp1': np.repeat([1, 2], 50),
                   'Grp2': np.repeat([3, 4, 5, 6], 25)})

test_dd1 = dd.from_pandas(df, npartitions=1)
test_dd2 = dd.from_pandas(df, npartitions=2)
test_dd5 = dd.from_pandas(df, npartitions=5)
test_dd10 = dd.from_pandas(df, npartitions=10)
test_dd100 = dd.from_pandas(df, npartitions=100)

def test_func(x):
    # fraction of positive B values within each group
    x['New_Col'] = len(x[x['B'] > 0.]) / len(x['B'])
    return x

test_dd1.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd2.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd5.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3     0.45
1  1 -1.107799  1.075471     1     3     0.45
2  2 -0.719420 -0.574381     1     3     0.45
3  3 -1.287547 -0.749218     1     3     0.45
4  4  0.677617 -0.908667     1     3     0.45

test_dd10.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3      0.5
1  1 -1.107799  1.075471     1     3      0.5
2  2 -0.719420 -0.574381     1     3      0.5
3  3 -1.287547 -0.749218     1     3      0.5
4  4  0.677617 -0.908667     1     3      0.5

test_dd100.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3        0
1  1 -1.107799  1.075471     1     3        0
2  2 -0.719420 -0.574381     1     3        0
3  3 -1.287547 -0.749218     1     3        0
4  4  0.677617 -0.908667     1     3        1

df.groupby(['Grp1', 'Grp2']).apply(test_func).head()
   A         B         C  Grp1  Grp2  New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48
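One rough sanity check (just a sketch using the dask frames defined above, not something from the original docs) is to print which (Grp1, Grp2) pairs land in each partition:

# rough check: which (Grp1, Grp2) pairs show up in each of the 5 partitions?
for i in range(test_dd5.npartitions):
    part = test_dd5.get_partition(i).compute()
    print(i, sorted(part.groupby(['Grp1', 'Grp2']).groups.keys()))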

Is the groupby step only performed within each partition, rather than looking over the full dataframe? In this case it is trivial to set npartitions=1, and that doesn't seem to hurt performance much, but since read_csv automatically sets a certain number of partitions, how do you set up the call to make sure the groupby results are accurate? A workaround I've been considering is sketched after this paragraph.
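For reference, the only workaround I'm fairly sure about is sketched below (the file name and blocksize are just placeholders, and test_func is the function defined above); it collapses everything to a single partition before grouping, which matches the pandas result but gives up most of the parallelism:

import dask.dataframe as dd

# blocksize controls how read_csv splits the file into partitions
ddf = dd.read_csv('my_large_file.csv', blocksize='256MB')
print(ddf.npartitions)

# collapse to a single partition before grouping, so each group sees all of its rows
result = (ddf.repartition(npartitions=1)
             .groupby(['Grp1', 'Grp2'])
             .apply(test_func)
             .compute())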

Thanks!

1 answer

I am surprised by this result. Groupby.apply should return the same results regardless of the number of partitions. If you can supply a reproducible example, I recommend raising an issue and one of the developers will take a look.
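Something along these lines would work as a self-contained reproducer (just a sketch built from the code in the question, with a fixed seed so the numbers are deterministic):

import numpy as np
import pandas as pd
import dask.dataframe as dd

np.random.seed(0)  # fixed seed so the report is deterministic
df = pd.DataFrame({'A': np.arange(100),
                   'B': np.random.randn(100),
                   'C': np.random.randn(100),
                   'Grp1': np.repeat([1, 2], 50),
                   'Grp2': np.repeat([3, 4, 5, 6], 25)})

def test_func(x):
    x['New_Col'] = len(x[x['B'] > 0.]) / len(x['B'])
    return x

expected = df.groupby(['Grp1', 'Grp2']).apply(test_func).sort_index()

for n in [1, 2, 5, 10, 100]:
    ddf = dd.from_pandas(df, npartitions=n)
    result = ddf.groupby(['Grp1', 'Grp2']).apply(test_func).compute().sort_index()
    # compare against the plain pandas result for each partition count
    print(n, result['New_Col'].equals(expected['New_Col']))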
