I am trying to aggregate some statistics over a groupby, processing the data in pieces because there are a lot of lines (18 million). I want to count the number of rows in each group within each piece and then sum those counts across pieces. I can add the per-piece count() results together, but when a group is not present in one of them, the result for that group is NaN. See this case:
>>> import pandas as pd
>>> df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
...                    'Y': range(12)})
>>> df
    X   Y
0   A   0
1   B   1
2   C   2
3   A   3
4   B   4
5   C   5
6   B   6
7   C   7
8   D   8
9   B   9
10  C  10
11  D  11
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
     Y
X
A  NaN
B    4
C    4
D  NaN
But I want to see:
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
   Y
X
A  2
B  4
C  4
D  2
Is there a good way to do this? Note that in the real code I am working with an iterator over fragments of about a million lines each.
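For reference, here is a rough sketch of the kind of accumulation I have in mind (the file name, chunk size, and use of read_csv are placeholders, not my real input). Using DataFrame.add with fill_value=0 instead of + seems to give the output I want on the toy example, but I am not sure whether it is the idiomatic way to do this:

import pandas as pd

# Toy check: add() with fill_value=0 treats a group that is missing from one
# piece as 0 instead of propagating NaN.
df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
                   'Y': range(12)})
counts = df[0:6].groupby(['X']).count().add(df[6:].groupby(['X']).count(),
                                            fill_value=0)
print(counts)

# Sketch of the chunked accumulation ('data.csv' and chunksize are placeholders
# standing in for my real fragmented iterator).
total = None
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    piece = chunk.groupby(['X']).count()
    total = piece if total is None else total.add(piece, fill_value=0)

(The result comes back as floats because of the NaN/fill handling; if I need integers I can cast at the end with .astype(int).)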
Kyle