Aggregate operations on dtype objects

I am trying to figure out how I can apply cumulative functions to object columns. There are several alternatives for numeric data, such as cumsum and cumcount . There is also df.expanding , which can be used with apply . But the functions that I pass to apply do not work on object columns.

    import pandas as pd

    df = pd.DataFrame({"C1": [1, 2, 3, 4],
                       "C2": [{"A"}, {"B"}, {"C"}, {"D"}],
                       "C3": ["A", "B", "C", "D"],
                       "C4": [["A"], ["B"], ["C"], ["D"]]})
    df
    Out:
       C1   C2 C3   C4
    0   1  {A}  A  [A]
    1   2  {B}  B  [B]
    2   3  {C}  C  [C]
    3   4  {D}  D  [D]

In the dataframe, I have integers, sets, strings and lists. Now, if I try expanding().apply(sum) , only the numeric column gets a running total:

    df.expanding().apply(sum)
    Out[69]:
         C1   C2 C3   C4
    0   1.0  {A}  A  [A]
    1   3.0  {B}  B  [B]
    2   6.0  {C}  C  [C]
    3  10.0  {D}  D  [D]

My expectation was that, since addition is defined for strings and lists, I would get something like the following:

         C1   C2    C3            C4
    0   1.0  {A}     A           [A]
    1   3.0  {B}    AB        [A, B]
    2   6.0  {C}   ABC     [A, B, C]
    3  10.0  {D}  ABCD  [A, B, C, D]

I also tried something like this:

    from functools import reduce

    df.expanding().apply(lambda r: reduce(lambda x, y: x + y**2, r))
    Out:
         C1   C2 C3   C4
    0   1.0  {A}  A  [A]
    1   5.0  {B}  B  [B]
    2  14.0  {C}  C  [C]
    3  30.0  {D}  D  [D]

This works as I expect for numbers: the previous result is x , and the current element is y . But I cannot use reduce with x.union(y) on the set column, for example.

So my question is: are there any expanding alternatives that I can use for objects? This example shows that expanding().apply() does not work with object dtype columns. I am looking for a general solution that supports applying a function of two inputs: the previous result and the current element.

python pandas dataframe
3 answers

It turns out that this is impossible.

Continuing with the same pattern:

    def burndowntheworld(ser):
        print('Are you sure?')
        return ser / 0

    df.select_dtypes(['object']).expanding().apply(burndowntheworld)
    Out:
        C2 C3   C4
    0  {A}  A  [A]
    1  {B}  B  [B]
    2  {C}  C  [C]
    3  {D}  D  [D]

If the column dtype is object, the function is never even called, and pandas has no alternative that works with objects. The same is true for rolling().apply() .

In a sense, this is good, because expanding().apply with a user-defined function has O(n**2) complexity. In special cases, such as cumsum , ewma , etc., the recursive nature of the operation reduces the complexity to linear time, but in the general case the function must be evaluated on the first n elements, then on the first n + 1 elements, and so on. So especially for a function that depends only on the current value and the previous result, expanding is quite inefficient. Not to mention that storing lists or sets in a DataFrame is not recommended in the first place.
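To make that quadratic cost concrete, here is a small sketch (my own illustration, not from the answer above) that counts how many elements the user-defined function receives in total, using a numeric column so the function actually runs:

```python
import pandas as pd

calls = []

def spy(window):
    # Record how many elements this window contains. expanding()
    # re-sends the whole prefix on every step: 1, 2, ..., n elements,
    # so the total work across all windows is quadratic in n.
    calls.append(len(window))
    return window.sum()

df = pd.DataFrame({"C1": [1, 2, 3, 4]})
df["C1"].expanding().apply(spy, raw=True)

print(calls)       # window sizes grow towards n
print(sum(calls))  # total elements processed, on the order of n * (n + 1) / 2
```

For n = 4 rows the windows already cover 10 elements in total; a recursive formulation (previous result plus current element) would touch only 4.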

So the answer is: if your data is not numeric and the function depends on the previous result and the current element, just use a for loop. It will be more efficient anyway.
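As a sketch of that loop approach: itertools.accumulate from the standard library is exactly this pattern, a function of the previous result and the current element, applied in a single linear pass. Assuming the example frame from the question:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({"C2": [{"A"}, {"B"}, {"C"}, {"D"}],
                   "C3": ["A", "B", "C", "D"],
                   "C4": [["A"], ["B"], ["C"], ["D"]]})

# accumulate(iterable, func) yields func(previous_result, current_element)
# at each step, i.e. the running reduction, in linear time.
df["C2"] = list(accumulate(df["C2"], lambda x, y: x | y))  # running union
df["C3"] = list(accumulate(df["C3"], lambda x, y: x + y))  # running concat
df["C4"] = list(accumulate(df["C4"], lambda x, y: x + y))

print(df)
```

This produces the cumulative sets, strings and lists the question asked for, without going through expanding() at all.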


I think you can use cumsum ; sets are the exception, and for them you need to convert to list first and then back to set . Btw, storing sets ( C2 ) or lists of lists ( C4 ) in DataFrame columns is not recommended.

    print(df)
       C1   C2 C3   C4
    0   1  {A}  A  [A]
    1   2  {B}  B  [B]
    2   3  {C}  C  [C]
    3   4  {D}  D  [D]

    print(df[['C1','C3','C4']].cumsum())
       C1    C3            C4
    0   1     A           [A]
    1   3    AB        [A, B]
    2   6   ABC     [A, B, C]
    3  10  ABCD  [A, B, C, D]

    df['C2'] = df['C2'].apply(list)
    df = df.cumsum()
    df['C2'] = df['C2'].apply(set)

    print(df)
       C1            C2    C3            C4
    0   1           {A}     A           [A]
    1   3        {A, B}    AB        [A, B]
    2   6     {A, C, B}   ABC     [A, B, C]
    3  10  {A, C, B, D}  ABCD  [A, B, C, D]

Well, you can define a custom function:

    from functools import reduce

    def custom_cumsum(df):
        nrows, ncols = df.shape
        index, columns = df.index, df.columns
        rets = {}
        for col in df.columns:
            try:
                # Works for numeric, string and list columns.
                new_col = {col: df.loc[:, col].cumsum()}
            except TypeError as e:
                if 'set' in str(e):
                    # Sets do not support +, so reduce each prefix with union.
                    new_col = {col: [reduce(set.union, df.loc[:, col][:(i + 1)])
                                     for i in range(nrows)]}
            rets.update(new_col)
        frame = pd.DataFrame(rets, index=index, columns=columns)
        return frame
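A self-contained sketch of the same fallback idea, stripped down to one numeric and one set column (the reduce(set.union, ...) over growing prefixes is the quadratic path discussed in the accepted answer):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({"C1": [1, 2, 3, 4],
                   "C2": [{"A"}, {"B"}, {"C"}, {"D"}]})

# The numeric column cumsums directly; the set column cannot, so each
# prefix is reduced with set.union instead (quadratic, but it works).
out = pd.DataFrame({
    "C1": df["C1"].cumsum(),
    "C2": [reduce(set.union, df["C2"][: i + 1]) for i in range(len(df))],
}, index=df.index)

print(out["C2"].tolist())
```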
