Aggregate operations on dtype objects

I am trying to figure out how I can apply cumulative functions to object columns. There are several alternatives for numeric data, such as cumsum and cumcount . There is also df.expanding , which can be used with apply . But the functions that I pass to apply do not work on object columns.

    import pandas as pd

    df = pd.DataFrame({"C1": [1, 2, 3, 4],
                       "C2": [{"A"}, {"B"}, {"C"}, {"D"}],
                       "C3": ["A", "B", "C", "D"],
                       "C4": [["A"], ["B"], ["C"], ["D"]]})
    df
    Out:
       C1   C2 C3   C4
    0   1  {A}  A  [A]
    1   2  {B}  B  [B]
    2   3  {C}  C  [C]
    3   4  {D}  D  [D]

In the dataframe, I have integers, sets, strings and lists. Now, if I try expanding().apply(sum) , only the numeric column gets a running total:

    df.expanding().apply(sum)
    Out[69]:
         C1   C2 C3   C4
    0   1.0  {A}  A  [A]
    1   3.0  {B}  B  [B]
    2   6.0  {C}  C  [C]
    3  10.0  {D}  D  [D]

My expectation was that, since addition is defined for strings and lists, I would get something like the following:

         C1   C2    C3            C4
    0   1.0  {A}     A           [A]
    1   3.0  {B}    AB        [A, B]
    2   6.0  {C}   ABC     [A, B, C]
    3  10.0  {D}  ABCD  [A, B, C, D]

I also tried something like this:

    from functools import reduce

    df.expanding().apply(lambda r: reduce(lambda x, y: x + y**2, r))
    Out:
         C1   C2 C3   C4
    0   1.0  {A}  A  [A]
    1   5.0  {B}  B  [B]
    2  14.0  {C}  C  [C]
    3  30.0  {D}  D  [D]

This works as I expect for numbers: the previous result is x , and the current element is y . But I cannot use reduce with x.union(y) on the set column, for example.

So my question is: are there any expanding alternatives that I can use for objects? This example shows that expanding().apply() does not work with object dtype columns. I am looking for a general solution that supports applying a function of two inputs: the previous result and the current element.

python pandas dataframe
3 answers

It turns out that this is impossible.

Continuing with the same pattern:

    def burndowntheworld(ser):
        print('Are you sure?')
        return ser / 0

    df.select_dtypes(['object']).expanding().apply(burndowntheworld)
    Out:
        C2 C3   C4
    0  {A}  A  [A]
    1  {B}  B  [B]
    2  {C}  C  [C]
    3  {D}  D  [D]

If the column dtype is object, the function is never even called, and pandas has no alternative that works with objects. The same is true for rolling().apply() .

In a sense, this is good, because expanding().apply with a user-defined function has O(n**2) complexity. In special cases, such as cumsum , ewma , etc., the recursive nature of the operation reduces the complexity to linear time, but in the general case the function must be evaluated on the first n elements, then on the first n + 1 elements, and so on. So especially for a function that depends only on the current value and the previous result, expanding is quite inefficient. Not to mention that storing lists or sets in a DataFrame is not recommended in the first place.
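To make that quadratic cost concrete, here is a small sketch (my own illustration, not from the answer above) that counts how many elements the user-defined function receives in total, using a numeric column so the function actually runs:

```python
import pandas as pd

calls = []

def spy(window):
    # Record how many elements this window contains. expanding()
    # re-sends the whole prefix on every step: 1, 2, ..., n elements,
    # so the total work across all windows is quadratic in n.
    calls.append(len(window))
    return window.sum()

df = pd.DataFrame({"C1": [1, 2, 3, 4]})
df["C1"].expanding().apply(spy, raw=True)

print(calls)       # window sizes grow towards n
print(sum(calls))  # total elements processed, on the order of n * (n + 1) / 2
```

For n = 4 rows the windows already cover 10 elements in total; a recursive formulation (previous result plus current element) would touch only 4.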

So the answer is: if your data is not numeric and the function depends on the previous result and the current element, just use a for loop. It will be more efficient anyway.
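As a sketch of that loop approach: itertools.accumulate from the standard library is exactly this pattern, a function of the previous result and the current element, applied in a single linear pass. Assuming the example frame from the question:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({"C2": [{"A"}, {"B"}, {"C"}, {"D"}],
                   "C3": ["A", "B", "C", "D"],
                   "C4": [["A"], ["B"], ["C"], ["D"]]})

# accumulate(iterable, func) yields func(previous_result, current_element)
# at each step, i.e. the running reduction, in linear time.
df["C2"] = list(accumulate(df["C2"], lambda x, y: x | y))  # running union
df["C3"] = list(accumulate(df["C3"], lambda x, y: x + y))  # running concat
df["C4"] = list(accumulate(df["C4"], lambda x, y: x + y))

print(df)
```

This produces the cumulative sets, strings and lists the question asked for, without going through expanding() at all.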


I think you can use cumsum ; sets are the exception, and for them you need to convert to list first and then back to set . Btw, storing sets ( C2 ) or lists of lists ( C4 ) in DataFrame columns is not recommended.

    print(df)
       C1   C2 C3   C4
    0   1  {A}  A  [A]
    1   2  {B}  B  [B]
    2   3  {C}  C  [C]
    3   4  {D}  D  [D]

    print(df[['C1','C3','C4']].cumsum())
       C1    C3            C4
    0   1     A           [A]
    1   3    AB        [A, B]
    2   6   ABC     [A, B, C]
    3  10  ABCD  [A, B, C, D]

    df['C2'] = df['C2'].apply(list)
    df = df.cumsum()
    df['C2'] = df['C2'].apply(set)

    print(df)
       C1            C2    C3            C4
    0   1           {A}     A           [A]
    1   3        {A, B}    AB        [A, B]
    2   6     {A, C, B}   ABC     [A, B, C]
    3  10  {A, C, B, D}  ABCD  [A, B, C, D]

Well, you can define a custom function:

    from functools import reduce

    def custom_cumsum(df):
        nrows, ncols = df.shape
        index, columns = df.index, df.columns
        rets = {}
        for col in df.columns:
            try:
                # Works for numeric, string and list columns.
                new_col = {col: df.loc[:, col].cumsum()}
            except TypeError as e:
                if 'set' in str(e):
                    # Sets do not support +, so reduce each prefix with union.
                    new_col = {col: [reduce(set.union, df.loc[:, col][:(i + 1)])
                                     for i in range(nrows)]}
            rets.update(new_col)
        frame = pd.DataFrame(rets, index=index, columns=columns)
        return frame
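A self-contained sketch of the same fallback idea, stripped down to one numeric and one set column (the reduce(set.union, ...) over growing prefixes is the quadratic path discussed in the accepted answer):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({"C1": [1, 2, 3, 4],
                   "C2": [{"A"}, {"B"}, {"C"}, {"D"}]})

# The numeric column cumsums directly; the set column cannot, so each
# prefix is reduced with set.union instead (quadratic, but it works).
out = pd.DataFrame({
    "C1": df["C1"].cumsum(),
    "C2": [reduce(set.union, df["C2"][: i + 1]) for i in range(len(df))],
}, index=df.index)

print(out["C2"].tolist())
```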
