Remove cancelling rows from a Pandas DataFrame

I have a list of invoices sent to customers. Sometimes a bad invoice is sent, which is later canceled. My Pandas DataFrame looks something like this, except much larger (~3 million rows):

    index | customer | invoice_nr | amount | date
    ------|----------|------------|--------|-----------
    0     | 1        | 1          |  10    | 01-01-2016
    1     | 1        | 1          | -10    | 01-01-2016
    2     | 1        | 1          |  11    | 01-01-2016
    3     | 1        | 2          |  10    | 02-01-2016
    4     | 2        | 3          |   7    | 01-01-2016
    5     | 2        | 4          |  12    | 02-01-2016
    6     | 2        | 4          |   8    | 02-01-2016
    7     | 2        | 4          | -12    | 02-01-2016
    8     | 2        | 4          |   4    | 02-01-2016
    ...   | ...      | ...        | ...    | ...

Now I want to delete every pair of rows for which customer, invoice_nr, and date are identical but amount has opposite values.
Corrections always occur on the same day, under the same invoice number. The invoice number is uniquely associated with the customer and always corresponds to one transaction (which may consist of several components, e.g. customer = 2, invoice_nr = 4). An invoice is corrected either to change the amount or to split it into smaller components, so a canceled value is never repeated under the same invoice_nr.

Any help on how to program this would be greatly appreciated.

+5
3 answers
    def remove_cancelled_transactions(df):
        trans_neg = df.amount < 0
        # drop each negative row together with the row it cancels
        return df.loc[~(trans_neg | trans_neg.shift(-1))]

    groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
    result = df.groupby(groups, as_index=False, group_keys=False) \
               .apply(remove_cancelled_transactions)


+2

You can use `filter` to find all groups where the number of rows is even and the amount sums to 0 (these are the canceled pairs), and then drop them by index:

    print(df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
            .filter(lambda x: (len(x) % 2 == 0) and (x.amount.sum() == 0)))

           customer  invoice_nr  amount        date
    index
    0             1           1      10  01-01-2016
    1             1           1     -10  01-01-2016
    5             2           4      12  02-01-2016
    7             2           4     -12  02-01-2016

    idx = df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()]) \
            .filter(lambda x: (len(x) % 2 == 0) and (x.amount.sum() == 0)).index
    print(idx)
    Int64Index([0, 1, 5, 7], dtype='int64', name='index')

    print(df.drop(idx))
           customer  invoice_nr  amount        date
    index
    2             1           1      11  01-01-2016
    3             1           2      10  02-01-2016
    4             2           3       7  01-01-2016
    6             2           4       8  02-01-2016
    8             2           4       4  02-01-2016

EDIT based on the comments:

If the real data contains no duplicated absolute amounts for a given customer, invoice, and date, you can simply use `drop_duplicates`:

    print(df)
       index  customer  invoice_nr  amount        date
    0      0         1           1      10  01-01-2016
    1      1         1           1     -10  01-01-2016
    2      2         1           1      11  01-01-2016
    3      3         1           2      10  02-01-2016
    4      4         2           3       7  01-01-2016
    5      5         2           4      12  02-01-2016
    6      6         2           4       8  02-01-2016
    7      7         2           4     -12  02-01-2016
    8      8         2           4       4  02-01-2016

    df['amount_abs'] = df.amount.abs()
    df.drop_duplicates(['customer', 'invoice_nr', 'date', 'amount_abs'],
                       keep=False, inplace=True)
    df.drop('amount_abs', axis=1, inplace=True)
    print(df)
       index  customer  invoice_nr  amount        date
    2      2         1           1      11  01-01-2016
    3      3         1           2      10  02-01-2016
    4      4         2           3       7  01-01-2016
    6      6         2           4       8  02-01-2016
    8      8         2           4       4  02-01-2016
+2

What if you just do a groupby on all three key fields and sum? The resulting totals exclude any canceled invoices:

 df2 = df.groupby(['customer','invoice_nr','date']).sum() 

leads to

                                    amount
    customer invoice_nr date
    1        1          01-01-2016      11
             2          02-01-2016      10
    2        3          01-01-2016       7
             4          02-01-2016      12

Note that this also collapses legitimate split components into a single total (e.g. invoice 4's 8 and 4 become part of one 12), so it only suits use cases where per-row detail is not needed.
0
