How to determine and set the column value only for the last occurrence of a repeating row

I am very new to Pandas and Python, so forgive me if this is a basic question. While working on my problem (download a few CSV files, find the product identifiers that go missing in subsequent files, and calculate the sold date from that), I made some changes to the way I clean these files. I have the following columns in a DataFrame loaded from multiple CSV files.

 store_id  stock_number  merchandise_id  date_acquired  color  price  MSRP   csv_date
 12973     7382          UISN78008       04/11/2017     Red    $3200  $3650  01/31/2017
 45973     9889          YHAN79807       08/09/2017     White  $3600  $3650  01/31/2017
 ...
 45973     9889          YHAN79807       08/09/2017     White  $3600  $3650  03/31/2017

The last row is the final occurrence of the item with merchandise_id 'YHAN79807'. I managed to find that last occurrence by following How to detect the first occurrence of duplicate rows in the Python Pandas Dataframe and changing it a bit. I used

  df1['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1]) 
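To make the snippet above self-contained, here is a minimal sketch of the same idea. The question does not show how `g` was built, so I am assuming it is a groupby on the merchandise identifier, with `df1` indexed by that same column; the sample data is a trimmed version of the table above.

```python
import pandas as pd

# Assumption: g is a groupby on merchandise_id, and df1 is indexed by it.
df1 = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "03/31/2017"],
})
g = df1.groupby("merchandise_id")

# g.indices maps each group key to the integer positions of its rows,
# so the last element is the position of the last occurrence.
df1 = df1.set_index("merchandise_id")
df1["dup_index"] = df1.index.map(lambda ind: g.indices[ind][-1])
print(df1)
```

Note that `g.indices[ind][-1]` is a shorter spelling of `g.indices[ind][len(g.indices[ind])-1]`.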

However, I want to set this value in the dup_index column only for the last occurrence of 'YHAN79807' as the product identifier. The rest of the duplicate rows with that product identifier should not get the value; they must stay empty. Only the last occurrence should carry it. I have not been able to do this yet. I tried several things, one of them:

 group = df1.groupby(['merchandiseID'])
 df1_index = df1.set_index(['merchandiseID'])
 df1[(len(group.indices[ind])-1) == group.indices[df1.merchandiseID]]['dup_index'] = 'succeed'

I tried assigning 'succeed' as a first step, to see whether comparing the columns would select the right rows, but this gave me the following error:

  FutureWarning: elementwise comparison failed; returning scalar instead,
  but in the future will perform elementwise comparison
    result = getattr(x, name)(y)
  ...
  raise TypeError('Cannot compare %s with series' %

Am I on the right track? What am I missing? Any pointers are appreciated.

Alice

1 answer

I think you need:

 g = df.groupby(['merchandise_id'])
 df1 = df.set_index(['merchandise_id'])
 df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1])
 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date  dup_index
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017          0
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017          2
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017          2
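The same mapping can also be written without `set_index`, using `groupby().transform` to broadcast the last index label of each group. This is a sketch on assumed sample data, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "03/31/2017"],
})

# transform broadcasts the scalar s.index[-1] (the label of the last row
# in each merchandise_id group) back to every row of that group.
df["dup_index"] = df.groupby("merchandise_id")["csv_date"].transform(
    lambda s: s.index[-1]
)
print(df)
```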

Or, if you want to flag only the last of the duplicated rows, combine two boolean masks with & :

 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017
 3     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017

 m1 = ~df.duplicated(['merchandise_id'], keep='last')
 m2 = df.duplicated(['merchandise_id'], keep=False)
 m = m1 & m2
 df.loc[m, 'new'] = 'succeed'
 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date      new
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017      NaN
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017      NaN
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017      NaN
 3     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017  succeed
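The two ideas can be combined to get what the question actually asks for: write the positional index of the last occurrence into dup_index, but only on that last duplicated row, leaving every other row empty (NaN). A minimal sketch on assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "01/31/2017", "03/31/2017"],
})

m1 = ~df.duplicated("merchandise_id", keep="last")  # last occurrence of each id
m2 = df.duplicated("merchandise_id", keep=False)    # id appears more than once
mask = m1 & m2

# Write the row's own index label only where the mask is True;
# all other rows keep the default NaN.
df.loc[mask, "dup_index"] = df.index[mask]
print(df)
```

Rows that are not duplicated at all (here 'UISN78008') stay NaN as well, because m2 excludes them.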
