How to determine and set the column value only for the last occurrence of a repeating row

I am very new to Pandas and Python, so forgive me if this is a basic question. While working on my problem (download a few CSV files, find the product identifiers that go missing in subsequent files, and calculate the sold date from that), I made some changes to the way I clean these files. I have the following columns in a DataFrame loaded from multiple CSV files.

 store_id  stock_number  merchandise_id  date_acquired  color  price  MSRP   csv_date
 12973     7382          UISN78008       04/11/2017     Red    $3200  $3650  01/31/2017
 45973     9889          YHAN79807       08/09/2017     White  $3600  $3650  01/31/2017
 ...
 45973     9889          YHAN79807       08/09/2017     White  $3600  $3650  03/31/2017

The last row is the final occurrence of the item with merchandise_id 'YHAN79807'. I managed to find that last occurrence by following How to detect the first occurrence of duplicate rows in the Python Pandas Dataframe and changing it a bit. I used

  df1['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1]) 
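To make the snippet above self-contained, here is a minimal sketch of the same idea. The question does not show how `g` was built, so I am assuming it is a groupby on the merchandise identifier, with `df1` indexed by that same column; the sample data is a trimmed version of the table above.

```python
import pandas as pd

# Assumption: g is a groupby on merchandise_id, and df1 is indexed by it.
df1 = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "03/31/2017"],
})
g = df1.groupby("merchandise_id")

# g.indices maps each group key to the integer positions of its rows,
# so the last element is the position of the last occurrence.
df1 = df1.set_index("merchandise_id")
df1["dup_index"] = df1.index.map(lambda ind: g.indices[ind][-1])
print(df1)
```

Note that `g.indices[ind][-1]` is a shorter spelling of `g.indices[ind][len(g.indices[ind])-1]`.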

However, I want to set this value in the dup_index column only for the last occurrence of 'YHAN79807' as the product identifier. The rest of the duplicate rows with that product identifier should not get the value; they must stay empty. Only the last occurrence should carry it. I have not been able to do this yet. I tried several things, one of them:

 group = df1.groupby(['merchandiseID'])
 df1_index = df1.set_index(['merchandiseID'])
 df1[(len(group.indices[ind])-1) == group.indices[df1.merchandiseID]]['dup_index'] = 'succeed'

I tried assigning 'succeed' as a first step, to see whether comparing the columns would select the right rows, but this gave me the following error:

  FutureWarning: elementwise comparison failed; returning scalar instead,
  but in the future will perform elementwise comparison
    result = getattr(x, name)(y)
  ...
  raise TypeError('Cannot compare %s with series' %

Am I on the right track? What am I missing? Any pointers are appreciated.

Alice

1 answer

I think you need:

 g = df.groupby(['merchandise_id'])
 df1 = df.set_index(['merchandise_id'])
 df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][len(g.indices[ind])-1])
 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date  dup_index
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017          0
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017          2
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017          2
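The same mapping can also be written without `set_index`, using `groupby().transform` to broadcast the last index label of each group. This is a sketch on assumed sample data, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "03/31/2017"],
})

# transform broadcasts the scalar s.index[-1] (the label of the last row
# in each merchandise_id group) back to every row of that group.
df["dup_index"] = df.groupby("merchandise_id")["csv_date"].transform(
    lambda s: s.index[-1]
)
print(df)
```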

Or, if you want to flag only the last of the duplicated rows, combine two boolean masks with & :

 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017
 3     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017

 m1 = ~df.duplicated(['merchandise_id'], keep='last')
 m2 = df.duplicated(['merchandise_id'], keep=False)
 m = m1 & m2
 df.loc[m, 'new'] = 'succeed'
 print (df)
    store_id stock_number merchandise_id date_acquired  color price   MSRP    csv_date      new
 0     12973         7382      UISN78008    04/11/2017    Red $3200  $3650  01/31/2017      NaN
 1     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017      NaN
 2     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  01/31/2017      NaN
 3     45973         9889      YHAN79807    08/09/2017  White $3600  $3650  03/31/2017  succeed
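The two ideas can be combined to get what the question actually asks for: write the positional index of the last occurrence into dup_index, but only on that last duplicated row, leaving every other row empty (NaN). A minimal sketch on assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "merchandise_id": ["UISN78008", "YHAN79807", "YHAN79807", "YHAN79807"],
    "csv_date": ["01/31/2017", "01/31/2017", "01/31/2017", "03/31/2017"],
})

m1 = ~df.duplicated("merchandise_id", keep="last")  # last occurrence of each id
m2 = df.duplicated("merchandise_id", keep=False)    # id appears more than once
mask = m1 & m2

# Write the row's own index label only where the mask is True;
# all other rows keep the default NaN.
df.loc[mask, "dup_index"] = df.index[mask]
print(df)
```

Rows that are not duplicated at all (here 'UISN78008') stay NaN as well, because m2 excludes them.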
