Replace duplicate column values in Pandas

Question


I have a simple data frame like this:

    import pandas as pd

    df = [
        {'col1': 'A', 'col2': 'B', 'col3': 'C', 'col4': '0'},
        {'col1': 'M', 'col2': '0', 'col3': 'M', 'col4': '0'},
        {'col1': 'B', 'col2': 'B', 'col3': '0', 'col4': 'B'},
        {'col1': 'X', 'col2': '0', 'col3': 'Y', 'col4': '0'},
    ]
    df = pd.DataFrame(df)
    df = df[['col1', 'col2', 'col3', 'col4']]
    df

Which looks like this:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| A    | B    | C    | 0    |
| M    | 0    | M    | 0    |
| B    | B    | 0    | B    |
| X    | 0    | Y    | 0    |

I just want to replace duplicate values with the character "0", row by row. This boils down to keeping only the first occurrence of each value in a row and replacing every later repeat, for example:

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| A    | B    | C    | 0    |
| M    | 0    | 0    | 0    |
| B    | 0    | 0    | 0    |
| X    | 0    | Y    | 0    |

It seems so simple, but I'm stuck. Any boosts in the right direction would really be appreciated.

+6

python pandas

Monica heddneck Oct 6 '16 at 23:49


1 answer

maxymoo · Accepted Answer · 2016-10-07T00:32:49+0000

You can use the duplicated method to return a boolean indexer that marks whether each element repeats an earlier one:

    In [214]: pd.Series(['M', '0', 'M', '0']).duplicated()
    Out[214]:
    0    False
    1    False
    2     True
    3     True
    dtype: bool
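By default, duplicated flags every occurrence after the first (keep='first'), which matches what the question asks for. As a small sketch of how the keep parameter changes the flags, using the same example values (the variable names here are only illustrative):

```python
import pandas as pd

s = pd.Series(['M', '0', 'M', '0'])

# Default keep='first': only later repeats are flagged.
first = s.duplicated()

# keep='last': earlier repeats are flagged, the last occurrence is kept.
last = s.duplicated(keep='last')

# keep=False: every value that appears more than once is flagged.
every = s.duplicated(keep=False)

print(first.tolist())
print(last.tolist())
print(every.tolist())
```

For this problem the default is the one you want, since the first occurrence in each row should survive.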

Then you can create a mask by applying it along the rows of your data frame, and use where to perform the replacement:

    is_duplicate = df.apply(pd.Series.duplicated, axis=1)
    df.where(~is_duplicate, 0)

      col1 col2 col3 col4
    0    A    B    C    0
    1    M    0    0    0
    2    B    0    0    0
    3    X    0    Y    0
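Putting the two steps together, here is a minimal self-contained sketch built on the frame from the question. One assumption that differs from the snippet above: it replaces with the string '0' rather than the integer 0, on the guess that the columns should stay purely textual. The name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {'col1': 'A', 'col2': 'B', 'col3': 'C', 'col4': '0'},
        {'col1': 'M', 'col2': '0', 'col3': 'M', 'col4': '0'},
        {'col1': 'B', 'col2': 'B', 'col3': '0', 'col4': 'B'},
        {'col1': 'X', 'col2': '0', 'col3': 'Y', 'col4': '0'},
    ],
    columns=['col1', 'col2', 'col3', 'col4'],
)

# Flag, within each row, every value that repeats an earlier one.
is_duplicate = df.apply(pd.Series.duplicated, axis=1)

# Keep values where the mask is False; replace the rest with '0'
# (assumption: a string replacement is wanted, not the integer 0).
result = df.where(~is_duplicate, '0')
print(result)
```

Note that a '0' already present in a row also counts as a value, so a second '0' in the same row is flagged as a duplicate and replaced with '0' again, which leaves it unchanged.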
