Pandas "diff()" with strings

How can I mark a row in a data frame every time a column changes its string value?

Example:

Input:

    ColumnA  ColumnB
    1        Blue
    2        Blue
    3        Red
    4        Red
    5        Yellow

    # diff() won't work here with strings; it only works on numerical values
    dataframe['changed'] = dataframe['ColumnB'].diff()

Desired output:

    ColumnA  ColumnB  changed
    1        Blue     0
    2        Blue     0
    3        Red      1
    4        Red      0
    5        Yellow   1
3 answers

I get better performance with the ne method than with the != operator:

 df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int) 
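To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps on the sample data from the question. The intermediate name previous is only for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# shift() moves every value down one row, leaving NaN in the first row.
previous = df["ColumnB"].shift()

# bfill() copies the next valid value into that NaN, so the first row
# ends up being compared with itself and is never flagged.
previous = previous.bfill()

# ne() is the method form of !=; astype(int) turns True/False into 1/0.
df["changed"] = df["ColumnB"].ne(previous).astype(int)

print(df)
```

On this data the changed column comes out as 0, 0, 1, 0, 1, which matches the output asked for in the question.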

Timings

Using the following setup to create a larger DataFrame:

 df = pd.concat([df]*10**5, ignore_index=True) 

I get the following timings:

    %timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
    10 loops, best of 3: 38.1 ms per loop

    %timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
    10 loops, best of 3: 77.7 ms per loop

    %timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
    10 loops, best of 3: 99.6 ms per loop

    %timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
    10 loops, best of 3: 19.3 ms per loop
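The %timeit lines above only work inside IPython or Jupyter. As a rough sketch, a similar comparison can be run as a plain script with the standard timeit module. The repetition count of 10**4 is an assumption chosen to keep the run short (the setup above uses 10**5), and the names candidates and timings_ms are made up for this example. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to get a larger frame.
# 10**4 repetitions (an assumption, smaller than the 10**5 used above)
# keeps the run short.
base = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})
df = pd.concat([base] * 10**4, ignore_index=True)

# Three of the expressions timed above, wrapped as callables.
candidates = {
    "ne + shift + bfill": lambda: df["ColumnB"].ne(df["ColumnB"].shift().bfill()).astype(int),
    "!= + shift": lambda: (df["ColumnB"] != df["ColumnB"].shift()).astype(int),
    "ne + shift": lambda: df["ColumnB"].ne(df["ColumnB"].shift()).astype(int),
}

# Best of 3 repeats, 10 calls each, reported per call in milliseconds.
timings_ms = {}
for name, func in candidates.items():
    best = min(timeit.repeat(func, number=10, repeat=3))
    timings_ms[name] = best / 10 * 1000

for name, ms in timings_ms.items():
    print(f"{name}: {ms:.2f} ms per loop")
```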

Use .shift and compare:

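    # != flags rows that differ from the previous row (== would flag the unchanged ones);
    # fillna makes the first row compare equal to itself
    dataframe['changed'] = (dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])).astype(int)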

Comparing with shift works for me. The first row is then set to 0, because it has no previous value to compare with:

    df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
    # .ix has been removed from pandas; .loc does the same label-based assignment here
    df.loc[0, 'diff'] = 0
    print(df)

       ColumnA ColumnB  diff
    0        1    Blue     0
    1        2    Blue     0
    2        3     Red     1
    3        4     Red     0
    4        5  Yellow     1

Going by the timings in the other answer, the fastest option is ne:

    df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
    # .ix has been removed from pandas; .loc does the same label-based assignment here
    df.loc[0, 'diff'] = 0
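The first-row reset above relies on the label 0 existing in the index. Below is a minimal sketch of a variant that resets by position instead, so it should also work when the index is not the default 0..n-1. The helper name mark_changes and the letter index are hypothetical and not from the original answer.

```python
import pandas as pd


def mark_changes(series: pd.Series) -> pd.Series:
    """Return 1 where a value differs from the one in the previous row, else 0."""
    changed = series.ne(series.shift()).astype(int)
    # The first row has no predecessor; shift() gives NaN there, so ne()
    # flags it. Reset it by position so this works for any index labels.
    if len(changed) > 0:
        changed.iloc[0] = 0
    return changed


# Sample data from the question, with a non-default index on purpose.
df = pd.DataFrame(
    {
        "ColumnA": [1, 2, 3, 4, 5],
        "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
    },
    index=list("abcde"),
)
df["diff"] = mark_changes(df["ColumnB"])
print(df)
```

Wrapping the logic in a function is only a convenience here. The same two lines can be used inline if the reset by position is all that is needed.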