Displaying the difference between two Pandas DataFrames side by side - highlighting the difference

I am trying to pinpoint exactly what has changed between two data frames.

Suppose I have two Python Pandas data frames:

    "StudentRoster Jan-1":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.11                     False                Graduated
    113  Zoe    4.12                     True

    "StudentRoster Jan-2":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.21                     False                Graduated
    113  Zoe    4.12                     False                On vacation

My goal is to display an HTML table that:

  • Identifies rows that have changed (the values could be int, float, boolean, string)
  • Outputs those rows with the unchanged, OLD and NEW values (ideally as an HTML table) so that the consumer can clearly see what changed between the two data frames:

    "StudentRoster Difference Jan-1 - Jan-2":
    id   Name   score                    isEnrolled           Comment
    112  Nick   was 1.11| now 1.21       False                Graduated
    113  Zoe    4.12                     was True | now False was "" | now "On vacation"

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?

+56
python html pandas dataframe panel
Jun 13 '13 at 19:08
9 answers

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows have changed*:

    In [21]: ne = (df1 != df2).any(axis=1)

    In [22]: ne
    Out[22]:
    0    False
    1     True
    2     True
    dtype: bool

Then we see which entries have changed:

    In [23]: ne_stacked = (df1 != df2).stack()

    In [24]: changed = ne_stacked[ne_stacked]

    In [25]: changed.index.names = ['id', 'col']

    In [26]: changed
    Out[26]:
    id  col
    1   score         True
    2   isEnrolled    True
        Comment       True
    dtype: bool

Here the first index level is the row label and the second is the column that changed. The old and new values at those locations can then be pulled out of each frame:

    In [27]: difference_locations = np.where(df1 != df2)

    In [28]: changed_from = df1.values[difference_locations]

    In [29]: changed_to = df2.values[difference_locations]

    In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    Out[30]:
                   from           to
    id col
    1  score       1.11         1.21
    2  isEnrolled  True        False
       Comment     None  On vacation

* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can make sure you only look at the shared labels using df1.index & df2.index (df1.index.intersection(df2.index) in current pandas), but I think I'll leave that as an exercise.
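The exercise from the note can be sketched roughly as follows. This is only an illustration: the two small rosters are made-up stand-ins for the question's data (the extra student "Ann" with id 114 is hypothetical), and it assumes both frames are indexed by id. Without the alignment step, `df1 != df2` raises a ValueError because the frames are not identically labeled.

```python
import pandas as pd

# Hypothetical rosters: id 114 exists only in the second frame.
df1 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe"], "score": [2.17, 1.11, 4.12]},
    index=pd.Index([111, 112, 113], name="id"),
)
df2 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe", "Ann"], "score": [2.17, 1.21, 4.12, 3.30]},
    index=pd.Index([111, 112, 113, 114], name="id"),
)

# Keep only the labels present in both frames, in a common order.
common = df1.index.intersection(df2.index)
left, right = df1.loc[common], df2.loc[common]

# The element-wise comparison is now well defined.
ne = (left != right).any(axis=1)
print(ne[ne].index.tolist())  # → [112]
```

Rows that exist in only one frame (id 114 here) are silently dropped by this approach, so they would need to be reported separately if they matter.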

+62
Jun 13 '13 at 19:39

I ran into this problem too, but found an answer before finding this post:

Based on unutbu's answer, load your data ...

    import pandas as pd
    import io

    # Fixed-width text; column widths match the `widths` argument below.
    texts = ['''\
    id   Name   score                    isEnrolled       Date
    111  Jack                            True             2013-05-01 12:00:00
    112  Nick   1.11                     False            2013-05-12 15:05:23
         Zoe    4.12                     True
    ''',
             '''\
    id   Name   score                    isEnrolled       Date
    111  Jack   2.17                     True             2013-05-01 12:00:00
    112  Nick   1.21                     False
         Zoe    4.12                     False            2013-05-01 12:00:00''']

    # io.StringIO replaces the Python 2 era io.BytesIO, which rejects str input.
    df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
    df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

... define your diff function ...

    def report_diff(x):
        return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to produce the result:

    my_panel = pd.Panel(dict(df1=df1, df2=df2))
    print(my_panel.apply(report_diff, axis=0))

    #          id  Name        score    isEnrolled                       Date
    #0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
    #1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
    #2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00
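pd.Panel is no longer available in current pandas (it was removed in version 0.25), so the snippet above only runs on older installations. A similar "old | new" report can be sketched without a Panel, assuming both frames carry identical row and column labels. The helper name report_diff_frames and the two small frames are made up for illustration. Unlike the Panel version, this sketch treats a NaN on both sides as unchanged.

```python
import pandas as pd

def report_diff_frames(old, new):
    """Return a frame where changed cells read 'old | new' (hypothetical helper)."""
    # A cell counts as changed unless the values are equal or both missing.
    changed = (old != new) & ~(old.isna() & new.isna())
    joined = old.astype(str) + " | " + new.astype(str)
    # Keep the original value wherever nothing changed.
    return old.astype(object).mask(changed, joined)

old = pd.DataFrame({"score": [2.17, 1.11], "isEnrolled": [True, True]})
new = pd.DataFrame({"score": [2.17, 1.21], "isEnrolled": [True, False]})

report = report_diff_frames(old, new)
print(report)
```

Because the comparison is vectorised over the whole frame, no per-cell Python function is needed.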

By the way, if you are in an IPython Notebook, you might like to use a colored diff function that assigns a color depending on whether the cells are different, equal, or null on the left or right:

    from IPython.display import HTML
    pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
                                           # will limit your HTML strings to 50 characters

    def report_diff(x):
        if x[0] == x[1]:
            # equal: plain text, no color
            return str(x[0])
        elif pd.isnull(x[0]) and pd.isnull(x[1]):
            # both null: green
            return ('<table style="background-color:#00ff00;font-weight:bold;">'
                    '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan'))
        elif pd.isnull(x[0]) and not pd.isnull(x[1]):
            # only the left value is null: yellow
            return ('<table style="background-color:#ffff00;font-weight:bold;">'
                    '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1]))
        elif not pd.isnull(x[0]) and pd.isnull(x[1]):
            # only the right value is null: blue
            return ('<table style="background-color:#0000ff;font-weight:bold;">'
                    '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], 'nan'))
        else:
            # both present but different: red
            return ('<table style="background-color:#ff0000;font-weight:bold;">'
                    '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1]))

    HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))
+16
Apr 15 '14 at 15:57
    import pandas as pd
    import io

    # Fixed-width text; column widths match the `widths` argument below.
    texts = ['''\
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.11                     False                Graduated
    113  Zoe    4.12                     True
    ''',
             '''\
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.21                     False                Graduated
    113  Zoe    4.12                     False                On vacation''']

    # io.StringIO replaces the Python 2 era io.BytesIO, which rejects str input.
    df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
    df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
    df = pd.concat([df1, df2])

    print(df)
    #     id  Name  score isEnrolled               Comment
    # 0  111  Jack   2.17       True  He was late to class
    # 1  112  Nick   1.11      False             Graduated
    # 2  113   Zoe   4.12       True                   NaN
    # 0  111  Jack   2.17       True  He was late to class
    # 1  112  Nick   1.21      False             Graduated
    # 2  113   Zoe   4.12      False           On vacation

    df.set_index(['id', 'Name'], inplace=True)
    print(df)
    #           score isEnrolled               Comment
    # id  Name
    # 111 Jack   2.17       True  He was late to class
    # 112 Nick   1.11      False             Graduated
    # 113 Zoe    4.12       True                   NaN
    # 111 Jack   2.17       True  He was late to class
    # 112 Nick   1.21      False             Graduated
    # 113 Zoe    4.12      False           On vacation

    def report_diff(x):
        # x holds the old and new value of one cell; use positional access
        old, new = x.iloc[0], x.iloc[1]
        return old if old == new else '{} | {}'.format(old, new)

    changes = df.groupby(level=['id', 'Name']).agg(report_diff)
    print(changes)

prints

                    score    isEnrolled               Comment
    id  Name
    111 Jack         2.17          True  He was late to class
    112 Nick  1.11 | 1.21         False             Graduated
    113 Zoe          4.12  True | False     nan | On vacation
+9
Jun 13 '13 at 20:41

This answer simply extends @Andy Hayden's, making it robust when numeric fields are NaN and wrapping it up in a function.

    import pandas as pd
    import numpy as np


    def diff_pd(df1, df2):
        """Identify differences between two pandas DataFrames"""
        assert (df1.columns == df2.columns).all(), \
            "DataFrame column names are different"
        if df1.equals(df2):
            return None
        else:
            # need to account for np.nan != np.nan returning True
            diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
            ne_stacked = diff_mask.stack()
            changed = ne_stacked[ne_stacked]
            changed.index.names = ['id', 'col']
            difference_locations = np.where(diff_mask)
            changed_from = df1.values[difference_locations]
            changed_to = df2.values[difference_locations]
            return pd.DataFrame({'from': changed_from, 'to': changed_to},
                                index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO

    DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.11                     False                "Graduated"
    113  Zoe    NaN                      True                 " "
    """)
    DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.21                     False                "Graduated"
    113  Zoe    NaN                      False                "On vacation"
    """)
    # raw string so that \s is passed to the regex engine untouched
    df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
    df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
    diff_pd(df1, df2)

Output:

                    from           to
    id  col
    112 score       1.11         1.21
    113 isEnrolled  True        False
        Comment           On vacation
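For comparison, pandas 1.1 and later ship a built-in DataFrame.compare method that produces a similar old/new listing for two identically labeled frames and also treats NaN on both sides as equal. The sketch below is not part of the answer above; the small frames are hand-built stand-ins for the question's rosters, using only the columns needed to show the idea.

```python
import numpy as np
import pandas as pd

ids = pd.Index([111, 112, 113], name="id")
df1 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.11, np.nan],
     "isEnrolled": [True, False, True]},
    index=ids,
)
df2 = pd.DataFrame(
    {"Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.21, np.nan],
     "isEnrolled": [True, False, False]},
    index=ids,
)

# Only rows and columns containing a difference are kept; each remaining
# column is split into a 'self' (df1) and an 'other' (df2) sub-column.
diff = df1.compare(df2)
print(diff)
```

Cells that did not change inside a reported row show up as NaN, so the result reads much like the from/to table above, just in wide form.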
+9
Jul 17 '16 at 13:12

Highlighting the difference between two DataFrames

You can use the DataFrame style property to set the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and label each frame with the keys parameter:

    df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
                       axis='columns', keys=['First', 'Second'])
    df_all


It is probably easier to swap the column levels and put the same column names next to each other:

    df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
    df_final


Now it's much easier to spot the differences between the frames. But we can go further and use the style property to highlight the cells that differ. We define a custom function to do this, as described in the styling part of the documentation.

    def highlight_diff(data, color='yellow'):
        attr = 'background-color: {}'.format(color)
        other = data.xs('First', axis='columns', level=-1)
        return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                            index=data.index, columns=data.columns)

    df_final.style.apply(highlight_diff, axis=None)


This will also highlight cells where both frames have missing values, because NaN never compares equal to NaN. You can either fill them or add extra logic so that they do not get highlighted.
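One possible form of that extra logic is sketched below. It assumes the same two-level column layout as df_final (column name on the first level, 'First'/'Second' on the second); the function name highlight_diff_ignore_nan and the tiny frames a and b are made up for illustration. As a design choice, this variant only marks the 'Second' cell of a changed pair.

```python
import numpy as np
import pandas as pd

def highlight_diff_ignore_nan(data, color="yellow"):
    """Style 'Second' cells that differ from 'First', skipping NaN/NaN pairs."""
    attr = "background-color: {}".format(color)
    first = data.xs("First", axis="columns", level=-1)
    second = data.xs("Second", axis="columns", level=-1)
    # A cell counts as changed unless both sides are missing.
    differs = first.ne(second) & ~(first.isna() & second.isna())
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for col in differs.columns:
        styles[(col, "Second")] = np.where(differs[col], attr, "")
    return styles

# Hypothetical data: row 112 is NaN in both frames, row 113 really changed.
a = pd.DataFrame({"score": [1.0, np.nan, 3.0]}, index=[111, 112, 113])
b = pd.DataFrame({"score": [1.0, np.nan, 4.0]}, index=[111, 112, 113])
both = pd.concat([a, b], axis="columns",
                 keys=["First", "Second"]).swaplevel(axis="columns")

styles = highlight_diff_ignore_nan(both)
print(styles)
# In a notebook: both.style.apply(highlight_diff_ignore_nan, axis=None)
```

The function returns a frame of CSS strings with the same shape as its input, which is what Styler.apply expects when called with axis=None.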

+8
Nov 04 '17 at 14:58

If your two data frames have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True marks a value that has changed. From that, you can easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2, axis=1)].
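A small worked example of this idea, using hand-built frames that mirror part of the question's data (the values are taken from the rosters, the construction is only for illustration). It uses the DataFrame's own any(axis=1) method in place of np.any, which plays the same role here.

```python
import pandas as pd

ids = pd.Index([111, 112, 113], name="id")
frame1 = pd.DataFrame(
    {"score": [2.17, 1.11, 4.12], "isEnrolled": [True, False, True]}, index=ids
)
frame2 = pd.DataFrame(
    {"score": [2.17, 1.21, 4.12], "isEnrolled": [True, False, False]}, index=ids
)

# Boolean frame: True wherever a cell differs between the two frames.
changed_cells = frame1 != frame2

# Labels of the rows that contain at least one changed cell.
changedids = frame1.index[changed_cells.any(axis=1).to_numpy()]
print(changedids.tolist())  # → [112, 113]
```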

+4
Jun 13 '13 at 7:23

An extension of @cge's answer (which is pretty cool) that makes the result more readable:

    a[a != b][np.any(a != b, axis=1)].join(
        DataFrame('a<->b', index=a.index, columns=['a<=>b'])
    ).join(
        b[a != b][np.any(a != b, axis=1)],
        rsuffix='_b', how='outer'
    ).fillna('')

Full demo:

    import numpy as np
    from pandas import DataFrame

    a = DataFrame(np.random.randn(7,3), columns=list('ABC'))
    b = a.copy()
    b.iloc[0,2] = np.nan
    b.iloc[1,0] = 7
    b.iloc[3,1] = 77
    b.iloc[4,2] = 777

    # changed cells of a, a separator column, then the matching cells of b
    a[a != b][np.any(a != b, axis=1)].join(
        DataFrame('a<->b', index=a.index, columns=['a<=>b'])
    ).join(
        b[a != b][np.any(a != b, axis=1)],
        rsuffix='_b', how='outer'
    ).fillna('')
+2
Mar 16 '16 at 10:24

Another approach using concat and drop_duplicates:

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO
    import pandas as pd

    DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.11                     False                "Graduated"
    113  Zoe    NaN                      True                 " "
    """)
    DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.21                     False                "Graduated"
    113  Zoe    NaN                      False                "On vacation"
    """)
    # raw string so that \s is passed to the regex engine untouched
    df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
    df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
    #%%
    dictionary = {1: df1, 2: df2}
    df = pd.concat(dictionary)
    # keep=False drops every row that appears in both frames
    df.drop_duplicates(keep=False)

Output:

           Name  score  isEnrolled      Comment
      id
    1 112  Nick   1.11       False    Graduated
      113   Zoe    NaN        True
    2 112  Nick   1.21       False    Graduated
      113   Zoe    NaN       False  On vacation
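A related variant, shown here only as a sketch, lets merge label the side each unmatched row came from via its indicator parameter. The frames below are hand-built stand-ins with id as an ordinary column, and they deliberately contain no NaN so that the example stays simple.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"id": [111, 112, 113],
     "Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.11, 4.12]}
)
df2 = pd.DataFrame(
    {"id": [111, 112, 113],
     "Name": ["Jack", "Nick", "Zoe"],
     "score": [2.17, 1.21, 4.12]}
)

# An outer merge on all shared columns; the added '_merge' column says
# whether a row was found in both frames or only in one of them.
merged = df1.merge(df2, how="outer", indicator=True)
changed = merged[merged["_merge"] != "both"]
print(changed)
```

Rows tagged left_only are the old versions and rows tagged right_only are the new ones, which corresponds to the 1 and 2 keys in the concat output above.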
+2
Mar 07 '17 at 13:14

Here is another way, using select and merge:

    In [6]: # first let's create some dummy dataframes with some column(s) different
       ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
       ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})

    In [7]: df1
    Out[7]:
       a   b   c
    0 -5  10  20
    1 -4  11  21
    2 -3  12  22
    3 -2  13  23
    4 -1  14  24

    In [8]: df2
    Out[8]:
       a   b    c
    0 -5  10   20
    1 -4  11  101
    2 -3  12  102
    3 -2  13  103
    4 -1  14  104

    In [10]: # make a condition over the columns you want to compare
        ...: condition = df1['c'] != df2['c']
        ...:
        ...: # select rows from each dataframe where the condition holds
        ...: diff1 = df1[condition]
        ...: diff2 = df2[condition]

    In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
        ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
    Out[11]:
       a   b  c_before  c_after
    0 -4  11        21      101
    1 -3  12        22      102
    2 -2  13        23      103
    3 -1  14        24      104


0
Jun 07 '17 at 6:46