Create a "fuzzy" difference of two files in Python, with an approximate comparison of floats

Question


I have a problem comparing two files. Basically, what I want to do is a UNIX-like diff between two files, for example:

$ diff -u left-file right-file

However, my two files contain floats, and because they were generated on different architectures (while computing the same things), the floating-point values are not exactly the same (they may differ by, for example, 1e-10). When I diff the files, I only want to see what I consider significant differences (for example, a difference greater than 1e-4), but with the UNIX diff command almost every line containing a floating-point value is reported as different! So my question is: how can I get a diff like the one from diff -u, but with a less strict comparison of floats?

I thought I would write a Python script to do this, and found the difflib module, which provides diff-like comparison. But the documentation I found explains how to use it as is (through a single method) and describes the internal objects; I cannot find anything about how to customize a difflib object to meet my needs (for example, by overriding only the comparison method). I suppose one solution would be to generate a unified diff and parse it manually to remove my "false" differences, but this is not elegant; I would prefer to use the existing infrastructure.

So, does anyone know how to customize this library so that I can do what I'm looking for? Or could you at least point me in the right direction? If not in Python, maybe a shell script could do the job?

Any help would be greatly appreciated! Thank you in advance for your answers!

+7

python floating-point fuzzy-comparison

piwi Jun 24 '10 at 8:23


1 answer

smci · Answer 1 · 2011-07-03T01:07:15+0000

In your case, we specialize the general case: before passing anything to difflib, we need to detect lines containing floats and handle them separately. Here is a basic approach; if you want to generate deltas, context lines, etc., you can build on it. Note that it is easier to fuzzy-compare floats as actual floats, not as strings (although you could code it column by column and ignore characters after the 1e-4 digit).

    import re

    # Matches plain decimal floats such as 558.11, -0.86 or +.5 (no exponent notation).
    float_pat = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)')

    def fuzzy_float_cmp(f1, f2, epsilon=1e-4):
        """Fuzzy-compare two strings representing floats."""
        return abs(float(f1) - float(f2)) < epsilon

    def fuzzydiffer(line1, line2):
        """Perform fuzzy-diff on floats, else normal diff."""
        floats1 = list(float_pat.finditer(line1))
        if not floats1:
            pass  # run your usual diff()
        else:
            floats2 = list(float_pat.finditer(line2))
            # finditer gives the actual column of each float, even if a value repeats
            for m1, m2 in zip(floats1, floats2):
                if not fuzzy_float_cmp(m1.group(), m2.group()):
                    print("Lines mismatch at col %d" % m1.start(), line1, line2)
            # or use a generator expression like
            # all(fuzzy_float_cmp(f1, f2) for f1, f2 in zip(float_pat.findall(line1), float_pat.findall(line2)))
            # return match

Some tests:

    fuzzydiffer('text: 558.113509766 +23477547.6407 -0.867086648057 0.009291785451',
                'text: 558.11351 +23477547.6406 -0.86708665 0.009292000001')
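One thing the test shows: the second pair of values differs by about 1e-4 on a number of magnitude 2e7, which sits right at the edge of a purely absolute epsilon. If that matters for your data, a possible variation (only a sketch; the function name and the default tolerances below are my own assumptions, not something from the question) is to combine a relative and an absolute tolerance with the standard library's math.isclose:

```python
import math

def fuzzy_float_cmp_rel(f1, f2, rel_tol=1e-9, abs_tol=1e-4):
    """Hypothetical variant of fuzzy_float_cmp using relative and absolute tolerance.

    Two values match when |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol).
    The default tolerances are illustrative assumptions; tune them to your data.
    """
    return math.isclose(float(f1), float(f2), rel_tol=rel_tol, abs_tol=abs_tol)

# Large values are judged relative to their magnitude, small ones by abs_tol.
print(fuzzy_float_cmp_rel('23477547.6407', '23477547.6406'))
print(fuzzy_float_cmp_rel('0.5', '0.6'))
```

It takes the same string arguments as fuzzy_float_cmp, so it can be swapped in without touching fuzzydiffer.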

As a bonus, here is a version that highlights the differing columns:

    import re

    # Matches plain decimal floats such as 558.11, -0.86 or +.5 (no exponent notation).
    float_pat = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)')

    def fuzzydiffer(line1, line2):
        """Perform fuzzy-diff on floats, else normal diff."""
        floats1 = list(float_pat.finditer(line1))
        if not floats1:
            pass  # run your usual diff()
        else:
            match = True
            coldiffs1 = ' ' * len(line1)
            coldiffs2 = ' ' * len(line2)
            floats2 = list(float_pat.finditer(line2))
            for m1, m2 in zip(floats1, floats2):
                f1, f2 = m1.group(), m2.group()
                # start/end columns of each float within its own line
                col1s, col1e = m1.span()
                col2s, col2e = m2.span()
                if not fuzzy_float_cmp(f1, f2):
                    match = False
                    coldiffs1 = coldiffs1[:col1s] + ('v' * len(f1)) + coldiffs1[col1e:]
                    coldiffs2 = coldiffs2[:col2s] + ('^' * len(f2)) + coldiffs2[col2e:]
                    # break  # if you only need to highlight the first mismatch
            if not match:
                print('Lines mismatch:')
                print('  ', coldiffs1)
                print('< ', line1)
                print('> ', line2)
                print('  ', coldiffs2)
            # or use a generator expression with all()
            # return match

    def fuzzy_float_cmp(f1, f2, epsilon=1e-4):
        """Fuzzy-compare two strings representing floats."""
        print("Comparing:", f1, f2)
        float1, float2 = float(f1), float(f2)
        return abs(float1 - float2) < epsilon
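If you would rather have difflib produce the unified diff itself, one possible way to build on the above is to wrap each line in a str subclass whose equality is fuzzy, and hand those objects to difflib.unified_diff. This is only a sketch: FuzzyLine, FLOAT_RE and fuzzy_unified_diff are hypothetical names of mine, and it relies on the assumption that difflib matches lines through hashing and ==. Fuzzy equality is not transitive, so treat it as a starting point rather than a robust tool.

```python
import difflib
import re

# Assumption: floats are plain decimals (no exponent notation).
FLOAT_RE = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)')

class FuzzyLine(str):
    """Hypothetical helper: equal to another line if the text matches and floats are close."""
    epsilon = 1e-4

    def _parts(self):
        text = str(self)  # plain str, so hashing the skeleton does not recurse
        # Replace every float with a NUL placeholder (assumed absent from the files).
        skeleton = FLOAT_RE.sub('\x00', text)
        values = [float(tok) for tok in FLOAT_RE.findall(text)]
        return skeleton, values

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        skel1, vals1 = self._parts()
        skel2, vals2 = FuzzyLine(other)._parts()
        return skel1 == skel2 and all(
            abs(x - y) < self.epsilon for x, y in zip(vals1, vals2))

    def __hash__(self):
        # Lines that can compare equal share a skeleton, hence the same hash.
        return hash(self._parts()[0])


def fuzzy_unified_diff(left_lines, right_lines, **kwargs):
    """Unified diff in which lines with nearly equal floats count as unchanged."""
    a = [FuzzyLine(line) for line in left_lines]
    b = [FuzzyLine(line) for line in right_lines]
    return list(difflib.unified_diff(a, b, **kwargs))


left = ['x 1.0000000001\n', 'y 2.0\n']
right = ['x 1.0\n', 'y 2.5\n']
print(''.join(fuzzy_unified_diff(left, right,
                                 fromfile='left-file', tofile='right-file')))
```

For real files you would pass the result of readlines() for each file. Lines reported as unchanged are printed as they appear in the left file.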
