Text file intersection

How can I calculate the intersection of two text files in terms of the source text? It doesn't matter if the solution uses a shell command or is expressed in Python, Elisp, or other common scripting languages.

I know about comm and grep -Fxv -f file1 file2. Both operate on whole lines, whereas I'm interested in intersecting runs of characters (with a minimum number of characters required to count as a match).

Bonus points for efficiency.

Example

If file 1 contains

 foo bar baz-fee 

and file 2 contains

 fee foo bar-faa 

then I would like to see

  • foo bar
  • fee

assuming a minimum match length of 3.

+4
3 answers

You are looking for the Python difflib module (in the standard library) and, in particular, difflib.SequenceMatcher .
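As a rough illustration of how SequenceMatcher could be applied to the example in the question, here is a minimal sketch. It does not use get_matching_blocks(), because that only reports matches that appear in the same order in both inputs and so would miss "fee". Instead it calls find_longest_match() repeatedly and masks each match before searching again. The function name common_substrings, the min_len parameter, and the use of "\0" and "\1" as mask characters are assumptions made for this sketch; it presumes neither input contains those two control characters.

```python
from difflib import SequenceMatcher


def common_substrings(a, b, min_len=3):
    """Greedily collect common substrings of a and b, longest first.

    Hypothetical helper: assumes neither string contains "\0" or "\1".
    """
    found = []
    while True:
        # autojunk=False turns off the heuristic that ignores very
        # frequent elements in long sequences.
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_len:
            break
        found.append(a[m.a:m.a + m.size])
        # Mask the matched span with a different character in each
        # string so the same region cannot match again.
        a = a[:m.a] + "\0" * m.size + a[m.a + m.size:]
        b = b[:m.b] + "\1" * m.size + b[m.b + m.size:]
    return found


if __name__ == "__main__":
    for s in common_substrings("foo bar baz-fee", "fee foo bar-faa"):
        print(s)
```

For the two example strings (without surrounding whitespace) this yields "foo bar" and then "fee". To use it on files, read each file into a single string first. Every pass builds a new matcher and the search itself can be quadratic in the worst case, so this is a starting point for small inputs rather than an efficient solution.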

+7

OK, here is a very simple Python script to do this.

It may be crude, but it should do the job.

temp.txt

xx yy xyz zz aa
xx yy xyz zz aa
xx yy xyz zz aa
xx yy 111 aa cc

temp2.txt

yy aa cc dd
ff xx ee 11
oo mm aa tt

common.py

    #!/usr/bin/env python3
    import sys


    def main():
        f1, f2 = tryOpen(sys.argv[1], sys.argv[2])
        commonWords(f1, f2)
        f1.close()
        f2.close()


    def tryOpen(fn1, fn2):
        try:
            f1 = open(fn1, 'r')
            f2 = open(fn2, 'r')
            return f1, f2
        except Exception as e:
            print('Oh No! => %s' % e)
            # Unix programs generally use 2 for command line syntax
            # errors and 1 for all other kinds of errors.
            sys.exit(2)


    def commonWords(f1, f2):
        # Collect every whitespace-separated word of the first file;
        # a set keeps the membership test below fast.
        words = set()
        for line in f1:
            for word in line.strip().split():
                words.add(word)
        # Report each word of the second file that also occurs in the first.
        for line in f2:
            for word in line.strip().split():
                if word in words:
                    print('common word found => %s' % word)


    if __name__ == '__main__':
        main()

Output

    ./common.py temp.txt temp2.txt
    common word found => yy
    common word found => aa
    common word found => cc
    common word found => xx
    common word found => aa

+1

You could try experimenting with the options for diff: http://ss64.com/bash/diff.html

I still do not understand exactly what you are asking. What counts as a word in your definition? And how is the intersection defined here?

0