Find common items in more than two files

I have three files as shown below

file1.txt

    "aba" 0 0
    "aba" 0 0 1
    "abc" 0 1
    "abd" 1 1
    "xxx" 0 0

file2.txt

    "xyz" 0 0
    "aba" 0 0 0 0
    "aba" 0 0 0 1
    "xxx" 0 0
    "abc" 1 1

file3.txt

    "xyx" 0 0
    "aba" 0 0
    "aba" 0 1 0
    "xxx" 0 0 0 1
    "abc" 1 1

I want to find similar elements in all three files based on the first two columns. To find similar elements in two files, I used something like

 awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

But how can we find similar elements in all files when the input files are more than 2? Can anyone help?

The current awk solution ignores rows whose key columns are duplicated and gives only:

    "xxx" 0 0

Assuming the output comes from file1.txt, the expected result is:

    "aba" 0 0
    "aba" 0 0 1
    "xxx" 0 0

i.e., it should also return the rows whose key columns are duplicated.

3 answers

Try the following generic solution for N files. It stores the keys of the first file in an array with a value of 1, and each hit from the following files increments that value. At the end, I check whether the value of each key matches the number of processed files and print only the keys that match.

    awk '
        FNR == NR { arr[$1,$2] = 1; next }
        { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
        END {
            for ( key in arr ) {
                if ( arr[key] != ARGC - 1 ) { continue }
                split( key, key_arr, SUBSEP )
                printf "%s %s\n", key_arr[1], key_arr[2]
            }
        }
    ' file{1..3}

This gives:

    "xxx" 0
    "aba" 0

EDIT to add a version that prints the entire line (see comments). I added another array with the same key, where I save the line, and I use it in the printf call. I commented out the old code.

    awk '
        ##FNR == NR { arr[$1,$2] = 1; next }
        FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
        { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
        END {
            for ( key in arr ) {
                if ( arr[key] != ARGC - 1 ) { continue }
                ##split( key, key_arr, SUBSEP )
                ##printf "%s %s\n", key_arr[1], key_arr[2]
                printf "%s\n", line[ key ]
            }
        }
    ' file{1..3}

NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically, I join all the records instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. When printing, I split the joined string on the same delimiter (the SUBSEP variable) and print each entry. The found array, reset at the start of every file, makes sure each key is counted only once per file.

    awk '
        FNR == NR {
            arr[$1,$2] = 1
            line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
            next
        }
        FNR == 1 { delete found }
        {
            if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 }
        }
        END {
            num_files = ARGC -1
            for ( key in arr ) {
                if ( arr[key] < num_files ) { continue }
                split( line[ key ], line_arr, SUBSEP )
                for ( i = 1; i <= length( line_arr ); i++ ) {
                    printf "%s\n", line_arr[ i ]
                }
            }
        }
    ' file{1..3}

With the new data edited into the question, it gives:

    "xxx" 0 0
    "aba" 0 0
    "aba" 0 0 1
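
As a possible variant (not taken from the answer above), the files can also be read in a different order: every other file first, and file1.txt last. The sketch below assumes that none of the files is empty and that the argument list contains only file names; the temporary directory and the names fileno, seen, count and common are purely illustrative. It counts each key at most once per file and then prints the matching lines of file1.txt in their original order, using the sample data from the question.

```shell
# Recreate the sample files from the question in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
"aba" 0 0
"aba" 0 0 1
"abc" 0 1
"abd" 1 1
"xxx" 0 0
EOF
cat > "$tmpdir/file2.txt" <<'EOF'
"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1
EOF
cat > "$tmpdir/file3.txt" <<'EOF'
"xyx" 0 0
"aba" 0 0
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1
EOF

# file1.txt goes last: the earlier files are only counted, the last one is printed.
common=$(cd "$tmpdir" && awk '
    FNR == 1 { fileno++ }                # a new input file starts
    fileno < ARGC - 1 {                  # any file except the last one
        if ( seen[$1,$2] != fileno ) {   # count each key once per file
            seen[$1,$2] = fileno
            count[$1,$2]++
        }
        next
    }
    count[$1,$2] == ARGC - 2             # last file: key was found in all others
' file2.txt file3.txt file1.txt)
printf '%s\n' "$common"
rm -rf "$tmpdir"
```

Because the lines are printed while file1.txt is being read, no END block is needed and the output order does not depend on how awk iterates over an array.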

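This Python 3 script lists, for each file, the lines whose first two columns are common to all files:

    import sys
    from functools import reduce  # reduce is not a builtin in Python 3

    def key_of(line):
        # The key is the first two whitespace-separated columns.
        return " ".join(line.split()[0:2])

    # Build one set of keys per input file.
    key_sets = []
    for name in sys.argv[1:]:
        with open(name) as f:
            key_sets.append({key_of(line) for line in f})

    # Keep only the keys that are present in every file.
    common_keys = reduce(lambda s1, s2: s1 & s2, key_sets)

    for name in sys.argv[1:]:
        print("Common lines in", name)
        with open(name) as f:
            for line in f:
                if key_of(line) in common_keys:
                    print(line, end="")

Usage: python3 script.py file1 file2 file3 ...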


For three files, all you need is:

 awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt 

The FNR==NR condition is true only while the first file in the argument list is being read, and the next statement in that block skips the rest of the code. The pattern ($1,$2) in a is therefore evaluated only for the remaining files, and it prints every line whose first two columns were seen in file1.txt. To process more files this way, all you have to do is list them.
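
To see why FNR==NR singles out the first file, it can help to print both counters. This is only a small sketch with two made-up files (one.txt and two.txt are hypothetical names, as is the fnr_demo variable): NR keeps counting across all inputs, while FNR restarts at 1 for each file.

```shell
# Create two small sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\n' > "$tmpdir/two.txt"

# Print the file name, NR, FNR, and whether FNR == NR holds for each record.
fnr_demo=$(cd "$tmpdir" && awk '
    { print FILENAME, NR, FNR, ( FNR == NR ? "first" : "later" ) }
' one.txt two.txt)
printf '%s\n' "$fnr_demo"
rm -rf "$tmpdir"
```

Only the records of one.txt are tagged "first". Note that this relies on the first file being non-empty: with an empty first file, FNR==NR would also hold while the second file is read.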


If you need more powerful filename expansion, use bash's extglob option. You can enable it with shopt -s extglob and disable it with shopt -u extglob. For example, to compare file1.txt against every other file in the current directory:

 awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt) 

If the files are hard to select with a glob, use find. For example:

 awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt") 

I assume that you are looking for a way to specify a range of 'N' files, which brace expansion provides. For example:

 awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt 