File intersection

I have two large files (27k lines and 450k lines). They look something like this:

File1:

 1 2 A 5
 3 2 B 7
 6 3 C 8
 ...

File2:

 4 2 C 5
 7 2 B 7
 6 8 B 8
 7 7 F 9
 ...

I need the lines from both files whose third column appears in both files (note that the lines with A and F were excluded):

OUTPUT:

 3 2 B 7
 6 3 C 8
 4 2 C 5
 7 2 B 7
 6 8 B 8

What is the best way?

+6
4 answers
 awk '{print $3}' file1 | sort | uniq > file1col3
 awk '{print $3}' file2 | sort | uniq > file2col3
 grep -Fx -f file1col3 file2col3 | awk '{print "\\w+ \\w+ " $1 " \\w+"}' > col3regexp
 egrep -xh -f col3regexp file1 file2

This captures the unique column-3 values of each file, intersects them (using grep -Fx), prints a set of regular expressions that will match the required lines, and then uses egrep to extract those lines from the two files.
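With the question's sample data (and assuming GNU grep, whose egrep syntax understands \w), file1col3 holds A, B, C and file2col3 holds B, C, F, so col3regexp should end up containing:

 \w+ \w+ B \w+
 \w+ \w+ C \w+

The final egrep -xh then prints the B and C lines from both files, which is the OUTPUT requested above.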

+2

First we sort the files by the third field:

 sort -k 3 file1 > file1.sorted
 sort -k 3 file2 > file2.sorted

Then we get the values common to the third field of both files using comm:

 comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field 
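With the sample data, the common third-column values are B and C, so common_values.field should contain:

 B
 C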

Now we can join each sorted file by common values:

 join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
 join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

The output format is specified explicitly so that we get the same field order as in the original files. Only standard Unix tools are used: sort, comm, cut, uniq, join. The <( ) process substitution works in bash; with other shells you can use temporary files instead.
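For the sample data, file.joined should come out as follows (file1's matches first, then file2's, each group ordered by the sort on column 3):

 3 2 B 7
 6 3 C 8
 7 2 B 7
 6 8 B 8
 4 2 C 5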

+3

This approach uses grep, sed, and cut.

Extract column 3:

 cut -d' ' -f3 file1 > f1c
 cut -d' ' -f3 file2 > f2c

Find the relevant lines in file1:

 grep -nFf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out 
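To see what this pipeline does with the sample data: grep finds which lines of f1c match a value from f2c, and sed turns the line numbers into a print script:

 $ grep -nFf f2c f1c
 2:B
 3:C
 $ grep -nFf f2c f1c | cut -d: -f1 | sed 's/$/p/'
 2p
 3p

The final sed -n -f - file1 then prints lines 2 and 3 of file1, i.e. 3 2 B 7 and 6 3 C 8.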

Find the relevant lines in file2:

 grep -nFf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out 

Output:

 3 2 B 7
 6 3 C 8
 4 2 C 5
 7 2 B 7
 6 8 B 8

Update

If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution will be quite efficient:

parse.awk

 FNR == NR {
   a[$3] = $0
   p[$3] = 1
   next
 }

 a[$3]

 p[$3] {
   print a[$3]
   delete p[$3]
 }

Run it as follows:

 awk -f parse.awk file1 file2 

Where file1 is the smaller of the two.

Explanation

  • The FNR == NR block reads file1 into two hashes: a maps each column-3 value to its line, and p marks that value as not yet printed.
  • a[$3] prints the current line of file2 if $3 is a key in a.
  • p[$3] prints the stored file1 line (a[$3]) if $3 is a key in p, and deletes the key so it is printed only once. A sample run on the question's data is shown below.
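For the question's sample files (with file1 as the smaller one), this should print the following; the order differs from the OUTPUT block above because each match is emitted as file2 is read:

 4 2 C 5
 6 3 C 8
 7 2 B 7
 3 2 B 7
 6 8 B 8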
+3

First get the values common to the third column of both files. Then filter the rows of both files whose third column is one of those values.

If the columns are separated by a single character, you can use cut to extract a single column. For columns that can be separated by any number of spaces, use awk. One way to get the common values of column 3 is to extract the columns, sort them, and call comm. Using bash/ksh/zsh process substitutions:

 comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) 
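As an aside, this is why awk '{print $3}' is used here rather than cut: with a hypothetical line containing a run of spaces (not from the question's data), cut treats every single space as a separator, while awk splits on runs of whitespace:

 $ printf '1  2 A 5\n' | cut -d' ' -f3
 2
 $ printf '1  2 A 5\n' | awk '{print $3}'
 A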

Now turn them into grep patterns and filter.

 comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) |
 sed -e 's/[][.\|?*+^$]/\\&/g' \
     -e 's/.*/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+&[[:space:]]/' |
 grep -E -f - file1 file2
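With the sample data, the patterns fed to grep -E should be:

 ^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+B[[:space:]]
 ^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+C[[:space:]]

Because grep is given two file names, each matching line is prefixed with the file it came from; add -h if you want the bare lines.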

The method above should work well with huge files. But 500 thousand lines is not huge: these files fit comfortably in memory, so a simple Perl solution will do. Load both files, record which files each column-3 value was seen in, then print the lines whose value was seen in both.

 perl -n -e '
     push @lines, $_;
     $c = (split)[2];
     $seen{$c}{$ARGV} = 1;
     END {
         foreach (@lines) {
             $c = (split)[2];
             print if keys(%{$seen{$c}}) == 2;
         }
     }' file1 file2
+1

Source: https://habr.com/ru/post/925486/

