File intersection

I have two large files (27k lines and 450k lines). They look something like this:

File1:

 1 2 A 5
 3 2 B 7
 6 3 C 8
 ...

File2:

 4 2 C 5
 7 2 B 7
 6 8 B 8
 7 7 F 9
 ...

I need the lines from both files whose third column appears in both files (note that the lines with A and F were excluded):

OUTPUT:

 3 2 B 7
 6 3 C 8
 4 2 C 5
 7 2 B 7
 6 8 B 8

What is the best way?

+6
4 answers
 awk '{print $3}' file1 | sort | uniq > file1col3
 awk '{print $3}' file2 | sort | uniq > file2col3
 grep -Fx -f file1col3 file2col3 | awk '{print "\\w+ \\w+ " $1 " \\w+"}' > col3regexp
 egrep -xh -f col3regexp file1 file2

This captures the unique column-3 values of each file, intersects them (using grep -Fx), prints a set of regular expressions that will match the required lines, and then uses egrep to extract those lines from the two files.
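With the question's sample data (and assuming GNU grep, whose egrep syntax understands \w), file1col3 holds A, B, C and file2col3 holds B, C, F, so col3regexp should end up containing:

 \w+ \w+ B \w+
 \w+ \w+ C \w+

The final egrep -xh then prints the B and C lines from both files, which is the OUTPUT requested above.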

+2

First we sort the files by the third field:

 sort -k 3 file1 > file1.sorted
 sort -k 3 file2 > file2.sorted

Then we get the values common to the third field of both files using comm:

 comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field 
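With the sample data, the common third-column values are B and C, so common_values.field should contain:

 B
 C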

Now we can join each sorted file by common values:

 join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
 join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

The output format is specified explicitly so that we get the same field order as in the original files. Only standard Unix tools are used: sort, comm, cut, uniq, join. The <( ) process substitution works in bash; with other shells you can use temporary files instead.
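For the sample data, file.joined should come out as follows (file1's matches first, then file2's, each group ordered by the sort on column 3):

 3 2 B 7
 6 3 C 8
 7 2 B 7
 6 8 B 8
 4 2 C 5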

+3

This approach uses grep, sed, and cut.

Extract column 3:

 cut -d' ' -f3 file1 > f1c
 cut -d' ' -f3 file2 > f2c

Find the relevant lines in file1:

 grep -nFf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out 
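To see what this pipeline does with the sample data: grep finds which lines of f1c match a value from f2c, and sed turns the line numbers into a print script:

 $ grep -nFf f2c f1c
 2:B
 3:C
 $ grep -nFf f2c f1c | cut -d: -f1 | sed 's/$/p/'
 2p
 3p

The final sed -n -f - file1 then prints lines 2 and 3 of file1, i.e. 3 2 B 7 and 6 3 C 8.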

Find the relevant lines in file2:

 grep -nFf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out 

Output:

 3 2 B 7
 6 3 C 8
 4 2 C 5
 7 2 B 7
 6 8 B 8

Update

If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution will be quite efficient:

parse.awk

 FNR == NR {
   a[$3] = $0
   p[$3] = 1
   next
 }

 a[$3]

 p[$3] {
   print a[$3]
   delete p[$3]
 }

Run it as follows:

 awk -f parse.awk file1 file2 

Where file1 is the smaller of the two.

Explanation

  • The FNR == NR block reads file1 into two hashes: a maps each column-3 value to its line, and p marks that value as not yet printed.
  • a[$3] prints the current line of file2 if $3 is a key in a.
  • p[$3] prints the stored file1 line (a[$3]) if $3 is a key in p, and deletes the key so it is printed only once. A sample run on the question's data is shown below.
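For the question's sample files (with file1 as the smaller one), this should print the following; the order differs from the OUTPUT block above because each match is emitted as file2 is read:

 4 2 C 5
 6 3 C 8
 7 2 B 7
 3 2 B 7
 6 8 B 8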
+3

First get the values common to the third column of both files. Then filter the rows of both files whose third column is one of those values.

If the columns are separated by a single character, you can use cut to extract a single column. For columns that can be separated by any number of spaces, use awk. One way to get the common values of column 3 is to extract the columns, sort them, and call comm. Using bash/ksh/zsh process substitutions:

 comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) 
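As an aside, this is why awk '{print $3}' is used here rather than cut: with a hypothetical line containing a run of spaces (not from the question's data), cut treats every single space as a separator, while awk splits on runs of whitespace:

 $ printf '1  2 A 5\n' | cut -d' ' -f3
 2
 $ printf '1  2 A 5\n' | awk '{print $3}'
 A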

Now turn them into grep patterns and filter.

 comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) |
 sed -e 's/[][.\|?*+^$]/\\&/g' \
     -e 's/.*/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+&[[:space:]]/' |
 grep -E -f - file1 file2
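With the sample data, the patterns fed to grep -E should be:

 ^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+B[[:space:]]
 ^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+C[[:space:]]

Because grep is given two file names, each matching line is prefixed with the file it came from; add -h if you want the bare lines.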

The method above should work well with huge files. But 500 thousand lines is not huge: these files fit comfortably in memory, so a simple Perl solution will do. Load both files, record which files each column-3 value was seen in, then print the lines whose value was seen in both.

 perl -n -e '
     push @lines, $_;
     $c = (split)[2];
     $seen{$c}{$ARGV} = 1;
     END {
         foreach (@lines) {
             $c = (split)[2];
             print if keys(%{$seen{$c}}) == 2;
         }
     }' file1 file2
+1

Source: https://habr.com/ru/post/925486/

