Awk / sed / grep to delete lines matching fields in another file

I have a file1 with a few tens of lines and a much longer file2 (~500,000 lines). The lines in the two files are not identical, but they share a subset of fields. I want to take fields 3-5 from each line of file1 and search file2 for the same combination (just those three fields, in the same order; in file2 they appear as fields 2-4). If a match is found, I want to delete the corresponding line from file1.

For example, file1:

 2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
 2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
 2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
 2016-01-06T07:53:50 2016-01-06T07:52:14 2016006 090E A TM Current
 2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

file2:

 2016-01-06T07:35:06.87 2016003 100E C NN Current 0
 2016-01-06T07:35:09.97 2016003 100E B TM Current 6303
 2016-01-06T07:36:23.12 2016004 030N C TM Current 0
 2016-01-06T07:37:57.36 2016006 090E A TM Current 399
 2016-01-06T07:40:29.61 2016006 010N C TM Current 0

... (and so on, for ~500,000 lines)

So, in this case, I want to delete the fourth line of file1 (in place).

I can find the lines that I want to delete with:

 grep "$(awk '{print $3,$4,$5}' file1)" file2 

So one solution might be to pipe this output to sed, but I don't understand how to give sed a match pattern from a nested stream. An internet search suggests that awk could probably do all of this in one step (or maybe sed or something else), so I'm wondering what a clean solution would look like.

In addition, speed is somewhat important, because other processes may try to modify the files while this runs (I know that can cause further complications ...). Matches will usually be found near the end of file2 rather than the beginning, in case there is a way to search file2 from the bottom up.
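For reference, sed can read its editing script from a stream, so the pattern list can be generated on the fly with awk. A minimal sketch, assuming GNU sed (for `-f -`) and tiny hypothetical sample files with the layout from the question:

```shell
cd "$(mktemp -d)"
# Hypothetical samples: keys are fields 3-5 in file1, fields 2-4 in file2.
printf '%s\n' \
  'a b 2016006 090E A TM Current' \
  'c d 2016006 120E A TM Current' > file1
printf '%s\n' \
  'x 2016006 090E A TM Current 399' > file2

# Turn each file2 line into a sed delete command such as
#   \#2016006 090E A#d
# ("#" as the address delimiter), and feed that script to sed on stdin.
awk '{ print "\\#" $2 " " $3 " " $4 "#d" }' file2 | sed -f - file1
```

Note that sed applies every pattern to every line, so with 500,000 patterns this is far slower than a single hash lookup per line in awk.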

2 answers
 $ awk 'NR==FNR{file2[$2,$3,$4]; next} !(($3,$4,$5) in file2)' file2 file1
 2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
 2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
 2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
 2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

The fact that file2 contains 500,000 lines should not be a problem for awk in terms of memory or execution speed - it should finish in about a second or less, even in the worst case.

As with any UNIX command, to overwrite the original file you just do:

 cmd file > tmp && mv tmp file 

so in this case:

 awk '...' file2 file1 > tmp && mv tmp file1 
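Putting the two pieces together, here is a minimal runnable sketch on tiny hypothetical sample files (the array name `seen` and the sample data are made up; the field layout matches the question):

```shell
cd "$(mktemp -d)"
# Hypothetical samples: keys are fields 3-5 in file1, fields 2-4 in file2.
printf '%s\n' \
  'a b 2016006 090E A TM Current' \
  'c d 2016006 120E A TM Current' > file1
printf '%s\n' \
  'x 2016006 090E A TM Current 399' > file2

# Pass 1 (NR==FNR is true only while reading file2): record each key.
# Pass 2: print only file1 lines whose ($3,$4,$5) key was never seen.
awk 'NR==FNR{seen[$2,$3,$4]; next} !(($3,$4,$5) in seen)' file2 file1 > tmp &&
  mv tmp file1

cat file1    # -> c d 2016006 120E A TM Current
```

The `&&` matters: file1 is only replaced if awk exits successfully, so a failure does not clobber the original.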

You can find the non-matching lines in file1:

 $ grep -v -F -f <(awk '{ print $2,$3,$4 }' file2) file1
 2016-01-06T05:38:31 2016-01-06T05:23:33 2016006 120E A TM Current
 2016-01-06T07:34:01 2016-01-06T07:01:51 2016006 090E B TM Current
 2016-01-06T07:40:44 2016-01-06T07:40:41 2016006 080E A TM Alt
 2016-01-06T08:14:45 2016-01-06T08:06:33 2016006 080E C TM Current

Just redirect this to a temporary file and overwrite file1 with it afterwards.
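The same idea as a runnable sketch on hypothetical tiny samples, writing the key list to a temporary file instead of using `<(...)` so it also runs under plain sh. Note the key fields are 2-4 in file2, and that `-F` matches each key anywhere in the line as a fixed substring:

```shell
cd "$(mktemp -d)"
printf '%s\n' \
  'a b 2016006 090E A TM Current' \
  'c d 2016006 120E A TM Current' > file1
printf '%s\n' \
  'x 2016006 090E A TM Current 399' > file2

# Build one "field2 field3 field4" key per file2 line, then drop (-v) every
# file1 line containing any key as a fixed string (-F).
awk '{ print $2,$3,$4 }' file2 > keys
grep -v -F -f keys file1 > tmp && mv tmp file1

cat file1    # -> c d 2016006 120E A TM Current
```

Because `-F` matches substrings rather than whole fields, a key could in principle match the wrong columns of a file1 line; the awk approach avoids that by comparing exact field tuples.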
