How to check if one file is part of another?

Question

I need to check, from a bash script, whether one file is contained in another, given a multi-line pattern and an input file.

Return value:

I want to get the status (as in the grep command) 0 if a match is found, 1 if no matches are found.

Pattern:

multi-line,
the order of the lines matters (the pattern is treated as one block of lines),
may include characters such as digits, letters, ?, &, *, #, etc.

Explanation

Only the following examples should find a match:

    pattern   file1   file2   file3   file4
    222       111     111     222     222
    333       222     222     333     333
              333     333             444
              444

The following should not match:

 pattern file1 file2 file3 file4 file5 file6 file7 222 111 111 333 *222 111 111 222 333 *222 222 222 *333 222 222 333 333* 444 111 333 444 333 333

Here is my script:

    #!/bin/bash
    function writeToFile {
        if [ -w "$1" ] ; then
            echo "$2" >> "$1"
        else
            echo -e "$2" | sudo tee -a "$1" > /dev/null
        fi
    }

    function writeOnceToFile {
        pcregrep --color -M "$2" "$1"
        #echo $?
        if [ $? -eq 0 ]; then
            echo This file contains text that was added previously
        else
            writeToFile "$1" "$2"
        fi
    }

    file=file.txt
    #1?1
    #2?2
    #3?3
    #4?4

    pattern=`cat pattern.txt`
    #2?2
    #3?3

    writeOnceToFile "$file" "$pattern"

I can use grep on each line of the pattern, but that fails in this example:

    file.txt
    #1?1
    #2?2
    #=== added line
    #3?3
    #4?4

    pattern.txt
    #2?2
    #3?3

or even if you swap lines 2 and 3:

    file=file.txt
    #1?1
    #3?3
    #2?2
    #4?4

It returns 0 when it should not.

How can I fix this? Note that I would prefer to use programs that are installed by default (so without pcregrep, if possible). Maybe sed or awk can solve this problem?

+8

command-line linux bash pcregrep

user51390233 Jul 21 '15 at 13:45

3 answers

I would just use diff for this task:

 diff pattern <(grep -f file pattern)

Explanation

diff file1 file2 tells you if the two files are different.
With grep -f file pattern, you see which lines of pattern occur in file.

So what you do is check which lines of pattern are in file, and then compare that to pattern itself. If they are identical, pattern is a subset of file.
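As a rough sketch, the same idea can be wrapped in a function that returns a grep-like status. The name is_subset is hypothetical, and the -F and -x flags are an addition not present in the one-liner above: they make grep compare whole lines as fixed strings instead of regular expressions. Like the one-liner, this only checks that every line of the pattern occurs somewhere in the file; it does not check order or adjacency.

```shell
# is_subset PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if every line of PATTERN_FILE also occurs as a whole line
# in TARGET_FILE, non-zero otherwise. Order and adjacency are NOT checked.
is_subset() {
    local pattern_file=$1 target_file=$2
    # grep -f reads its "patterns" from TARGET_FILE; -F treats them as fixed
    # strings and -x requires whole-line matches. The output is the subset of
    # PATTERN_FILE lines found in TARGET_FILE, which diff compares to the original.
    grep -Fxf "$target_file" "$pattern_file" | diff "$pattern_file" - > /dev/null
}
```

For example, is_subset pattern.txt file.txt && echo found would print found only when no pattern line is missing from file.txt.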

Test

seq 10 is part of seq 20. Let's check:

    $ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
    $

seq 10 is not entirely contained in seq 2 20 (1 is missing from the second):

    $ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
    Files /dev/fd/63 and /dev/fd/62 differ

+4

fedorqui Jul 21 '15 at 14:11

I looked at the question again, and I think awk can handle this better:

 awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1 {for (i in a) len++} {for (i=last; i<=len; i++) { if (a[i]==$0) {last=i; next} } status=1} END {print status+0}' file pattern

The idea is this:

- Read the whole of file into memory, in the array a[line_number] = line.
- Count the elements in the array.
- Loop through pattern and check whether the current line occurs in file anywhere between the cursor position and the end of file. If it does, move the cursor to the position where it was found. If it does not, set the status to 1; that is, there was a line in pattern that did not occur in file after the previous match.
- Print the status, which is 0 unless it was previously set to 1.
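The steps above can also be sketched as a shell function that exits with the status instead of printing it, which is closer to what the question asks for. The name lines_in_order is hypothetical. Two details are additions of this sketch: the cursor starts at line 1, and lines are compared as strings. As in the one-liner, unrelated lines between two matches are tolerated, so this checks order, not adjacency.

```shell
# lines_in_order TARGET_FILE PATTERN_FILE   (hypothetical helper name)
# Exit status 0 if every line of PATTERN_FILE is found in TARGET_FILE in the
# same relative order, 1 otherwise. Assumes TARGET_FILE is not empty, because
# the FNR==NR test cannot tell an empty first file from the second file.
lines_in_order() {
    awk '
        BEGIN { last = 1 }                  # cursor into the target lines
        # First file (target): keep every line in memory, indexed by line number.
        FNR == NR { a[FNR] = $0; len = FNR; next }
        # Second file (pattern): look for the line at or after the cursor.
        {
            for (i = last; i <= len; i++) {
                # Appending "" forces a string comparison, so 1.0 differs from 1.
                if ((a[i] "") == ($0 "")) { last = i; next }
            }
            status = 1                      # not found after the previous match
            exit
        }
        END { exit status + 0 }
    ' "$1" "$2"
}
```

Called as lines_in_order file pattern, it keeps the argument order of the one-liner: the file to search comes first.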

Test

They match:

    $ tail f p
    ==> f <==
    222
    333
    555

    ==> p <==
    222
    333
    $ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
    0

They do not:

    $ tail f p
    ==> f <==
    333
    222
    555

    ==> p <==
    222
    333
    $ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
    1

With seq:

    $ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
    1
    $ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
    0

+2

fedorqui Jul 22 '15 at 11:08

Peter Cordes · Accepted Answer · 2015-07-22T05:50:09+0000

I have a working version using perl.

I thought I had this working with GNU awk, but I didn't. Setting RS to the empty string splits records on blank lines. See the edit history for the broken awk version.

How to search for a multi-line pattern in a file? shows how to use pcregrep, but I don't see a way to make it work when the search pattern may contain special regular expression characters. The -F fixed-string mode does not work usefully with multi-line mode: it still treats the pattern as a set of lines to be matched separately, not as one multi-line fixed string. I see you already used pcregrep in your attempt.

By the way, I think you have an error in your code in the non-sudo case:

    function writeToFile {
        if [ -w "$1" ] ; then
            "$2" >> "$1"    # probably you mean echo "$2" >> "$1"
        else
            echo -e "$2" | sudo tee -a "$1" > /dev/null
        fi
    }

In any case, attempts to use line-based tools have failed, so it's time to pull out a more serious programming language that does not force the newline convention on us. Just read both files into variables and use a non-regex search:

    #!/usr/bin/perl -w
    # multi_line_match.pl pattern_file target_file
    # exit(0) if a match is found, else exit(1)

    #use IO::File;
    use File::Slurp;
    my $pat = read_file($ARGV[0]);
    my $target = read_file($ARGV[1]);

    if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
        exit(0);
    }
    exit(1);
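For readers who would rather stay in the shell, the same test (pattern at the very start of the target, or right after a newline) can be sketched with a case statement. This is only a sketch under assumptions, not part of the Perl answer: the name contains_block is hypothetical, the files are assumed to be small enough to hold in shell variables, and they must not contain NUL bytes. Quoting "$pat" inside the case pattern makes the shell treat characters such as ?, * and # literally.

```shell
# contains_block PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if the content of PATTERN_FILE occurs in TARGET_FILE as one
# block starting at the beginning of a line, 1 otherwise.
contains_block() {
    local pat target nl
    nl='
'
    # $(...) strips trailing newlines, so append a sentinel "x" and remove it
    # again to keep each file byte-for-byte (apart from NUL bytes).
    pat=$(cat "$1"; printf x);    pat=${pat%x}
    target=$(cat "$2"; printf x); target=${target%x}
    case $target in
        # Quoted expansions are matched literally; only the bare * is a wildcard.
        "$pat"*|*"$nl$pat"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Because the trailing newline of the pattern file is preserved, a pattern ending in 333 followed by a newline does not match a target line 333*.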

See What is the best way to slurp a file into a string in Perl? to avoid the dependency on File::Slurp (which is not part of the standard perl distribution, nor of a default Ubuntu 15.04 install). I went with File::Slurp partly for readability, so that non-perl-geeks can see what the program does, compared to:

 my $contents = do { local(@ARGV, $/) = $file; <> };

I worked on avoiding reading the whole file into memory, using the idea from http://www.perlmonks.org/?node_id=98208 . I think a non-matching file would usually still be read in its entirety at once. In addition, the logic for handling a match at the beginning of the file was quite complicated, and I did not want to spend a lot of time testing to make sure it was correct in all cases. Here is what I had before giving up:

    #IO::File->input_record_separator($pat);
    $/ = $pat;  # pat must include a trailing newline if you want it to match one

    my $fh = IO::File->new($ARGV[2], O_RDONLY)
        or die 'Could not open file ', $ARGV[2], ": $!";
    $tail = substr($fh->getline, -1);  #fast forward to the first match
    #print each occurence in the file
    #print IO::File->input_record_separator while $fh->getline;

    #FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
    do {
        # fixme: need to check defined($fh->getline)
        if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
            exit(0);  # if there a 2nd line
        }
    } while($tail);

    exit(1);
    $fh->close;

Another idea was to filter both the pattern and the file to be searched through tr '\n' '\r' or something similar, so that each becomes a single line. (\r is probably a safe choice that will not collide with anything already in the file or the pattern.)
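A minimal sketch of that tr idea follows, with some assumptions stated up front: neither file contains \r or NUL bytes, the pattern file ends with a newline, and the whole target fits comfortably into one very long line for grep. The function name one_line_match is hypothetical. Prepending a \r to both sides is an addition of this sketch, so that the pattern can only match at the start of a line.

```shell
# one_line_match PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status as for grep: 0 if the pattern block is found, 1 if not.
one_line_match() {
    local pat
    # Turn the pattern into a single line; the leading \r stands for
    # "start of a line" once the target is converted the same way.
    pat=$(printf '\r'; tr '\n' '\r' < "$1")
    # Convert the target identically and search it as one fixed string.
    { printf '\r'; tr '\n' '\r' < "$2"; } | grep -qF -e "$pat"
}
```

Since grep -F is used, characters such as ?, * and # in the pattern are matched literally.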
