How to find a pattern and its surrounding content in a very large SINGLE-LINE file?

I have a very large file (100 MB+) where all the content is on a single line. I want to find a pattern in this file, along with a few characters on either side of it.

For example, I would like to invoke a command similar to the one below, but where -A and -B count bytes rather than lines:

cat very_large_file | grep -A 100 -B 100 somepattern

So, for a file containing such content:

1234567890abcdefghijklmnopqrstuvwxyz

With a pattern of

890abc
and a before size of -B 3 
and an after size of -A 3

I want it to return:

567890abcdef

Any advice would be great. Many thanks.

+5
3 answers

You can try the -o option:

-o, --only-matching
      Show only the part of a matching line that matches PATTERN.

and use a regex that matches your pattern plus the 3 preceding and 3 following characters, i.e.

grep -o -P ".{3}pattern.{3}" very_large_file 

In the example you specified, it will be

echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
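The command above can be wrapped so that the context sizes are parameters. This is only a sketch: the function name grep_context is made up for illustration, it assumes GNU grep built with PCRE support (-P), and it uses {0,N} instead of {N} so that a match closer than N characters to the start or end of the line is still printed.

```shell
# Hypothetical helper (not part of grep): print each match of PATTERN
# with up to BEFORE characters in front and up to AFTER characters behind.
# Assumes GNU grep with -P (PCRE) support.
grep_context() {
  pattern=$1
  before=$2
  after=$3
  file=$4
  # {0,N} instead of {N}: a match near the start or end of the line
  # still prints, just with less context.
  grep -o -P ".{0,${before}}${pattern}.{0,${after}}" "$file"
}

# Demo on the example from the question.
printf '%s\n' '1234567890abcdefghijklmnopqrstuvwxyz' > tmp.txt
grep_context 890abc 3 3 tmp.txt
```

Note that the pattern is interpolated into the regex as-is, so regex metacharacters in it keep their special meaning.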
+11

An alternative with sed (for systems where GNU grep is not available):

sed -n '
  s/.*\(...890abc...\).*/\1/p
  ' infile
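A parameterised form of the same idea might look like the sketch below. The function name sed_context is hypothetical, and it assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do). Because the leading .* is greedy, only the last match on each line is printed, and a match with fewer than BEFORE or AFTER characters around it is not printed at all.

```shell
# Hypothetical helper: print PATTERN with exactly BEFORE/AFTER characters
# of context, using sed -E instead of grep -P.
sed_context() {
  pattern=$1
  before=$2
  after=$3
  file=$4
  # The greedy leading .* means only the LAST match on a line is kept.
  sed -n -E "s/.*(.{${before}}${pattern}.{${after}}).*/\1/p" "$file"
}

# Usage: sed_context 890abc 3 3 infile
```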
+4

The best way I can think of to do this is with a tiny Perl script.

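#!/usr/bin/perl
use strict;
use warnings;

# Shift the arguments off @ARGV; otherwise <> would try to open
# them as input files instead of reading standard input.
my $pattern = shift;
my $before  = shift;
my $after   = shift;

while (<>) {
  # Print the first match on the line together with its context.
  print "$&\n" if /.{$before}$pattern.{$after}/;
}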

Then you execute it as follows:

cat very_large_file | ./myPerlScript.pl 890abc 3 3

EDIT: Dang, Paolo's solution is much simpler. Well, viva la Perl!
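Since the question asks for context counted in bytes, the same three-argument interface can also be sketched in plain shell without regex quantifiers. This is an illustration under assumptions, not a tested tool: byte_context is a made-up name, the pattern is treated as a fixed ASCII string (grep -F), and it relies on GNU grep's -b -o combination, which prints the byte offset of each match, plus tail -c and head -c to cut the bytes out.

```shell
# Hypothetical helper: print each fixed-string match with BEFORE bytes
# in front and AFTER bytes behind, located by byte offset.
# Assumes GNU grep (-b with -o reports the offset of the match itself)
# and an ASCII pattern, so ${#pattern} equals its length in bytes.
byte_context() {
  pattern=$1
  before=$2
  after=$3
  file=$4
  grep -b -o -F -- "$pattern" "$file" | while IFS=: read -r offset rest; do
    # Clamp the start at 0 when the match is near the beginning.
    start=$(( offset > before ? offset - before : 0 ))
    len=$(( offset - start + ${#pattern} + after ))
    # tail -c +N starts output at byte N (1-based); head -c keeps len bytes.
    tail -c "+$(( start + 1 ))" "$file" | head -c "$len"
    echo
  done
}

# Usage: byte_context 890abc 3 3 very_large_file
```

Each match triggers a separate read of the file from the start offset, so this is best suited to a small number of matches.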

+3