How to find a pattern and surrounding content in a very large SINGLE-line file?

Question

I have a very large file (100 MB+) whose entire content is on a single line. I want to find a pattern in this file, along with a few characters on either side of it.

For example, I would like to run a command similar to the one below, but with -A and -B counting bytes rather than lines:

cat very_large_file | grep -A 100 -B 100 somepattern

So, for a file containing such content:

1234567890abcdefghijklmnopqrstuvwxyz

With a pattern of

890abc
and a before size of -B 3
and an after size of -A 3

I want it to return:

567890abcdef

Any advice would be great. Many thanks.

+5

bash parsing

emson Oct 3 '11 at 18:38

3 answers

With sed (for example, if GNU grep is not available):

sed -n '
  s/.*\(...890abc...\).*/\1/p
  ' infile
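The sed command above hard-codes three characters of context with `...`. As a minimal sketch, it can be parameterized with POSIX interval syntax; the helper name `context_sed` and its argument order are hypothetical and not part of the answer. The second call illustrates a limitation worth knowing: because the leading `.*` is greedy, only the last occurrence on a line is printed.

```shell
#!/bin/sh
# context_sed PATTERN BEFORE AFTER [FILE...]
# Hypothetical helper built around the sed command above.
context_sed() {
  pattern=$1 before=$2 after=$3
  shift 3
  # \{N\} is the POSIX BRE interval syntax; \( \) captures the match
  # together with its context so that \1 can replace the whole line.
  sed -n "s/.*\(.\{${before}\}${pattern}.\{${after}\}\).*/\1/p" "$@"
}

# Sample from the question: prints 567890abcdef
printf '%s\n' '1234567890abcdefghijklmnopqrstuvwxyz' | context_sed 890abc 3 3

# Two occurrences on one line: only the last one, ccc890abcddd, is printed.
printf '%s\n' 'aaa890abcbbb___ccc890abcddd' | context_sed 890abc 3 3
```

If every occurrence is needed, an approach that reports each match separately is a better fit than this substitution.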

+4

Dimitre Radoulov Oct 3 '11 at 19:29

The best way I can think of to do this is with a tiny Perl script.

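#!/usr/bin/perl
use strict; use warnings;

# Shift the arguments off @ARGV; otherwise <> would try to open
# them as input files instead of reading standard input.
my $pattern = shift;
my $before  = shift;
my $after   = shift;

while (<>) {
  # Print every match on the line, one per output line.
  print "$&\n" while /.{$before}$pattern.{$after}/g;
}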

Then you execute it as follows:

cat very_large_file | ./myPerlScript.pl 890abc 3 3

EDIT: Dang, Paolo's solution is much simpler. Well, viva la Perl!

+3

Chriszuma Oct 3 '11 at 18:47

Paolo Tedesco · Accepted Answer · 2011-10-03T18:44:16+0000

You can try the -o option:

-o, --only-matching
      Show only the part of a matching line that matches PATTERN.

and use a regex that matches your pattern plus the 3 characters before and after it, i.e.

grep -o -P ".{3}pattern.{3}" very_large_file

For the example you gave, that would be:

echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
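One thing to keep in mind with `.{3}` is that it requires exactly three characters on each side, so a match closer than that to the start or end of the line is not reported at all. The sketch below is one way to relax this, assuming a grep that supports `-o` (GNU and BSD grep both do); it uses `-E` with `{0,N}` intervals instead of `-P`, and the helper name `context_grep` is hypothetical.

```shell
#!/bin/sh
# context_grep PATTERN BEFORE AFTER [FILE...]
# Hypothetical helper: prints each match of PATTERN together with up to
# BEFORE characters before it and up to AFTER characters after it.
context_grep() {
  pattern=$1 before=$2 after=$3
  shift 3
  # -o prints only the matched text; -E enables {m,n} interval syntax.
  # Using {0,N} rather than {N} keeps matches near the line boundaries.
  grep -o -E ".{0,${before}}${pattern}.{0,${after}}" "$@"
}

printf '%s\n' '1234567890abcdefghijklmnopqrstuvwxyz' > tmp.txt

# Sample from the question: prints 567890abcdef
context_grep 890abc 3 3 tmp.txt

# Pattern at the very start of the line: prints 123456
context_grep 123 3 3 tmp.txt
```

Note that `grep -o` reports non-overlapping matches, so two occurrences that lie within each other's context may come back as a single line of output.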
