Is there a truly universal wildcard in grep?

Question


Really basic question here. I'm told that a dot (.) matches any character EXCEPT a line break. I am looking for something that matches any character, including line breaks.

All I want to do is capture all the text on a web page between two specific lines, stripping the header and footer. Something like HEADER TEXT(.+)FOOTER TEXT, and then extract what is in the parentheses, but I cannot find a way to include ALL the text AND the line breaks between the header and footer. Does that make sense? Thanks in advance!

+6

regex bbedit

Tom b Dec 13 '09 at 19:04


7 answers

You can do this with Perl:

 $ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html

To print only the text between the delimiters, use

 $ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html

The /s switch makes the regular expression treat the entire string as a single line, which means the dot also matches newline characters, and /g means match as many times as possible.
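For readers who prefer Python, the same two ideas can be sketched with the standard `re` module. This is only an illustration, not part of the original answer: `re.DOTALL` stands in for /s, `re.findall` stands in for /g, and the `page` string is made-up sample data.

```python
import re

# Hypothetical sample text; HEADER TEXT / FOOTER TEXT are the question's placeholders.
page = (
    "intro\n"
    "HEADER TEXT\nfirst body\nline two\nFOOTER TEXT\n"
    "noise\n"
    "HEADER TEXT second FOOTER TEXT\n"
)

# re.DOTALL plays the role of /s: the dot also matches "\n".
# findall returns every non-overlapping match, much like /g.
pattern = re.compile(r"HEADER TEXT(.+?)FOOTER TEXT", re.DOTALL)
bodies = pattern.findall(page)

# Without DOTALL, only a header/footer pair on a single line can match.
single_line_only = re.findall(r"HEADER TEXT(.+?)FOOTER TEXT", page)

for body in bodies:
    print(repr(body))
```

The non-greedy `.+?` matters here: it stops at the nearest footer, so each header/footer pair is reported separately.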

The above examples assume that you are working with HTML files on a local disk. If you need to fetch them first, use get from LWP::Simple:

 $ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com"; print $1 while m!<head>(.+?)</head>!sg'

Please note that parsing HTML with regular expressions as shown above does not work in the general case! If you are writing a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.

+3

Greg bacon Dec 13 '09 at 19:09


By definition, grep looks for lines that match; it reads a line, checks whether it matches, and prints the line.

One possible way to do what you need is sed :

 sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"

This prints from the first line that matches "HEADER TEXT" to the first line that matches "FOOTER TEXT", and then repeats; the '-n' suppresses the default action of printing every line. This will not work correctly if the header and footer text appear on the same line.
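The line-range idea can be sketched in Python as a small state machine. This is a simplified illustration, not an exact emulation of sed: the helper name `lines_between` and the sample list are hypothetical, it uses plain substring tests instead of regular expressions, and it closes the range on the same line when both markers appear there, which sed does not do.

```python
def lines_between(lines, start_marker, end_marker):
    """Yield lines from one containing start_marker through the next one containing end_marker."""
    inside = False
    for line in lines:
        # Open the range when the start marker is seen.
        if not inside and start_marker in line:
            inside = True
        if inside:
            yield line
            # Close the range after emitting the line that holds the end marker.
            if end_marker in line:
                inside = False


# Hypothetical sample input, one list item per line of the file.
sample = ["top", "HEADER TEXT", "body 1", "body 2", "FOOTER TEXT", "bottom"]
selected = list(lines_between(sample, "HEADER TEXT", "FOOTER TEXT"))
print(selected)
```

Like the sed command, this keeps the marker lines themselves and restarts if a later line matches the header again.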

To do what you want, I would probably use Perl (but you could use Python if you prefer). I would consider slurping the whole file and then using a suitable regular expression to find the matching parts of the file. However, the Perl one-liner given by @gbacon is an almost exact transliteration of the sed script above into Perl, and is neater than slurping.

+3

Jonathan leffler Dec 13 '09 at 19:12


The grep man page says:

grep, egrep, fgrep, rgrep - print lines matching a pattern

grep is not designed to match more than a single line. You should try to solve this task with perl or awk.

+2

tangens Dec 13 '09 at 19:11


Since this is tagged "bbedit", and BBEdit supports Perl-style pattern modifiers, you can let the dot match line breaks with the (?s) switch:

(?s).

will match ANY character. And yes, (?s).+ will match the entire text.
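The effect of the inline switch can be checked outside BBEdit as well. The sketch below uses Python's `re` module, which is a different engine from BBEdit's but accepts the same `(?s)` prefix; the sample `text` is made up.

```python
import re

# Hypothetical sample text spanning several lines.
text = "HEADER TEXT\nline one\nline two\nFOOTER TEXT"

# A plain dot cannot cross the line break right after the header.
plain = re.search(r"HEADER TEXT(.+)FOOTER TEXT", text)

# With (?s) at the start of the pattern, the dot matches line breaks too.
dotall = re.search(r"(?s)HEADER TEXT(.+)FOOTER TEXT", text)

print(plain)
print(repr(dotall.group(1)))
```

The first search finds nothing, while the second captures everything between the two markers, line breaks included.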

+2

kaidoh Aug 9 '11 at 12:05


As noted elsewhere, grep works on individual lines.

For multiple lines (in Ruby using Regexp::MULTILINE, or in Python, awk, sed, whatever), "\s" should also capture line breaks, so

 HEADER TEXT(.*\s*)FOOTER TEXT

can work...

+1

phtrivier Dec 13 '09 at 19:09


Here is one way to do it with gawk, if you have it:

 awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file

0

ghostdog74 Dec 14 '09 at 12:02


Rubens farias · Accepted Answer · 2009-12-13T19:16:20+0000

When I need to match multiple characters, including line breaks, I do:

[\s\S]*?

Note: I am using a non-greedy pattern.
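As a rough illustration of why the non-greedy form matters, here is a Python sketch with made-up sample text. The class `[\s\S]` means "whitespace or non-whitespace", so it matches any character without needing a single-line flag.

```python
import re

# Hypothetical sample text with two header/footer pairs.
text = (
    "HEADER TEXT\nfirst\nFOOTER TEXT\n"
    "middle\n"
    "HEADER TEXT\nsecond\nFOOTER TEXT"
)

# Non-greedy: each match stops at the nearest footer.
lazy = re.findall(r"HEADER TEXT([\s\S]*?)FOOTER TEXT", text)

# Greedy: a single match runs from the first header to the last footer.
greedy = re.findall(r"HEADER TEXT([\s\S]*)FOOTER TEXT", text)

print(lazy)
print(greedy)
```

With only one header and one footer on the page, both forms capture the same text; the difference shows up once either marker repeats.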
