Sed regex stop on first match

Question

I want to replace part of the following HTML text (an excerpt from a huge file) to convert the formatting of an old forum to the usual phpBB formatting. The old markup is left over from a very poor job of porting the forum two years ago:

  <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>

This should be replaced by:

  [quote=User]

I used the following regex in sed:

  s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g

This works on this example, but in the actual file several such quotes may appear on the same line. In that case sed is too greedy: the match runs from the first <blockquote to the last </i> on the line, so everything in between is swallowed into a single [quote=...] tag. I cannot find a way to replace every occurrence of the pattern on the line. (I do not think there are any nested quotes, but that would make it even more difficult.)
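The greedy behaviour described above can be reproduced with a short test. This is only a sketch: the sample line is hypothetical and shortened (the font and hr tags are left out), and the variable names are made up for illustration.

```shell
# Hypothetical, shortened sample line with two quotes on it.
line='<blockquote id="quote"><i>written by Alice</i> hello <blockquote id="quote"><i>written by Bob</i> bye'

# The first .* is greedy: it runs from the first <blockquote up to the
# LAST "written by " on the line, so only one substitution happens.
greedy=$(printf '%s\n' "$line" | sed 's/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g')

printf '%s\n' "$greedy"
# prints: [quote=Bob] bye
```

The first quote and the text after it are lost, even though the g flag is set, because the single greedy match already covers both quotes.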

regex perl sed phpbb

Ewout Jun 09 '12 at 20:43

3 answers

I don't think sed supports non-greedy matching. You can try perl though:

 perl -pe 's/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g' filename

Hari menon Jun 09 '12 at 20:50

This might work for you:

 sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/<blockquote/g' file

Explanation:

If the line does not contain the pattern, leave it alone and skip the remaining commands. /<blockquote.*written by .*<\/i>/!b
Replace every <blockquote on the line with a newline, which cannot otherwise occur inside a line and so serves as a unique marker. s/<blockquote/\n/g
Replace each marker together with the rest of the pattern, using [^\n]* instead of .* so that a match can never run past the next marker. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
Turn any markers that were not part of a match back into the original <blockquote. s/\n/<blockquote/g
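The steps above can be tried on a small sample. This is a sketch that assumes GNU sed (it relies on \n in the replacement, \n inside a bracket expression, and b followed by a semicolon); the sample line is hypothetical and shortened, with a third blockquote that has no "written by" to show the restore step.

```shell
# Hypothetical sample: two real quotes plus one unrelated <blockquote>.
line='<blockquote id="quote"><i>written by Alice</i> hello <blockquote id="quote"><i>written by Bob</i> bye <blockquote>plain'

out=$(printf '%s\n' "$line" | sed '
  # Skip lines that do not contain the pattern at all.
  /<blockquote.*written by .*<\/i>/!b
  # Mark the start of every candidate with a newline.
  s/<blockquote/\n/g
  # Rewrite each marked quote; [^\n]* cannot cross into the next marker.
  s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
  # Restore the markers that were not part of a quote.
  s/\n/<blockquote/g
')

printf '%s\n' "$out"
# prints: [quote=Alice] hello [quote=Bob] bye <blockquote>plain
```

Within one marked segment the capture is still greedy, so a segment containing several </i> tags would capture up to the last of them.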

potong Jun 09 '12 at 21:41

tchrist · Accepted Answer · 2012-06-09T21:03:19+0000

You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.

The easiest way to do this is to simply use Perl in the first place.

If you have an existing sed script, you can convert it to Perl using the s2p(1) utility. Note that in Perl, you really want to use $1 on the right side of the s/// operator. \1 is grandfathered in for most cases, but in general you want $1 there:

 s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;

Notice that I removed the backslashes from in front of the parens. Another advantage of using Perl is that it uses sane egrep-style regexes (like awk), rather than the ugly grep-style ones (like sed) that require confusing and inconsistent backslashes all over the place.

Another benefit of using Perl is that you can use paired, nestable delimiters to avoid ugly backslashes. For instance:

 s{<blockquote.*?written by (.*?)</i>} {[quote=$1]}g;

Another advantage is that Perl works well with UTF-8 (now the encoding of most websites), and that you can do multi-line matches without the extreme pain that sed requires. For instance:

 $ perl -CSD -00 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...

-CSD makes Perl treat stdin, stdout, and files as UTF-8. -00 makes it read the input in paragraph mode, one blank-line-separated block at a time, rather than line by line (use -0777 instead to read each entire file in one fell slurp), and /s lets the dot match across newline boundaries as needed.
