Sed regex stop on first match

I want to replace parts of the following HTML (an excerpt from a huge file) in order to convert the old forum's formatting, left over from a very poor port of the forum two years ago, to the usual phpBB formatting:

<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i> 

This should be replaced by:

  [quote=User] 

I used the following regex in sed:

  s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g 

This works on this example, but in the actual file several such quotes may be on the same line. In that case sed is too greedy and puts everything between the first and last match into the [quote=...] tag. I cannot seem to make it replace every occurrence of this pattern on the line. (I do not think there are any nested quotes, but that would make it even more difficult.)
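
A minimal reproduction of the problem, as a sketch: the sample line and the user names Alice and Bob are made up, and the HTML is shortened compared to the real markup.

```shell
# Hypothetical sample line containing two quotes (Alice and Bob are made-up names).
line='<blockquote id="quote"><i>written by Alice</i> hi <blockquote id="quote"><i>written by Bob</i> bye'

# The original expression: the first .* is greedy, so it runs from the first
# <blockquote all the way to the LAST "written by " on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g')

printf '%s\n' "$greedy"
# prints: [quote=Bob] bye
```

The first quote is swallowed entirely instead of being converted on its own.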

3 answers

You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like minimal (non-greedy) matching or negative lookahead.

The easiest way to do this is to simply use Perl in the first place.

If you have an existing sed script, you can convert it to Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. \1 is still accepted there in most cases for backward compatibility, but in general you want $1:

 s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g; 

Notice that I removed the backslashes in front of the parentheses. Another advantage of using Perl is that it uses egrep-style regular expressions (like awk), rather than the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes everywhere.

Another benefit of using Perl is that you can use paired (bracketing) delimiters to avoid ugly backslashes. For instance:

 s{<blockquote.*?written by (.*?)</i>} {[quote=$1]}g; 

Another advantage is that Perl works well with UTF-8 (now the encoding of most websites), and that you can do multi-line matches without the extreme pain sed requires. For instance:

 $ perl -CSD -0777 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ... 

-CSD makes Perl treat stdin, stdout and files as UTF-8. -0777 makes it read each file in one fell slurp (-00 would only enable paragraph mode, splitting the input at blank lines), and /s lets the dot match across newline boundaries as needed.


I don't think sed supports non-greedy matching. You can try perl though:

 perl -pe 's/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g' filename 

This might work for you:

 sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file 

Explanation:

  • If the line does not contain the pattern, leave it untouched and move on. /<blockquote.*written by .*<\/i>/!b
  • Replace every occurrence of the front of the pattern with a newline, which cannot otherwise appear within a line. s/<blockquote/\n/g
  • Globally replace each newline marker plus the rest of the pattern, using [^\n]* instead of .* so that a match can never run past the next marker. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
  • Restore any markers that were not consumed by a replacement to the original text. s/\n/\<blockquote/g
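
The steps above can be tried out on a made-up line with two quotes. This is only a sketch: it assumes GNU sed (for `\n` in the pattern, the bracket expression and the replacement), and the names Alice and Bob and the shortened HTML are invented for illustration.

```shell
# Hypothetical sample line containing two quotes.
line='<blockquote id="quote"><i>written by Alice</i> hi <blockquote id="quote"><i>written by Bob</i> bye'

# Same command as in the answer, reading from stdin instead of a file.
# Each <blockquote becomes a newline marker, so [^\n]* cannot cross into the next quote.
marked=$(printf '%s\n' "$line" | sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g')

printf '%s\n' "$marked"
# prints: [quote=Alice] hi [quote=Bob] bye
```

Note that the captured name is still matched greedily within its own segment, so a second `</i>` between a name and the next `<blockquote` would end up inside the captured text.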
