How do I highlight repeated words with a Perl regular expression?

I want a Perl regex that matches duplicate words in a string.

Given the following input:

$str = "Thus joyful Troy Troy maintained the the watch of night...";

I need the following output:

  Thus joyful [Troy Troy] maintained [the the] watch of night...
regex perl
4 answers

This works:

 $str =~ s/\b((\w+)\s+\2)\b/[\1]/g; 

This is similar to one of the Learning Perl exercises. The trick is to catch every repetition of a word, not just the first pair, so the duplicated part needs a “one or more” quantifier:

  $str = 'This is Goethe the the the their sentence';
  $str =~ s/\b((\w+)(?:\s+\2\b)+)/[\1]/g;
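To see what the quantifier buys, here is a minimal sketch that runs the pair-only pattern from the earlier answer and the "one or more" pattern over the same test string. The sub names `mark_pairs` and `mark_runs` are made up for this illustration, and the replacement side uses `$1` where the answers write `\1`, which is the form perlre recommends.

```perl
use strict;
use warnings;

# Hypothetical helper: bracket only one adjacent pair of a repeated word.
sub mark_pairs {
    my ($s) = @_;
    $s =~ s/\b((\w+)\s+\2)\b/[$1]/g;
    return $s;
}

# Hypothetical helper: bracket the whole run, however often the word repeats.
sub mark_runs {
    my ($s) = @_;
    $s =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
    return $s;
}

my $str = 'This is Goethe the the the their sentence';

print mark_pairs($str), "\n";   # This is Goethe [the the] the their sentence
print mark_runs($str),  "\n";   # This is Goethe [the the the] their sentence
```

The leading `\b` keeps the "the" inside "Goethe" from starting a match, and the `\b` after `\2` keeps "their" from counting as another "the".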

The features I'm about to use are documented in perlre when they apply to the pattern, or in perlop when they affect how the substitution operator does its job.

If you like, you can use the /x flag to add insignificant whitespace and comments:

  $str =~ s/ \b ( (\w+) (?: \s+ \2 \b )+ ) /[\1]/xg; 

I don't like that \2, though, because I hate counting absolute positions. In Perl 5.10 I can use relative backreferences instead: \g{-1} refers to the immediately preceding capture group:

  use 5.010;
  $str =~ s/ \b ( (\w+) (?: \s+ \g{-1} \b )+ ) /[\1]/xg;

Counting relative positions isn't much better, so I can use named captures:

  use 5.010;
  $str =~ s/ \b ( (?<word>\w+) (?: \s+ \k<word> \b )+ ) /[\1]/xg;

I can name the first capture ($1) as well and read its value from %+ later:

  use 5.010;
  $str =~ s/ \b (?<dups> (?<word>\w+) (?: \s+ \k<word> \b )+ ) /[$+{dups}]/xg;
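The named captures are also readable outside a substitution. As a rough sketch, the same pattern can drive a match loop that reports each repeated word along with the full run it was found in. The sub name `find_runs` and the "word => run" output format are invented for this example.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: return one "word => run" string per repeated run.
sub find_runs {
    my ($s) = @_;
    my @found;
    # In scalar context, /g resumes after the previous match on each pass,
    # and %+ holds the named captures of the latest successful match.
    while ($s =~ / \b (?<dups> (?<word>\w+) (?: \s+ \k<word> \b )+ ) /xg) {
        push @found, "$+{word} => $+{dups}";
    }
    return @found;
}

say for find_runs('Thus joyful Troy Troy maintained the the the watch');
# Troy => Troy Troy
# the => the the the
```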

I shouldn't really need that first capture, because it just refers to everything that matched. Unfortunately, it seems that ${^MATCH} isn't set early enough for me to use it on the replacement side. I think that's a bug. This should work, but it doesn't:

  $str =~ s/ \b (?<word>\w+) (?: \s+ \k<word> \b )+ /[${^MATCH}]/pgx; # DOESN'T WORK 

I'm testing this against blead, but it will take a little while to compile on my tiny machine.
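If the only goal is to drop the outer capture, one workaround to try is the older `$&` variable, which holds the whole matched text and can be interpolated on the replacement side. This swaps in a different technique from the one above, and the sub name `mark_whole_match` is made up for the sketch. As far as I understand perlvar, mentioning `$&` anywhere in a program can slow down other matches on perls older than 5.20, so treat it as a trade-off and not a recommendation.

```perl
use strict;
use warnings;

# Hypothetical helper: bracket each run of a repeated word via $&.
sub mark_whole_match {
    my ($s) = @_;
    # $& is the entire matched text, so the pattern needs only the one
    # capture group that the backreference \1 points at.
    $s =~ s/ \b (\w+) (?: \s+ \1 \b )+ /[$&]/xg;
    return $s;
}

print mark_whole_match('Thus joyful Troy Troy maintained the the watch of night...'), "\n";
# Thus joyful [Troy Troy] maintained [the the] watch of night...
```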


You can try:

 $str = "Thus joyful Troy Troy maintained the the watch of night...";
 $str =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;
 print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...

Regex used: \b(\w+)\s+\1\b

Explanation:

  • \b : word boundary
  • \w+ : a word (one or more word characters)
  • () : capture (remember) the matched word
  • \s+ : one or more whitespace characters
  • \1 : backreference to the captured word

It effectively finds two identical whole words separated by whitespace and puts [ ] around them.

EDIT:

If you want to preserve the whitespace between the words exactly as it appears, you can use:

 $str =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
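A small sketch of the difference between the two replacements, using an input whose repeated words are separated by more than one space. The sub names `mark_normalized` and `mark_preserved` and the sample string are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuilds the pair with exactly one space between.
sub mark_normalized {
    my ($s) = @_;
    $s =~ s{\b(\w+)\s+\1\b}{[$1 $1]}g;
    return $s;
}

# Hypothetical helper: captures the whitespace in $2 and puts it back.
sub mark_preserved {
    my ($s) = @_;
    $s =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
    return $s;
}

my $str = 'Troy   Troy held the  the watch';

print mark_normalized($str), "\n";   # [Troy Troy] held [the the] watch
print mark_preserved($str),  "\n";   # [Troy   Troy] held [the  the] watch
```

Both patterns match a word only twice, so a triple such as "the the the" would have just its first pair bracketed.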

Try the following:

 $str =~ s/\b((\S+)\b(?:\s+\2\b)+)/[$1]/g;