How to identify doubled characters with a regular expression?

This is a regular expression puzzle. I have a list of words with doubled letters, for example:

stubbornness
raccoon
cooccurred
successful

Note that each of the words has two or more sets of doubled letters, for example "bb" in "stubbornness". I already wrote my script (in Ruby), and I can solve the problem by iterating over each character in a loop.

However, this puzzle caught my attention... I wonder whether it can be done with a regular expression. I have already consulted regular expression tutorials and other StackOverflow questions, but I can't figure out how to report the doubled letters. Here is the desired result:

bb stubbornness
cc raccoon
oo cooccurred
cc successful

Try this (sed on macOS; on Ubuntu use -r instead of -E):

sed -E 's#(.*?)(.)\2(.*)#\2\2 \1\2\2\3#g'



RegEx:

(.*?)((\w)\3)(.*)

Replacement:

\2 \1\2\4

Live Demo on Regex101
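
To try the pattern outside Regex101, here is a minimal Python sketch of the same substitution. It assumes Python's re module, where the lazy `.*?` is supported; the function name `first_double` is made up for this illustration and is not part of the answer.

```python
import re

# Lazy prefix, a doubled word character, then the rest of the word.
PATTERN = re.compile(r'(.*?)((\w)\3)(.*)')

def first_double(word):
    """Return 'xx word' for the first doubled letter, or the word unchanged."""
    # count=1 makes the single substitution per word explicit.
    return PATTERN.sub(r'\2 \1\2\4', word, count=1)

for w in ["stubbornness", "raccoon", "cooccurred", "successful"]:
    print(first_double(w))
```

Because the prefix group is lazy, the engine stops at the earliest doubled letter, which is why "cooccurred" is reported with "oo" rather than "cc" or "rr".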


As @Kent pointed out, sed does not support the lazy .*?, so here is a RegEx that avoids it:

(
(?!(\w)\2)       # DO NOT Match if there are double letters
.                # Data before the double letters
)*
((\w)\4)         # Double Letter
(.*)             # Data after letters

# SHORTER REGEX (1 LINE)
((?!(\w)\2).)*((\w)\4)(.*)

Replacement:

\3 \0

Live Demo on Regex101
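
The lookahead variant can be checked the same way. This is a sketch assuming Python's re module; note that Python writes the whole match as `\g<0>` in the replacement, where the answer above uses `\0`. The name `first_double_no_lazy` is hypothetical.

```python
import re

# Group 1: prefix characters, each of which must NOT start a doubled letter.
# Group 3: the doubled letter itself. Group 5: everything after it.
PATTERN = re.compile(r'((?!(\w)\2).)*((\w)\4)(.*)')

def first_double_no_lazy(word):
    """Return 'xx word' for the first doubled letter, or the word unchanged."""
    # \g<0> is the whole match, so the original word is reproduced intact.
    return PATTERN.sub(r'\3 \g<0>', word, count=1)

for w in ["stubbornness", "raccoon", "cooccurred", "successful"]:
    print(first_double_no_lazy(w))
```

The negative lookahead does the job of the lazy quantifier: the prefix cannot consume a character that begins a pair, so the first pair is always the one captured.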


With GNU sed (plus rev, a standard Unix utility):

 sed -r 's/.*(.)\1.*/echo "\1\1 $(echo \0|rev)"/ge' <(rev file)

Test:

kent$  cat f
stubbornness
raccoon
cooccurred
successful

kent$  sed -r 's/.*(.)\1.*/echo "\1\1 $(echo \0|rev)"/ge'  <(rev f)
bb stubbornness
cc raccoon
oo cooccurred
cc successful

Another sed variant, which requires two pairs of doubled letters on the line:

sed 's#\(.*\)\(.\)\2\(.*\)\(.\)\4#\2\2 &#'

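This works with plain sed (no -r/-E needed). Note that the leading \(.*\) is greedy, so \2 captures the last pair that is still followed by another pair; for a word with three pairs, such as cooccurred, that is cc rather than oo.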


How about rev with GNU sed -r?

$ rev file | sed -r 's/(.*((.)\3).*)/& \2/' | rev
bb stubbornness
cc raccoon
oo cooccurred
cc successful
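
The rev trick works because a greedy `.*` lands on the last doubled letter of a line, and the last pair of the reversed word is the first pair of the original. A small Python sketch of that idea follows; both function names are invented for the illustration.

```python
import re

def last_double(word):
    """Return the last doubled character pair in word, or None."""
    # Greedy .* consumes as much as possible before the pair.
    m = re.search(r'.*((.)\2)', word)
    return m.group(1) if m else None

def first_double_via_reverse(word):
    """Find the first pair by searching the reversed word for its last pair."""
    # A doubled pair reads the same in both directions, so no need to
    # reverse the result again.
    return last_double(word[::-1])

print(last_double("stubbornness"))               # last pair of the word
print(first_double_via_reverse("stubbornness"))  # first pair of the word
```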

FWIW, on UNIX you could also do this with awk instead of Ruby or sed:

$ awk -v FS= '{p=""; for (i=1;i<=NF;i++) { if ($i==p) {print p $i, $0; next} p=$i } }' file
bb stubbornness
cc raccoon
oo cooccurred
cc successful

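That relies on an empty FS splitting each line into characters, which not all awks support; this version works in other awks: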

$ awk '{p=""; for (i=1;i<=length($0);i++) { c=substr($0,i,1); if (c==p) {print p c, $0; next} p=c } }' file
bb stubbornness
cc raccoon
oo cooccurred
cc successful
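
For comparison, the same character-by-character scan can be written in Python. This is only a sketch mirroring the awk loops above (and the iterative approach mentioned in the question); the name `first_double_scan` is hypothetical.

```python
def first_double_scan(word):
    """Return the first doubled character pair in word, or None."""
    prev = ""
    for ch in word:
        if ch == prev:
            # Current character repeats the previous one: report the pair.
            return prev + ch
        prev = ch
    return None

for w in ["stubbornness", "raccoon", "cooccurred", "successful"]:
    pair = first_double_scan(w)
    if pair is not None:
        print(pair, w)
```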

Perl one-liner:

> perl -ne 'print @_[0,0], " $_" if (@_ = /(.)\1/g) > 1' < words.txt
bb stubbornness
cc raccoon
oo cooccurred
cc successful
> 

This prints nothing for input words that do not contain at least two sets of double letters. It would not take much more to list all the doubles found in a word, and it is easy to adjust for three sets:

> perl -ne 'print @_[0,0], " $_" if (@_ = /(.)\1/g) > 2' < wordlist.txt
ss Mississippi
ss Mississippian
ll Tallahassee
nn Tennessee
dd addressee
tt bitterroot
oo bookkeep
mm committee
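
The counting logic of the Perl one-liners can be sketched in Python as follows. This assumes Python's re module; `find_doubles`, `doubles_report`, and the `min_pairs` parameter are names introduced here for illustration, with `min_pairs` playing the role of the `> 1` or `> 2` comparison.

```python
import re

def find_doubles(word):
    """Return every non-overlapping doubled pair in word, in order."""
    return [m.group(0) for m in re.finditer(r'(.)\1', word)]

def doubles_report(word, min_pairs=2):
    """Return 'xx word' if word has at least min_pairs pairs, else None."""
    pairs = find_doubles(word)
    if len(pairs) < min_pairs:
        return None
    # Report the first pair, as the Perl version does with @_[0,0].
    return f"{pairs[0]} {word}"

print(find_doubles("Mississippi"))
print(doubles_report("raccoon"))
print(doubles_report("Mississippi", min_pairs=3))
```

Keeping the full list of pairs around also covers the case of listing all the doubles found in one word.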

If you are looking for at least one pair of double letters, the problem becomes simpler:

perl -ne 'print "$& $_" if /(.)\1/' < wordlist.txt | tail
ll yellowish
ll you'll
tt ytterbium
tt yttrium
cc yucca
gg zigzagging
oo zoo
oo zoology
oo zoom
cc zucchini