Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg' etc.

Hey, I love regular expressions, but I'm just not very good at them.

I have a list of 400 abbreviated words like lol, omg, lmao ... etc. Whenever someone types one of these abbreviations, it is replaced with its English counterpart ([laughter] or something like that). The trouble is that people are annoying and type these short words with the last letter(s) repeated x times.

examples: omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol

I was wondering if anyone could pass me a regex (in Python, preferably) to handle this?

Thanks to everyone.

(This is for a Twitter-related topic identification project, if anyone is interested. If someone tweets "Let's go shoot some hoops", you know the tweet is about basketball, etc.)

2 answers

FIRST APPROACH -

Well, using regular expression(s), you can do this:

    import re

    re.sub('g+', 'g', 'omgggg')  # 'omg'
    re.sub('l+', 'l', 'lollll')  # 'lol'

and so on.

Let me point out that using regular expressions is a very fragile and basic approach to solving this problem. Users can easily produce strings that break the above regular expressions. What I'm trying to say is that this approach requires a lot of maintenance: you have to watch for the patterns of mistakes users make and then write regular expressions for each specific case.
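To make the first approach a little less blunt, the pattern can be built per abbreviation so that only extra copies of the last letter at the end of a whole word are collapsed. This is a minimal sketch, not a definitive implementation: ABBREVIATIONS is a hypothetical sample of the 400-word list, and the helper names are made up for illustration.

```python
import re

# Hypothetical sample; the real list from the question has about 400 entries.
ABBREVIATIONS = ["lol", "omg", "lmao"]


def trailing_repeat_pattern(abbr):
    """Match the abbreviation as a whole word, plus extra copies of its last letter."""
    last = re.escape(abbr[-1])
    return re.compile(r"\b" + re.escape(abbr) + last + r"*\b", re.IGNORECASE)


def collapse(text):
    """Replace stretched forms such as 'omgggg' with the plain abbreviation."""
    for abbr in ABBREVIATIONS:
        text = trailing_repeat_pattern(abbr).sub(abbr, text)
    return text


print(collapse("omgggg lollll lmaooo"))
print(collapse("lololol egg"))
```

Unlike a bare 'g+', this leaves ordinary words such as egg alone. It still does not handle lololol, which illustrates the maintenance problem: every new repetition style needs its own pattern.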

SECOND APPROACH -

Instead, have you considered using the difflib module? It is a module with helpers for computing deltas between objects. SequenceMatcher is of particular interest to you here. To paraphrase the official documentation -

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous and junk-free matching subsequence.

    import difflib as dl

    # Treat spaces as junk; compare in both directions and average the ratios.
    x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
    y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
    avg = (x.ratio() + y.ratio()) / 2.0
    if avg >= 0.6:
        print('Match!')
    else:
        print('Sorry!')

According to the documentation, a ratio() value over 0.6 means the sequences are close matches. You may need to tune the threshold for your data. If you need a stricter match, I find that any value greater than 0.8 serves well.
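For checking a word against the whole abbreviation list, difflib also provides get_close_matches, which is built on SequenceMatcher and applies the same kind of ratio cutoff (0.6 by default). The following is only a sketch under assumptions: ABBREVIATIONS is a hypothetical sample of the list, and closest_abbreviation is a made-up helper name.

```python
import difflib

# Hypothetical sample; the real list from the question has about 400 entries.
ABBREVIATIONS = ["lol", "omg", "lmao"]


def closest_abbreviation(word, cutoff=0.6):
    """Return the best-scoring abbreviation for word, or None if nothing reaches cutoff."""
    matches = difflib.get_close_matches(word.lower(), ABBREVIATIONS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_abbreviation("omgggg"))
print(closest_abbreviation("hoops"))
```

The threshold really does need tuning. An ordinary word like log scores about 0.67 against lol and is accepted at the default cutoff, while a long stretch such as omggggggggg drops to about 0.43 and is rejected.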


What about

 \b(?=lol)\S*(\S+)(?<=\blol)\1*\b 

(replace lol with omg, haha, etc.)

This will match lol, lololol, lollll, lollollol, etc., but not lolo, lollllo, lolly, etc.

Rules:

  • Match the word lol in full.
  • Then allow any number of repetitions of one or more characters from the end of the word (i.e. l, ol or lol).

So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg, etc.

In Python with comments:

    import re

    # subject holds the text to clean up
    result = re.sub(
        r"""(?ix)\b    # assert position at a word boundary
        (?=lol)        # assert that "lol" can be matched here
        \S*            # match any number of characters except whitespace
        (\S+)          # match at least one character (to be repeated later)
        (?<=\blol)     # until we have reached exactly the position after the 1st "lol"
        \1*            # then repeat the preceding character(s) any number of times
        \b             # and ensure that we end up at another word boundary""",
        "lol", subject)

This will also match the "unadorned" version (i.e. lol without any repetition). If you do not want this, use \1+ instead of \1*.
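To cover the whole list, the same pattern can be generated for each abbreviation and paired with its replacement. This is a rough sketch only: the EXPANSIONS mapping and its bracketed replacement texts are hypothetical stand-ins for the real 400-entry list, and build_pattern and expand are made-up helper names.

```python
import re

# Hypothetical mapping from abbreviation to its English replacement.
EXPANSIONS = {
    "lol": "[laughter]",
    "omg": "[oh my god]",
    "haha": "[laughter]",
}


def build_pattern(abbr):
    """Build the lookahead/lookbehind pattern shown above for one abbreviation."""
    word = re.escape(abbr)
    return re.compile(
        r"\b(?=" + word + r")\S*(\S+)(?<=\b" + word + r")\1*\b",
        re.IGNORECASE,
    )


# Compile once, since the same patterns are reused for every tweet.
PATTERNS = [(build_pattern(abbr), text) for abbr, text in EXPANSIONS.items()]


def expand(text):
    """Replace each abbreviation, stretched or not, with its expansion."""
    for pattern, expansion in PATTERNS:
        # A function replacement avoids backslash processing in the expansion text.
        text = pattern.sub(lambda match, e=expansion: e, text)
    return text


print(expand("omgggg that was funny lololol"))
print(expand("hahahaha lolly"))
```

With 400 entries this runs 400 substitutions per tweet, so it may be worth measuring before using it on a large stream.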
