Regex Game - Replace every word except a given one with a variable number of characters

Hey you Regex lovers!

I am into regex at the moment and ran into a purely theoretical problem. To keep it simple, I will present it as a game.

The game:
Say you have a list of words separated by spaces.
A word is defined by the regular expression [a-zA-Z_0-9]+ (so there is no empty word).
List example:
Horse Banana Joker RoXx0r A_Long_Word Joker 1337

I want you to replace every word except Joker with a run of $ characters as long as the matched word.
With our previous list we get:
$$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$

In fewer words: I want the regex to match every character that does not belong to the word "Joker" (meaning Joker as a whole word on the line, not its letters inside another word).
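
For reference, the target behaviour can be pinned down with a replacement callback. This is only a sketch in JavaScript and not an entry for the game, since a callback goes against the single-regex spirit of it. The function name maskExceptJoker is made up for illustration.

```javascript
// Reference behaviour only: the callback decides per word, so this is NOT a valid entry.
// maskExceptJoker is a hypothetical helper name.
function maskExceptJoker(line) {
  // Without the i/u flags, \w is the same set as the word definition [a-zA-Z_0-9].
  return line.replace(/\w+/g, function (word) {
    return word === 'Joker' ? word : '$'.repeat(word.length);
  });
}

console.log(maskExceptJoker('Horse Banana Joker RoXx0r A_Long_Word Joker 1337'));
// $$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$
```

It is handy as an oracle when checking candidate patterns against new test lines.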

Although this is not easy, it is not impossible (I have my own regex for it). That is why I am setting some rules.

Rules:

  • This needs to be done with only 1 regular expression
  • I will not accept any regex that only works in certain languages
  • I still accept the most common features, such as conditionals, lookarounds, etc., even if a few languages do not support them
  • Recursion is not allowed (but if you have a working recursive regex, submit it anyway, just for the beauty of it ^^)
  • The regex should be optimized for performance
  • If your regex matches (get it? ;)) these rules but does not satisfy me, I will feel free to add a few more rules

Added rules:

  • None



To help you, here are a few lines to run your regex on:
Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers
Must return after replacement:
$$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$

Joker Joker Joker
Must return after replacement:
Joker Joker Joker

Again, merely solving the problem is not the goal here. I want to see different solutions and, more importantly, I want to see the best one!


Solutions:

A very elegant one from Casimir et Hippolyte:
(?:\G(?!^)|(?<!\S)(?!Joker(?:\s|$)))\S (replace: $ )
View Post
\G handles the problem perfectly, but it does not work in all languages, so I cannot accept it unless there is a way to build a custom equivalent of \G.

The almost-accepted answer, also from Casimir et Hippolyte:
((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)? (replace: $1$$2 )
View Post
It does not work when the line contains only the word Joker.

A similar solution from ClasG:
(\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w (replace: $1$$2 )
See message
It does not work when the line contains only the word Joker.

Another from ClasG :
[^Joker\s]|(?<!\b)J|J(?!oker\b)|(?<!\bJ)o|o(?!ker\b)|(?<!\bJo)k|k(?!er\b)|(?<!\bJok)e|e(?!r\b)|(?<!\bJoke)r|r(?!\b) (replace: $ )
View Post
Not very efficient, but it is another way of looking at the problem ;)

I came up with a similar regex after reading Rahul's comment below:
(?(?<=\b|\bJ|\bJo|\bJok|\bJoke|\bJoker)(?!(?:Joke|oke|ke|e|)r\b)\w|\w) ( replace $ )
Regex101
It is also inefficient, but it relies on the same reasoning :)

Here is my first solution:
I use a trick that could be considered cheating, but I do not think it is, because it does not change the function you use to replace the characters. You just need to append "$" to the end of the line before running the replacement on it.
Therefore, instead of something like:
string = replace(string, regex, '$1$2')
We would have:
string = replace(string+'$', regex, '$1$2')

So here is the regex:
(\bJoker\b)|.$|\w(?=.*(\$)) (replace: $1$2 )
Regex 101
This should work in all languages except those that do not support lookaheads (they are quite rare).
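
A minimal JavaScript sketch of the trick, using the pattern and replacement quoted above. The wrapper name maskWithSentinel is hypothetical, and the input is assumed to be a single line.

```javascript
// Pattern and replacement come from the post; maskWithSentinel is a made-up wrapper name.
var sentinelPattern = /(\bJoker\b)|.$|\w(?=.*(\$))/g;

function maskWithSentinel(line) {
  // The appended "$" gives group 2 something to capture.
  // Alternative 1 keeps a whole-word Joker (re-inserted through $1).
  // Alternative 2 matches the last character of the string, i.e. the sentinel, and drops it.
  // Alternative 3 matches one word character and swaps it for the captured "$" (through $2).
  return (line + '$').replace(sentinelPattern, '$1$2');
}

console.log(maskWithSentinel('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
// $$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$
console.log(maskWithSentinel('Joker Joker Joker'));
// Joker Joker Joker
```

In JavaScript a group that did not take part in the match expands to an empty string, which is what lets a single replacement string serve all three alternatives.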


Keep posting new regexes if you find any, I want to see more ways to do this! :)

+7
regex
3 answers

For PCRE / Perl / Ruby / Java / .net

find:

 (?:\G(?!^)|(?<!\S)(?!Joker(?!\S)))\S 

replace:

 $ 

demo

more details:

 (?:
     \G (?!^)          # contiguous to a previous match (but not at the start of the string)
   |                   # OR
     (?<!\S)           # not preceded by a non-whitespace character
     (?!Joker(?!\S))   # not followed by the forbidden word
 )
 \S                    # a non-whitespace character

If your words consist only of word characters, you can simplify the pattern by playing with word boundaries and non-word boundaries: (?:\G\B|\b(?!Joker\b))\w


Another way (PCRE / Perl): without \G, using the backtracking control verb (*SKIP) (fewer steps are needed):

 \s*(?:Joker(?:\s+|$))*(*SKIP)\K. 

The (*SKIP) verb is only useful when the line ends with the forbidden word or with whitespace. You can also replace it with (*COMMIT).

demo

or

 \bJoker\b(*SKIP)(*F)|\S 

and with the regex module for Python from PyPI (which has one word-boundary anchor for the start of a word, \m, and one for the end, \M):

 \mJoker\M(*SKIP)(*F)|\S 
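
JavaScript has no backtracking control verbs, but the idea behind \bJoker\b(*SKIP)(*F)|\S (consume the forbidden word first so that its characters are never offered to \S) can be approximated with a replacement callback. A rough sketch under that assumption; it is not a valid entry under the rules of the game, and skipJokerReplace is a hypothetical name.

```javascript
// Approximation of "\bJoker\b(*SKIP)(*F)|\S" for engines without control verbs.
// skipJokerReplace is a made-up name; the callback stands in for (*SKIP)(*F).
function skipJokerReplace(line) {
  return line.replace(/\bJoker\b|\S/g, function (match) {
    // The first alternative can only produce the text "Joker", which is handed back unchanged.
    // The second alternative produces a single non-whitespace character, which becomes "$".
    return match === 'Joker' ? match : '$';
  });
}

console.log(skipJokerReplace('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
// $$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$
```

The order of the alternatives matters: Joker has to be tried before \S, otherwise its J would be consumed as an ordinary character.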

One that works with JavaScript (but only if there is something to replace):

find:

 ((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)? 

replace: (backreference to group1, escaped $, backreference to group2)

 $1$$$2 

demonstration
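
The pattern and replacement above can be tried as follows. This is only a sketch: the function name maskCapturing is hypothetical, and the second call illustrates the limitation on lines made up of Joker alone.

```javascript
// Pattern and replacement are copied from the answer; maskCapturing is a made-up name.
var capturingPattern = /((?:\s+|\bJoker\b)*)\S((?:\s+Joker)*\s*$)?/g;

function maskCapturing(line) {
  // '$1$$$2' reads as: group 1, a literal "$" (written "$$"), group 2.
  return line.replace(capturingPattern, '$1$$$2');
}

console.log(maskCapturing('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
// $$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$

// \S is mandatory, so on a Joker-only line the engine backtracks until \S can take
// a character out of a Joker, and the line does not come back unchanged.
console.log(maskCapturing('Joker Joker Joker') === 'Joker Joker Joker'); // false
```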


Another version for JavaScript that uses the sticky flag y (which forces matches to be contiguous), but unfortunately this flag is not supported by Internet Explorer, Safari, and mobile browsers other than Firefox mobile:

 var strs = [
   'Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker',
   'Joker Joker Joker'
 ];
 strs.forEach(function (s) {
   // y (sticky): each match must start exactly where the previous one ended
   console.log(s.replace(/(?=((?:\s+|\bJoker\b)*))\1./gy, '$1$$'));
 });

(?=(...))\1 emulates an atomic group (which prohibits backtracking).
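
To see what the emulated atomic group buys, the same sticky pattern can be compared with and without it on a Joker-only line. A small sketch, assuming an engine that supports the y flag; the variable names are made up.

```javascript
var jokerLine = 'Joker Joker Joker';

// Plain group: when "." fails at the end of the line, the engine backtracks into the
// group, gives back the last Joker, and lets "." match its first letter.
var plainResult = jokerLine.replace(/((?:\s+|\bJoker\b)*)./gy, '$1$$');

// Emulated atomic group: the lookahead captures the longest run once, there is no
// backtracking into a lookahead, and \1 must match exactly the captured text.
var atomicResult = jokerLine.replace(/(?=((?:\s+|\bJoker\b)*))\1./gy, '$1$$');

console.log(plainResult === jokerLine);  // false
console.log(atomicResult === jokerLine); // true
```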

+4

OK, here I go again ;) This time with a complete solution that should work in most regex flavors (except for JS). It is not very flexible, but it works:

 [^Joker\s]|(?<!\b)J|J(?!oker\b)|(?<!\bJ)o|o(?!ker\b)|(?<!\bJo)k|k(?!er\b)|(?<!\bJok)e|e(?!r\b)|(?<!\bJoke)r|r(?!\b) 

or more readable

 [^Joker\s]               # Test for any character not belonging to the word Joker
 | (?<!\b)J|J(?!oker\b)   # Test for J not belonging to the word Joker
 | (?<!\bJ)o|o(?!ker\b)   # Test for o not belonging to the word Joker
 | (?<!\bJo)k|k(?!er\b)   # Test for k not belonging to the word Joker
 | (?<!\bJok)e|e(?!r\b)   # Test for e not belonging to the word Joker
 | (?<!\bJoke)r|r(?!\b)   # Test for r not belonging to the word Joker

It matches each letter of the word Joker separately, using lookbehinds and lookaheads to make sure the letter is not part of that word. The first alternative matches any character that is neither one of the letters of the word nor whitespace.

Replacing matches with $ does the job.
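
JavaScript gained lookbehind assertions in ES2018, so the pattern can also be tried in a recent engine. A sketch under that assumption; the pattern is assembled from the same alternatives as above, and maskLetterwise is a hypothetical name.

```javascript
// Requires lookbehind support (ES2018 or later). maskLetterwise is a made-up name.
var letterwisePattern = new RegExp([
  '[^Joker\\s]',                  // any character other than J, o, k, e, r and whitespace
  '(?<!\\b)J', 'J(?!oker\\b)',    // a J that is not the J of the word Joker
  '(?<!\\bJ)o', 'o(?!ker\\b)',    // an o that is not the o of the word Joker
  '(?<!\\bJo)k', 'k(?!er\\b)',    // a k that is not the k of the word Joker
  '(?<!\\bJok)e', 'e(?!r\\b)',    // an e that is not the e of the word Joker
  '(?<!\\bJoke)r', 'r(?!\\b)'     // an r that is not the r of the word Joker
].join('|'), 'g');

function maskLetterwise(line) {
  // "$$" is how a literal "$" is written in a JavaScript replacement string.
  return line.replace(letterwisePattern, '$$');
}

console.log(maskLetterwise('Horse Banana Joker RoXx0r A_Long_Word Joker 1337 Joke Poker Joker Jokers'));
// $$$$$ $$$$$$ Joker $$$$$$ $$$$$$$$$$$ Joker $$$$ $$$$ $$$$$ Joker $$$$$$
console.log(maskLetterwise('Joker Joker Joker'));
// Joker Joker Joker
```

Each letter of Joker survives only when both of its tests fail, that is, when it has the right prefix behind it and the right suffix ahead of it.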

Here it is at regex101.

Edit

Reordered the tests to make it considerably more efficient (from more than 1600 to about 1100 steps).

+2

I can't say why, but I wanted to see if it could be done without lookarounds. This is what I came up with:

 (\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w 

Replacing the matches with $1$$2 should do the trick.

It has one limitation though (that I can think of): it cannot handle Joker as the only word on the line :( This is because of the logic behind it...

It matches the word Joker in two of the alternatives: either with a letter following it, or with a letter preceding it. In both cases, what separates the word from the letter is a run of non-letters (the spaces). The third alternative is a single letter; it is reached when neither of the first two matches, and so it matches letters that are not adjacent to Joker. In the first alternative, the word plus the adjacent spaces (non-letters) are captured in a group (Joker, then spaces). The same goes for the second alternative, but in reverse order (spaces, then Joker). The third alternative captures nothing; it just matches the letter.

Replacing the complete match with $1$$2 (note the literal $ in the middle) inserts the word Joker plus the spaces (if the first alternative matched), followed by a $. If the first did not match but the second did, the replacement is a $ followed by the captured spaces and then Joker. If neither of the first two matched, nothing is captured, and the only thing inserted is the single $ that replaces the matched letter.

See here at regex101.
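
The three alternatives can be watched at work with a short JavaScript sketch. The pattern is the one from this answer; the name maskAdjacent is hypothetical, and the replacement is written '$1$$$2' because JavaScript needs "$$" for a literal $.

```javascript
// Pattern from the answer; maskAdjacent is a made-up name.
var adjacentPattern = /(\bJoker[^\w]+)\w|\w([^\w]+Joker\b)|\w/g;

function maskAdjacent(line) {
  // Group 1 (Joker + separators) is put back before the "$",
  // group 2 (separators + Joker) is put back after it.
  return line.replace(adjacentPattern, '$1$$$2');
}

console.log(maskAdjacent('Joker Horse')); // first alternative fires on "Joker H"
// Joker $$$$$
console.log(maskAdjacent('Horse Joker')); // second alternative fires on "e Joker"
// $$$$$ Joker

// With no other word to lean on, the letter next to a Joker is taken from another Joker.
console.log(maskAdjacent('Joker Joker Joker') === 'Joker Joker Joker'); // false
```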

Edit:

I just noticed that Casimir et Hippolyte has a version at the end of his answer that is similar to mine. They are not identical, so I will leave my answer here ;)

+1
