Python regex to replace everything except certain words

I am trying to do the following with a regex:

    import re

    x = re.compile('[^(going)|^(you)]')  # words to keep (everything else should be replaced)
    s = 'I am going home now, thank you.'  # string to modify
    print(re.sub(x, '_', s))

The result is:

 '_____going__o___no______n__you_' 

The result I want is:

 '_____going_________________you_' 

Since ^ only negates inside brackets [], where it applies to single characters, this result makes sense, but I'm not sure how to get the output I want.

I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])', but it gave '_g_h___y_'.

+6
2 answers

This is not as easy as it seems at first glance, since regular expressions have no general "not" operator other than ^ inside [ ], which negates only a single character (as you found). Here is my solution:

    import re

    def subit(m):
        # group 1 is the text to mask, group 2 is the word to keep (or '' at the end)
        stuff, word = m.groups()
        return ("_" * len(stuff)) + word

    s = 'I am going home now, thank you.'  # string to modify
    print(re.sub(r'(.+?)(going|you|$)', subit, s))

gives:

 _____going_________________you_ 

To explain: the regex itself (I always use raw strings) matches one or more characters ( .+ ), but non-greedily ( ? ). This is captured by the first pair of parentheses. It is followed by either "going" or "you" or the end of the string ( $ ), captured by the second pair.
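
To see what the two groups capture, here is a small sketch that lists each match with re.finditer instead of replacing it. The variable name pairs is only for illustration.

```python
import re

s = 'I am going home now, thank you.'

# Collect (stuff, word) for every match of the pattern from the answer.
pairs = [m.groups() for m in re.finditer(r'(.+?)(going|you|$)', s)]

for stuff, word in pairs:
    # stuff is the run of text to mask, word is the keyword that ended it
    # (an empty string when the run was ended by the end of the string).
    print(repr(stuff), repr(word))
```

Each match is a run of arbitrary text followed by the keyword that stopped it; the last run is stopped by the end of the string, so its second group is empty.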

subit is a function (you can call it anything) that is called for each match. It receives the match object, from which we can get the captured groups. For the first group we only need the length, since we replace each character with an underscore. The string it returns replaces whatever the pattern matched.
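
The same callback idea can be wrapped in a reusable helper. The following is only a sketch under some assumptions: the name mask_except and its parameters are made up for illustration, the word list is assumed to be non-empty, and the pattern uses .*? rather than .+? so that a kept word at the very start of the string (or directly after another kept word) is still recognized.

```python
import re

def mask_except(text, keep_words, fill="_"):
    """Replace every character of text with fill, except occurrences of keep_words.

    Hypothetical helper; keep_words is assumed to be a non-empty list of strings.
    """
    # Escape the words so regex metacharacters in them are taken literally;
    # try longer words first so a word is preferred over its own prefix.
    ordered = sorted(keep_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # re.DOTALL lets . cross newlines, so line breaks are masked too.
    pattern = re.compile(r"(.*?)(" + alternatives + r"|\Z)", re.DOTALL)

    def subit(m):
        stuff, word = m.groups()
        return fill * len(stuff) + word

    return pattern.sub(subit, text)

print(mask_except('I am going home now, thank you.', ['going', 'you']))
print(mask_except('going home', ['going']))
```

Note that this matches the words as plain substrings, exactly like the pattern above, so "you" inside "your" would also be kept.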

+5

Here is one regex approach:

    >>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
    '_____going_________________you_'

The idea is that when you want to exclude whole words, you have to remember that most regex engines (most of them use a traditional NFA) process strings character by character. Since you want to exclude two words with a negative lookahead, you need to define the valid matches as words, using word boundaries. And since sub replaces each match with the replacement string, you cannot just pass '_' : in that case a part such as 'I am' would be replaced with 3 underscores (one each for 'I' , ' ' and 'am' ). Instead, you can pass a function to sub as the second argument and multiply '_' by the length of the matched string.
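
One thing to watch for: as written, the lookahead rejects any position where "going" or "you" starts, so a longer word such as "your" is left untouched as well. A possible variant, shown here only as a sketch, adds a word boundary inside the lookahead. The names loose, strict and mask are illustrative and not part of the answer.

```python
import re

def mask(m):
    # Replace the whole match with underscores of the same length.
    return '_' * (m.end() - m.start())

# Pattern from the answer: the lookahead also blocks words that merely start with a keyword.
loose = re.compile(r'(?!going|you)\b([\S\s]+?)(\b|$)')
# Variant: the keyword must end at a word boundary to be protected.
strict = re.compile(r'(?!(?:going|you)\b)\b([\S\s]+?)(\b|$)')

s = 'I am going home now, thank you.'
print(loose.sub(mask, s))
print(strict.sub(mask, s))

t = 'your you'
print(loose.sub(mask, t))   # 'your' survives because it starts with 'you'
print(strict.sub(mask, t))  # only the standalone 'you' survives
```

On the original sentence both patterns give the same result; they differ only when a kept word is a prefix of a longer word.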

+3
