Remove all occurrences of words from a Python list in a string

I'm trying to match and remove all words in a list from a string using a compiled regular expression, but I'm struggling to avoid matches inside other words.

Current:

 REMOVE_LIST = ["a", "an", "as", "at", ...]
 remove = '|'.join(REMOVE_LIST)
 regex = re.compile(r'('+remove+')', flags=re.IGNORECASE)
 out = regex.sub("", text)

In: "A quick brown fox jumped over ant"

Out: "quick brown fox jumped over t"

Expected: "quick brown fox jumped over ant"

I tried changing the compile line to the following, but to no avail:

  regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE) 

Any suggestions, or am I missing something glaringly obvious?

2 answers

One problem is that only the first \b is inside a raw string. The second is in a normal string literal, so it is interpreted as the backspace character (ASCII 8) rather than as a word boundary.
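
As a small sketch of the difference: in a normal string literal '\b' is a single backspace character, while in a raw string it stays as a backslash followed by b, which is what the re module needs in order to treat it as a word boundary. The variable names here are only for illustration.

```python
import re

plain = '\b'    # normal literal: one character, backspace (ASCII 8)
raw = r'\b'     # raw literal: two characters, backslash + 'b'

print(len(plain), ord(plain))   # 1 8
print(len(raw))                 # 2

# The backspace version only matches a literal backspace character,
# so it never finds "an" as a word in ordinary text.
print(re.search('an' + plain, 'an ant'))        # None
print(re.search('an' + raw, 'an ant').span())   # (0, 2)
```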

To fix, change

 regex = re.compile(r'\b('+remove+')\b', flags=re.IGNORECASE) 

to

 regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
                                  ^ THIS
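
For reference, here is a minimal end-to-end sketch of the corrected pattern. It makes a few assumptions that are not in the question: the list is shortened to four entries, re.escape is added in case an entry ever contains a regex metacharacter, and a hypothetical helper remove_words also collapses the whitespace left behind by the removed words.

```python
import re

# Shortened, illustrative list; the list in the question is longer.
REMOVE_LIST = ["a", "an", "as", "at"]

# re.escape guards against entries that contain regex metacharacters.
remove = '|'.join(re.escape(word) for word in REMOVE_LIST)

# Both \b markers sit inside raw strings, so both are word boundaries.
regex = re.compile(r'\b(?:' + remove + r')\b', flags=re.IGNORECASE)

def remove_words(text):
    """Delete whole-word matches, then tidy the leftover whitespace."""
    stripped = regex.sub("", text)
    return ' '.join(stripped.split())

print(remove_words("A quick brown fox jumped over ant"))
# quick brown fox jumped over ant
```

With the boundaries in place, "ant" is left alone because neither "a" nor "an" ends at a word boundary inside it.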

Here is a suggestion without using a regular expression that you might consider:

 >>> sentence = 'word1 word2 word3 word1 word2 word4'
 >>> remove_list = ['word1', 'word2']
 >>> word_list = sentence.split()
 >>> ' '.join([i for i in word_list if i not in remove_list])
 'word3 word4'
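
Note that this comparison is case-sensitive, whereas the regex in the question uses re.IGNORECASE. A possible variation, sketched here with the question's sentence and an illustrative subset of its list, lowercases each word before the lookup and uses a set for faster membership tests:

```python
sentence = 'A quick brown fox jumped over ant'
remove_set = {'a', 'an', 'as', 'at'}   # illustrative subset of the list

# Compare in lowercase so that "A" is removed as well as "a".
kept = [word for word in sentence.split() if word.lower() not in remove_set]
result = ' '.join(kept)
print(result)   # quick brown fox jumped over ant
```

Because the sentence is split on whitespace only, a word with punctuation attached (for example "a,") would not match and would be kept.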