Removing all articles, connector words, etc. from a string in Python

I have a list containing many sentences. I want to iterate over the list and remove words like "and", "a", "is", etc. from every sentence.

I tried this:

def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for i, j in articles.items():
        text = text.replace(i, j)
    return text

As you can probably tell, this will also remove "a" and "an" when they appear in the middle of a word. I need to remove these words only when they are delimited by whitespace, not when they are inside another word. What is the most efficient way to do this?

+5
5 answers

I would use a regex, something like:

import re

def removearticles(text):
    return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)

or if you also want to remove leading spaces:

def removearticles(text):
    return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
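
Note that both patterns above require whitespace on each side of the word, so an article at the very start or end of the string, or the second of two adjacent articles, is left in place. A minimal sketch of a word-boundary variant follows, assuming case-insensitive matching is wanted; the names `ARTICLES` and `remove_articles` are illustrative and not part of the answer above.

```python
import re

# Hypothetical pattern: whole words only (\b), plus any whitespace that follows.
ARTICLES = re.compile(r'\b(?:a|an|and|the)\b\s*', re.IGNORECASE)

def remove_articles(text):
    # Drop each article together with its trailing whitespace,
    # then trim whatever is left at the ends of the string.
    return ARTICLES.sub('', text).strip()

print(remove_articles('The cat and a dog saw an animal'))
# -> cat dog saw animal
```

Because `\b` also matches next to punctuation, this treats "a" in "a's" as a word; whether that matters depends on the input.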
+6

Since this is an NLP task, take a look at NLTK (http://www.nltk.org/). IIRC, it comes with a list of stopwords that covers words like these.

+3

Try something along these lines:

def removearticles(text):
    articles = ['and', 'a']
    newText = ''
    for word in text.split(' '):
        if word not in articles:
            newText += word + ' '
    # drop the trailing space added after the last word
    return newText[:-1]
+1
def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    rest = []
    for word in text.split():
        if word not in articles:
            rest.append(word)
    return ' '.join(rest)

The `in` test on a dict is faster than on a list.
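
Since the question mentions a list of sentences, here is a rough sketch of the same split-and-filter idea applied to each one. It assumes a set instead of a dict (same hash-based lookup, no dummy values) and case-insensitive matching; `STOP_WORDS`, `remove_articles`, and the sample sentences are made up for illustration.

```python
# Hypothetical stop-word set; membership tests are hash-based, as with a dict.
STOP_WORDS = {'a', 'an', 'and', 'the'}

def remove_articles(text):
    # split() with no argument splits on any run of whitespace,
    # so the rejoined result has single spaces between words.
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

sentences = ['The quick fox and a lazy dog', 'An apple a day']
cleaned = [remove_articles(s) for s in sentences]
print(cleaned)
# -> ['quick fox lazy dog', 'apple day']
```

A word with punctuation attached, such as "the,", is not in the set and would be kept; a regex with word boundaries handles that case better.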

+1

This can be done with a regex. Iterate through your strings (or ''.join the list and pass it as a single string) and apply the following regular expression.

>>> import re
>>> rx = re.compile(r'\ban\b|\bthe\b|\band\b|\ba\b')
>>> rx.sub(' ','a line with lots of an the and a baad')
'  line with lots of         baad'
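
The substitution above leaves runs of spaces where the words used to be. One possible way to tidy that up, sketched here with a hypothetical helper name `strip_articles`, is to split the result on whitespace and rejoin it:

```python
import re

# Same whole-word alternatives as above, written with a single pair of \b.
rx = re.compile(r'\b(?:an|the|and|a)\b')

def strip_articles(line):
    # Replace each whole-word match with a space, then collapse the
    # leftover whitespace by splitting and rejoining.
    return ' '.join(rx.sub(' ', line).split())

print(strip_articles('a line with lots of an the and a baad'))
# -> line with lots of baad
```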
0
