Remove all articles, connector words, etc. from a string in Python

Question

I have a list containing many sentences. I want to iterate over the list, removing words like "and", "the", "a", "is", etc. from all of the sentences.

I tried this:

def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    for i, j in articles.items():
        text = text.replace(i, j)
    return text

As you can probably tell, this also removes "a" and "an" when they appear in the middle of a word. I need to remove only whole words that are delimited by whitespace, not occurrences inside another word. What is the most efficient way to do this?

+5

python string

Parseltongue Jan 17 '11 at 3:05

5 answers

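This looks more like an NLP task than plain string replacement. Take a look at NLTK (http://www.nltk.org/). IIRC, it ships with a list of common filler words (stop words) like the ones you want to remove.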
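A minimal sketch of the stop-word idea that does not require NLTK itself. The STOP_WORDS set and the remove_stop_words name are illustrative assumptions, not part of any library; with NLTK installed and its stopwords corpus downloaded, nltk.corpus.stopwords.words('english') could supply a much fuller list instead.

```python
import string

# Hypothetical hand-written stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop whole words found in stop_words, ignoring case and edge punctuation."""
    kept = []
    for word in sentence.split():
        # Compare a normalized form, but keep the original token in the output.
        normalized = word.strip(string.punctuation).lower()
        if normalized not in stop_words:
            kept.append(word)
    return " ".join(kept)

sentences = ["The cat and a dog", "An apple is a banana and the theme"]
cleaned = [remove_stop_words(s) for s in sentences]
print(cleaned)
```

Because the comparison is done per token, words such as "banana" and "theme" are left alone even though they contain "a", "an" or "the".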

+3

waffle paradox Jan 17 '11 at 3:41

Try something along the lines of

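def removearticles(text):
    articles = ['and', 'a']
    newText = ''
    for word in text.split(' '):
        # keep only whole words that are not in the list
        if word not in articles:
            newText += word + ' '
    return newText[:-1]  # drop the trailing space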

+1

erbridge Jan 17 '11 at 3:20

def removearticles(text):
    articles = {'a': '', 'an':'', 'and':'', 'the':''}
    rest = []
    for word in text.split():
        if word not in articles:
            rest.append(word)
    return ' '.join(rest)

The "in" test runs faster on a dict than on a list.
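A rough way to check that claim is to time the membership test directly. This is only a sketch: the 1000-entry word list is made up for the measurement, and the actual numbers depend on the machine and Python version. For a four-word collection like the one above the difference is negligible; it matters once the word list grows, because a list is scanned element by element while a dict or set uses a hash lookup.

```python
import timeit

# Hypothetical large word list, used only to make the difference measurable.
words = ["word%d" % i for i in range(1000)]
as_list = list(words)
as_set = set(words)
as_dict = dict.fromkeys(words, "")

probe = "word999"  # last element: the worst case for the list scan

list_time = timeit.timeit(lambda: probe in as_list, number=2000)
set_time = timeit.timeit(lambda: probe in as_set, number=2000)
dict_time = timeit.timeit(lambda: probe in as_dict, number=2000)

print("list: %.6fs  set: %.6fs  dict: %.6fs" % (list_time, set_time, dict_time))
```

A set expresses the intent more directly than a dict with empty values and should perform about the same.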

+1

xiaowl Jan 17 '11 at 3:38

This can be done using a regex. Iterate through your strings (or ''.join the list and pass it as a single string) and apply the following regular expression.

>>> import re
>>> rx = re.compile(r'\ban\b|\bthe\b|\band\b|\ba\b')
>>> rx.sub(' ','a line with lots of an the and a baad')
'  line with lots of         baad'
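The substitution above leaves runs of spaces where the words used to be. A possible variation, sketched here with an assumed word list and a hypothetical remove_words helper, builds the same kind of \b pattern from a list and collapses the leftover whitespace afterwards.

```python
import re

words = ["a", "an", "and", "the"]  # assumed word list, taken from the question

# \b(?:...)\b matches any listed word only when it stands as a whole word.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

def remove_words(text):
    stripped = pattern.sub("", text)
    # Collapse the runs of whitespace left behind and trim both ends.
    return " ".join(stripped.split())

print(remove_words("a line with lots of an the and a baad"))
```

Note that \b also treats hyphens and apostrophes as word boundaries, so the "a" in "a-frame" would be removed as well; whether that is acceptable depends on the text.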

0

Senthil kumaran Jan 17 '11 at 3:25

Nemo157 · Accepted Answer · 2011-01-17T03:19:29+0000

I would go for a regex, something like:

import re
def removearticles(text):
  # raw strings keep \s and the \1, \3 backreferences intact
  return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)

or if you also want to remove leading spaces:

import re
def removearticles(text):
  return re.sub(r'\s+(a|an|and|the)(\s+)', r'\2', text)
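One thing to watch with patterns that consume whitespace on both sides: when two listed words are adjacent, the first match uses up the space the second one needs, so every other word can be skipped. Words at the very start or end of the string are not matched either. The sketch below contrasts that behaviour with a lookaround-based alternative; the names consuming_remove and remove_articles are hypothetical, and the final whitespace collapse is an added assumption about the desired output.

```python
import re

def consuming_remove(text):
    # Same shape as the pattern above: whitespace is part of each match.
    return re.sub(r'(\s+)(a|an|and|the)(\s+)', r'\1\3', text)

# Lookarounds assert whitespace (or a string edge) on both sides without
# consuming it, so adjacent matches cannot take each other's delimiter.
ARTICLE_RE = re.compile(r'(?<!\S)(?:a|an|and|the)(?!\S)')

def remove_articles(text):
    without = ARTICLE_RE.sub('', text)
    return ' '.join(without.split())  # tidy up leftover whitespace

sample = "lots of an the and a baad"
print(repr(consuming_remove(sample)))
print(repr(remove_articles(sample)))
```

With the sample string, the consuming version leaves "the" and "a" in place, while the lookaround version removes all four words.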
