Python: cut a string after the Xth sentence

Question

I need to cut a unicode string that is actually an article (it contains sentences). I want to cut this article string after the Xth sentence in Python.

A good indicator of the end of a sentence is that it ends with a full stop (".") and the word after it starts with a capital letter. For instance:

 myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can this be achieved?

Thanks.

+4

python string

Hellnar Aug 05 '10 at 6:42

4 answers

Here is a more reliable solution:

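 import re
 myarticle = """This is a sentence. And another one. And a 3rd one."""
 N = 3  # 3 sentences
 # Split on a full stop followed by a capital letter or by the end of the text
 pieces = re.split(r'\.(?=\s*(?:[A-Z]|$))', myarticle, maxsplit=N)
 print(''.join(sentence + '.' for sentence in pieces[:-1]))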

This solution has several advantages over some of the other solutions posted here:

It works even if there are exactly N sentences in the text. Some other answers produce a double "." at the end in that case. This is avoided by taking into account that the last sentence is followed not by a capital letter but by the end of the text ( $ ).
It works even if the text contains fewer than N sentences.
The number of splits is capped by the maxsplit argument to re.split() , so the rest of the text is not split once N sentences have been found, which keeps it efficient.
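To make the three points above easier to check, here is a minimal sketch that wraps the same regular expression in a function. The name cut_after_n is only an illustration, not part of the original answer, and the sketch assumes sentences end with "." and start with an ASCII capital letter.

```python
import re

# A full stop followed by optional whitespace and then either a capital
# letter or the end of the text.
SENTENCE_END = re.compile(r'\.(?=\s*(?:[A-Z]|$))')

def cut_after_n(text, n):
    """Keep at most the first n sentences of text (illustrative helper)."""
    parts = SENTENCE_END.split(text, maxsplit=n)
    # The last element is either the unused remainder or an empty string,
    # so it is dropped; every kept piece gets its full stop back.
    return ''.join(part + '.' for part in parts[:-1])

article = "This is a sentence. And another one. And a 3rd one."
print(cut_after_n(article, 2))   # text holds more than 2 sentences
print(cut_after_n(article, 3))   # text holds exactly 3 sentences
print(cut_after_n(article, 10))  # text holds fewer than 10 sentences
```

One caveat: if the text does not end with a full stop and holds fewer than n sentences, the trailing fragment is dropped along with the last element.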

Hope this helps!

+2

Eol Aug 05 '10 at 7:58

If there may be punctuation marks other than the usual ".", you should probably try the following:

 re.split(r'\W(?=[A-Z])', ss)

This returns a list of sentences. Of course, it does not handle the cases mentioned by Paul.
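A small sketch of how this behaves, with made-up sample text and illustrative function names. Note that \W matches the whitespace in front of the capital letter, not the punctuation mark itself, so "?" and "!" stay attached to their sentences. The sketch assumes sentences are separated by a single space.

```python
import re

def split_on_capitals(text):
    """Split at any non-word character that is directly followed by A-Z."""
    return re.split(r'\W(?=[A-Z])', text)

def cut_after_n(text, n):
    """Rejoin the first n pieces; assumes the removed separator was a space."""
    return ' '.join(split_on_capitals(text)[:n])

text = "Is this one? Yes! And this is another."
print(split_on_capitals(text))
print(cut_after_n(text, 2))

# Caveat: a capitalized word in the middle of a sentence also splits.
print(split_on_capitals("I met Paul today."))
```

The last call shows the weak spot: any capitalized name inside a sentence is treated as a sentence start.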

+1

xmoleslo Aug 05 '10 at 7:38

Try the following:

 import re
 '.'.join(re.split(r'\.(?=\s*[A-Z])', myarticle)[:2]) + '.'

It cuts your string after the second sentence ([:2]).

However, there are some problems (as always when dealing with natural language). First of all, it only recognizes a sentence that starts with a letter in "A-Z". This may be fine for English, but not for other languages.
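One possible way around the "A-Z" limitation, as a rough sketch only: instead of a character class, check the character after each full stop with str.isupper(), which is Unicode-aware. This assumes Python 3 strings; the function name and the German sample text are made up for illustration.

```python
import re

# A full stop, then (in a lookahead) optional whitespace and either the
# next non-space character or the end of the text. The group is captured
# so it can be inspected without being consumed.
FULL_STOP = re.compile(r'\.(?=\s*(\S|$))')

def cut_after_n(text, n):
    """Return text up to and including the n-th sentence-ending full stop."""
    count = 0
    for match in FULL_STOP.finditer(text):
        following = match.group(1)
        # An empty group means the full stop is the last thing in the text.
        if following == '' or following.isupper():
            count += 1
            if count == n:
                return text[:match.end()]
    return text  # fewer than n sentences: return everything

german = "Das ist gut. Ärzte helfen. Über alles. Ende."
print(cut_after_n(german, 2))
```

This still shares the other weaknesses of the full-stop heuristic, for example abbreviations followed by a capitalized word.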

0

Felix Schwarz Aug 05 '10 at 7:01

Tim McNamara · Accepted Answer · 2010-08-05T07:11:12+0000

Consider downloading the Natural Language Toolkit ( NLTK ). Then you can create sentences that won't break on things like "U.S.A." or fail to split sentences that end in "?!".

 >>> import nltk
 >>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
 >>> sentences = nltk.sent_tokenize(paragraph)
 >>> sentences
 [u"Hi, this is my first sentence.", u"And this is my second.", u"Yet this is my third."]

Your code becomes much more readable. To access the second sentence, you use the notation you are used to.

 >>> sentences[1]
 u"And this is my second."
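To actually cut the article after the Xth sentence, the list can be sliced and joined again. The sketch below is an assumption about how one might do that, not part of the original answer: first_sentences and its tokenize parameter are hypothetical names, joining with a single space discards the original spacing, and the default path needs NLTK plus its Punkt tokenizer data installed.

```python
def first_sentences(text, n, tokenize=None):
    """Return the first n sentences of text, joined with single spaces.

    tokenize is any callable that maps a string to a list of sentences.
    When omitted, nltk.sent_tokenize is used (requires NLTK and its
    Punkt tokenizer data).
    """
    if tokenize is None:
        import nltk  # imported lazily so a custom tokenizer works without NLTK
        tokenize = nltk.sent_tokenize
    return ' '.join(tokenize(text)[:n])

# With NLTK installed, the default tokenizer is used:
#     first_sentences(paragraph, 2)

# A naive stand-in tokenizer, only to show the calling convention:
def naive_tokenize(text):
    return [s.strip() + '.' for s in text.split('.') if s.strip()]

paragraph = "Hi, this is my first sentence. And this is my second. Yet this is my third."
print(first_sentences(paragraph, 2, tokenize=naive_tokenize))
```

Passing the tokenizer in as an argument keeps the slicing logic independent of which sentence splitter is used.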
