Python: cut a string after the Xth sentence

I need to truncate a unicode string that is actually an article (it contains sentences). I want to cut the article off after the Xth sentence in Python.

A good indicator of the end of a sentence is that it ends with a full stop (".") and the word after it starts with a capital letter. For instance,

myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can this be achieved?

thanks

+4
4 answers

Consider downloading the Natural Language Toolkit (NLTK). Its sentence tokenizer won't break on abbreviations such as "U.S.A." and won't fail to split sentences that end in "?!".

 >>> import nltk
 >>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
 >>> sentences = nltk.sent_tokenize(paragraph)
 >>> sentences
 [u"Hi, this is my first sentence.", u"And this is my second.", u"Yet this is my third."]

Your code becomes much more readable. To access the second sentence, you use the notation you are used to.

 >>> sentences[1]
 u"And this is my second."
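
To actually cut the article after the Xth sentence, you can slice the list and join it back together. Below is a minimal sketch, not a definitive implementation. The names first_sentences and naive_tokenizer are made up for illustration. The tokenizer argument is assumed to be any callable that returns a list of sentences, such as nltk.sent_tokenize. The regex stand-in is only there so the snippet runs without NLTK installed.

```python
import re

def first_sentences(text, n, tokenizer):
    """Return the first n sentences of text, rejoined with single spaces.

    tokenizer is any callable mapping a string to a list of sentences,
    for example nltk.sent_tokenize.
    """
    return ' '.join(tokenizer(text)[:n])

def naive_tokenizer(text):
    # Hypothetical stand-in for nltk.sent_tokenize:
    # split on whitespace that follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text)

paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
print(first_sentences(paragraph, 2, naive_tokenizer))
# Hi, this is my first sentence. And this is my second.
```

Joining with a single space is an assumption. If the original whitespace between sentences matters, it is lost here.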
+15

Here is a more reliable solution:

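 import re

 myarticle = """This is a sentence. And another one. And a 3rd one."""
 N = 3  # number of sentences to keep
 # Split on a full stop that is followed by a capital letter or the end of the text.
 parts = re.split(r'\.(?=\s*(?:[A-Z]|$))', myarticle, maxsplit=N)
 print(''.join(sentence + '.' for sentence in parts[:-1]))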

This solution has several advantages over some of the other answers given earlier:

  • It works even if there are exactly N sentences in the text. Some other answers produce a doubled "." at the end. This is avoided by allowing the last sentence to be followed not by a capital letter but by the end of the text ( $ ).

  • It works even if the text contains fewer than N sentences.

  • The number of splits is limited by the maxsplit argument to re.split(), so the regex stops after N splits instead of processing the whole article, which makes it efficient.
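
The points above can be checked with a small wrapper. This is only a sketch: cut_after is a hypothetical name, and the character class is written as [A-Z]. One thing to watch is that maxsplit=0 means "no limit" in re.split(), so a non-positive n is handled separately here.

```python
import re

# Full stop followed by optional whitespace and a capital letter or end of text.
_SENTENCE_END = re.compile(r'\.(?=\s*(?:[A-Z]|$))')

def cut_after(text, n):
    """Return the first n sentences of text (all of them if there are fewer)."""
    if n <= 0:
        # maxsplit=0 would mean "split everywhere", so return nothing instead.
        return ''
    parts = _SENTENCE_END.split(text, maxsplit=n)
    # The last element is the remainder after the nth split (or '' at the end).
    return ''.join(part + '.' for part in parts[:-1])

myarticle = "This is a sentence. And another one. And a 3rd one."
print(cut_after(myarticle, 2))   # more than N sentences in the text
print(cut_after(myarticle, 3))   # exactly N sentences
print(cut_after(myarticle, 10))  # fewer than N sentences
```

Note that a text that does not end in a full stop loses its last fragment, because everything after the final split is dropped.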

Hope this helps!

+2

If there may be punctuation marks other than the usual ".", you should probably try the following:

 re.split(r'\W(?=[A-Z])', ss)

This returns a list of sentences. Of course, it does not handle the cases mentioned by Paul.
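
To see what this split does in practice, here is a small sketch (split_before_capital is a made-up name, and the pattern is written with [A-Z]). It handles "?" and "!" nicely, but it also splits before every capitalized word inside a sentence, such as a name.

```python
import re

def split_before_capital(text):
    # Split at any non-word character that is directly followed by A-Z.
    return re.split(r'\W(?=[A-Z])', text)

print(split_before_capital("Is this the first? Yes! And this is the third."))
# ['Is this the first?', 'Yes!', 'And this is the third.']

print(split_before_capital("We met Paul in Rome. Then we left."))
# ['We met', 'Paul in', 'Rome.', 'Then we left.']
```

The character that gets consumed by the split is the whitespace before the capital letter, so ' '.join(parts[:X]) gives the text up to the Xth piece.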

+1

Try the following:

 '.'.join(re.split(r'\.(?=\s*[A-Z])', myarticle)[:2]) + '.'

It cuts your string after the second sentence ([:2]).

However, there are some problems (as always when you are dealing with natural language). First of all, it only recognizes sentences that start with a capital letter in the range A-Z. That may be fine for English, but not for other languages.
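
One possible way around the A-Z limitation is to check the next character with str.isupper(), which also accepts uppercase letters outside ASCII. This is just a sketch under that assumption, with a hypothetical function name, and it still shares the other weaknesses of the regex approach (abbreviations, for example).

```python
import re

def cut_after_unicode(text, n):
    """Keep the first n sentences. A sentence ends at a '.' followed by
    optional whitespace and an uppercase letter, or by the end of the text."""
    if n <= 0:
        return ''
    count = 0
    for match in re.finditer(r'\.\s*', text):
        rest = text[match.end():]
        if not rest or rest[0].isupper():
            count += 1
            if count == n:
                # Keep everything up to and including this full stop.
                return text[:match.start() + 1]
    # Fewer than n sentence ends were found: return the text unchanged.
    return text

print(cut_after_unicode(u"Élan vital. Ça marche. Über alles.", 2))
# Élan vital. Ça marche.
```

With the [A-Z] pattern, the same text is not split at all, because none of the sentences starts with an ASCII capital.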

0
