Regular expression for counting sentences in a block of text

Possible duplicate:
PHP - How to break a paragraph into sentences.

I have a block of text that I would like to split into sentences. What would be the best way to do this? I thought of searching for the ".", "!" and "?" characters, but I realized this causes problems, for example when people use acronyms or end a sentence with something like "!?". What would be the best way to handle this? I assumed some kind of regular expression could do it, but I am open to a solution without regular expressions if that suits the problem better.

+6
php regex nlp
3 answers

Regex is not the best solution to this problem. You will be better served by building a parser, something that lets you easily define logic blocks that distinguish one construct from another. You need to come up with a set of rules that break the text into the pieces you want to see.

"Are you sure?" he asked. 

Would a regular expression handle this correctly? With a parser, however, you can actually see

 <start quote><capitalization>are you sure<question><end quote>he asked<period> 

so that with simple rules one could say "this is one sentence."
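As a rough sketch of that idea in PHP, the text can first be turned into tokens and a rule then applied to the token list. The token names loosely mirror the stream in this answer; the functions `tokenize` and `countSentences` are hypothetical names, the capitalization token is left out, and the single rule used here (a terminator inside quotes does not end the sentence) is an assumption for illustration, not a complete grammar.

```php
<?php
// Illustrative tokenizer: splits text into quote, terminator and text tokens.
function tokenize(string $text): array
{
    $tokens = [];
    $inQuote = false;
    preg_match_all('/"|[.!?]|[^".!?]+/', $text, $matches);
    foreach ($matches[0] as $piece) {
        if ($piece === '"') {
            $tokens[] = $inQuote ? 'end quote' : 'start quote';
            $inQuote = !$inQuote;
        } elseif ($piece === '.') {
            $tokens[] = 'period';
        } elseif ($piece === '?') {
            $tokens[] = 'question';
        } elseif ($piece === '!') {
            $tokens[] = 'exclamation';
        } elseif (trim($piece) !== '') {
            $tokens[] = 'text:' . trim($piece);
        }
    }
    return $tokens;
}

// Assumed rule: a terminator closes a sentence only outside quotes;
// text still open at the end of input counts as one more sentence.
function countSentences(array $tokens): int
{
    $count = 0;
    $inQuote = false;
    $open = false;
    foreach ($tokens as $token) {
        if ($token === 'start quote') {
            $inQuote = true;
            $open = true;
        } elseif ($token === 'end quote') {
            $inQuote = false;
        } elseif (in_array($token, ['period', 'question', 'exclamation'], true)) {
            if (!$inQuote && $open) {
                $count++;
                $open = false;
            }
        } else {
            $open = true;
        }
    }
    return $open ? $count + 1 : $count;
}

echo countSentences(tokenize('"Are you sure?" he asked.')), "\n"; // prints 1
```

The rule is deliberately simplistic: a quoted sentence followed directly by a new sentence would be merged into one, so a real rule set would need more cases.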

+2

Unfortunately, there is no ideal solution, for the reasons you mentioned. If you control the content in any way, or can enforce a specific delimiter after each sentence, that would be ideal. Otherwise, all you can really do is look for (\.|!|\?)+ and maybe require a \s after it, since most people separate one sentence from the next with one or two spaces.
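A minimal sketch of this approach in PHP, assuming a sentence ends at any run of ".", "!" or "?" that is followed by whitespace or the end of the text. The function name `countSentencesNaive` is made up for this example.

```php
<?php
// Naive count: split on runs of terminators followed by whitespace or end of text.
function countSentencesNaive(string $text): int
{
    $parts = preg_split('/[.!?]+(?:\s+|$)/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($parts);
}

echo countSentencesNaive('Is this real!? Yes. It is.'), "\n"; // prints 3
echo countSentencesNaive('Prof. Knuth wrote it.'), "\n";      // prints 2, although it is one sentence
```

The second call shows the weakness described in the question: an abbreviation followed by a space is counted as a sentence end.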

+1

I think the biggest problem is the possible presence of abbreviations. That is why you have to write something like Prof.&nbsp;Knuth in a JavaDoc summary sentence, so that the javadoc generator does not think the first sentence ends after "Prof.". It is a real problem, and I don't know how anyone could handle it reliably. The only approximate solution I can imagine is to use a dictionary of abbreviations.
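A sketch of the dictionary idea in PHP could look like the following. The `ABBREVIATIONS` list is a tiny hypothetical sample and `splitSentences` is a made-up name; a usable list would be much longer and language-specific.

```php
<?php
// Hypothetical, deliberately tiny abbreviation list (lowercase, without the final period).
const ABBREVIATIONS = ['prof', 'dr', 'mr', 'mrs', 'e.g', 'i.e', 'etc'];

function splitSentences(string $text): array
{
    // Candidate boundaries: whitespace that follows ".", "!" or "?".
    $chunks = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $sentences = [];
    $buffer = '';
    foreach ($chunks as $chunk) {
        $buffer = $buffer === '' ? $chunk : $buffer . ' ' . $chunk;
        // If the buffer ends in a known abbreviation, the boundary is not
        // a sentence end, so keep accumulating into the same sentence.
        if (preg_match('/(\S+)\.$/', $buffer, $m)
            && in_array(strtolower($m[1]), ABBREVIATIONS, true)) {
            continue;
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }
    return $sentences;
}

print_r(splitSentences('Prof. Knuth wrote it. Are you sure!? Yes.'));
// Three sentences: "Prof. Knuth wrote it.", "Are you sure!?", "Yes."
```

This remains approximate: a sentence that really does end with a listed abbreviation such as "etc." is merged with the one that follows it.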

0
