Splitting a paragraph into separate sentences. Am I covering all my bases here?

I am trying to split a string containing multiple sentences into an array of strings, one per sentence.

This is what I have so far:

    String input = "Hello World. "
            + "Today in the USA, it is a nice day! "
            + "Hurrah! "
            + "Here it comes... "
            + "Party time!";
    String[] array = input.split("(?<=[.?!])\\s+(?=[\\D\\d])");

This code works fine. I get:

    Hello World.
    Today in the USA, it is a nice day!
    Hurrah!
    Here it comes...
    Party time!

I use a lookbehind to check whether sentence-ending punctuation (., ? or !) comes right before one or more whitespace characters. If so, the string is split at that whitespace.

But there are some cases that this regular expression does not handle. For example, "The U.S. is a great country" is incorrectly split into "The U.S." and "is a great country", because the period that ends the abbreviation is followed by whitespace.
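To make the failure concrete, here is a minimal sketch that runs the same pattern on the problem sentence. It assumes the abbreviation is written with periods ("U.S."), since only a period followed by whitespace can trigger the split; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class RegexSplitDemo {
    // Same pattern as in the question: split on whitespace that follows '.', '?' or '!'.
    static String[] splitSentences(String text) {
        return text.split("(?<=[.?!])\\s+(?=[\\D\\d])");
    }

    public static void main(String[] args) {
        // Assumption: the abbreviation is spelled "U.S." with periods.
        // The period before the space satisfies the lookbehind, so the split happens mid-sentence.
        String[] parts = splitSentences("The U.S. is a great country.");
        System.out.println(Arrays.toString(parts));
        // prints [The U.S., is a great country.]
    }
}
```

The pattern has no way to tell an abbreviation from the end of a sentence, so any period followed by whitespace is treated as a boundary.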

Any idea on how I can fix this?

Also, are there any other edge cases I am missing?

1 answer

If you don't need to use regex, you can use Java's built-in BreakIterator.

The following code shows an example of splitting text into sentences, but BreakIterator also supports other kinds of boundaries (word, line, etc.). You can also pass in a different locale if you are working with other languages. This example uses the default locale.
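As a rough sketch of those two options, the snippet below uses a word iterator and passes an explicit locale. The class name `WordBreakDemo` and the helper `words` are hypothetical; `BreakIterator.getWordInstance(Locale)` is the standard factory method, and `BreakIterator.getSentenceInstance(Locale)` can be used the same way for sentences.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakDemo {
    // Hypothetical helper: returns the words of `text` using the rules of the given locale.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // The word iterator also reports whitespace and punctuation as segments; skip those.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Party time!", Locale.US));
    }
}
```

The loop has the same shape as the sentence example; only the factory method changes.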

    import java.text.BreakIterator;

    String input = "Hello World. "
            + "Today in the USA, it is a nice day! "
            + "Hurrah! "
            + "The US is a great country. "
            + "Here it comes... "
            + "Party time!";

    // Sentence iterator for the default locale
    BreakIterator iterator = BreakIterator.getSentenceInstance();
    iterator.setText(input);

    // Each (start, end) pair of boundaries delimits one sentence
    int start = iterator.first();
    for (int end = iterator.next();
            end != BreakIterator.DONE;
            start = end, end = iterator.next()) {
        System.out.println(input.substring(start, end));
    }

The result is the following:

    Hello World.
    Today in the USA, it is a nice day!
    Hurrah!
    The US is a great country.
    Here it comes...
    Party time!
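Since the question asks for an array rather than printed output, the same loop can be wrapped in a small helper. This is only a sketch: `SentenceSplitter` and `split` are made-up names, and trimming the trailing whitespace of each sentence is my own choice, not something BreakIterator does for you.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Hypothetical helper: splits `text` into sentences using the given locale's rules.
    static String[] split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each segment keeps its trailing whitespace, so trim it off here.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences.toArray(new String[0]);
    }

    // Convenience overload that uses the default locale, as in the example above.
    static String[] split(String text) {
        return split(text, Locale.getDefault());
    }

    public static void main(String[] args) {
        for (String s : split("Hello World. Party time!", Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

Collecting into a list first and converting at the end avoids having to know the number of sentences up front.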