Does anyone know of a Java library that detects sentence boundaries? I am thinking of something like a smart StringTokenizer implementation that knows about all the sentence terminators that different languages can use.
Here is my experience with BreakIterator:
Using an example here, I have the following Japanese text:
今日はパソコンを買った。高性能のマックは早い！とても快適です。
Written as ASCII with Unicode escapes, it looks like this:
\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002
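For anyone who wants to reproduce the escaped form, here is a minimal sketch of how such a dump can be produced. The class and method names (`UnicodeEscapes`, `escape`) are my own for illustration and do not come from any library. Dumping the string this way makes every character visible, including ones that do not show up when the text is rendered.

```java
class UnicodeEscapes {
    // Hypothetical helper: keeps ASCII characters as they are and renders
    // every other UTF-16 code unit as a four-digit hexadecimal escape.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First three characters of the sample text, written as escapes
        // so the source file does not depend on the compiler's encoding.
        System.out.println(escape("\u4eca\u65e5\u306f"));
    }
}
```

Comparing `escape(someText).length()` against what is visible on screen is a quick way to check whether the string holds more characters than it appears to.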
Here is the part of this sample that I changed:

    static void sentenceExamples() {
        Locale currentLocale = new Locale("ja", "JP");
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
        String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
When I look at the boundary indices, I see the following:
0|13|24|32
But these indices do not line up with the positions of the sentence terminators.
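To make the comparison easier to check, here is a minimal, self-contained sketch that lists every boundary together with the text each span covers. The class and helper names (`SentenceBoundaries`, `boundaries`) are hypothetical, the string is copied from the escaped form above, and `Locale.JAPAN` is used as an assumed stand-in for `new Locale("ja", "JP")`. The exact indices printed may depend on the JDK's break rules, so none are shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaries {
    // Hypothetical helper: collects every boundary index the sentence
    // iterator reports, from first() until next() returns DONE.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // The sample text, written with the same escapes as listed above.
        String someText = "\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092"
                + "\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af"
                + "\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";

        List<Integer> bounds = boundaries(someText, Locale.JAPAN);
        System.out.println(bounds);

        // Print each span next to its index range so the boundaries can be
        // compared against the terminators by eye.
        for (int i = 1; i < bounds.size(); i++) {
            int start = bounds.get(i - 1);
            int end = bounds.get(i);
            System.out.println(start + ".." + end + ": " + someText.substring(start, end));
        }
    }
}
```

Each index is an offset into the Java string in UTF-16 code units, so every character in the string counts toward it, whether or not it is visible when printed.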
java string text-segmentation nlp
Mike sickler