Java library that finds sentence boundaries

Question

Does anyone know of a Java library that detects sentence boundaries? I think this would be a smart StringTokenizer implementation that knows about all sentence terminators that languages can use.

Here is my experience with BreakIterator:

Using the example here, I have the following Japanese text:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

In ASCII with Unicode escapes, it looks like this:

 \ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here is the part of this sample that I changed:

  static void sentenceExamples() {
      Locale currentLocale = new Locale("ja", "JP");
      BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
      String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";

When I print the boundary indices, I see the following:

 0|13|24|32

But these indices do not seem to match the sentence terminators.

+7

java string text-segmentation nlp

Mike sickler Jan 27 '09 at 13:13

2 answers

You want to look into the internationalized BreakIterator classes. They are a good starting point for sentence boundaries.
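
A minimal sketch of that approach, using only java.text.BreakIterator from the JDK. The names SentenceBoundaryDemo, boundaries, and stripBom are illustrative and not part of the question's sample. It may also explain the indices in the question: the escaped string starts with \ufeff, a byte order mark, and the three sentences are 12, 11, and 8 characters long, so 1 + 12 = 13, 13 + 11 = 24, and 24 + 8 = 32. In other words, the reported boundaries do appear to fall right after each terminator once the BOM is counted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    // Collects every boundary offset reported by the sentence BreakIterator,
    // including the start (0) and the end (text.length()).
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    // Drops a leading byte order mark (U+FEFF), if present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // The escaped string from the question, including the leading BOM.
        String raw = "\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";

        System.out.println("with BOM:    " + boundaries(raw, Locale.JAPAN));
        System.out.println("without BOM: " + boundaries(stripBom(raw), Locale.JAPAN));

        // Print each detected sentence of the cleaned text.
        String text = stripBom(raw);
        List<Integer> offsets = boundaries(text, Locale.JAPAN);
        for (int i = 0; i + 1 < offsets.size(); i++) {
            System.out.println(text.substring(offsets.get(i), offsets.get(i + 1)));
        }
    }
}
```

Stripping the BOM before calling setText keeps the offsets aligned with the visible characters, which makes them easier to compare against the terminators.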

+4

Garyf Jan 27 '09 at 13:16

Fabian steeg · Accepted Answer · 2009-01-27T16:13:23+0000

You wrote:

I think this would be a smart StringTokenizer implementation that knows about all sentence terminators that languages can use.

The main problem here is that sentence terminators depend on context. Consider:

How did Dr. Jones figure out 5! without recursion?

This should be recognized as one sentence, but if you simply split at every possible sentence terminator, you will get three sentences.
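
To make that concrete, here is a small sketch of such a naive splitter applied to the Dr. Jones example. The class name NaiveSentenceSplitter and the regular expression are illustrative assumptions, not taken from any library.

```java
class NaiveSentenceSplitter {

    // Splits at whitespace that directly follows '.', '!' or '?'.
    // This ignores context, so abbreviations and factorials break it.
    static String[] split(String text) {
        return text.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = "How did Dr. Jones figure out 5! without recursion?";
        for (String piece : split(text)) {
            System.out.println("[" + piece + "]");
        }
        // prints:
        // [How did Dr.]
        // [Jones figure out 5!]
        // [without recursion?]
    }
}
```

The period in "Dr." and the exclamation mark in "5!" are both treated as sentence ends, which is exactly the context problem described above.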

So this is a more complex problem than you might think at first. It can be approached using machine learning methods. For example, have a look at the OpenNLP project, in particular its SentenceDetectorME class.
