Lucene Japanese Character Search

Question

I have implemented Lucene in my application, and it works very well as long as the text does not contain something like Japanese characters.

The problem is this: if I have the Japanese string こんにちは, このバイネイです and search for こ, which is its first character, the search works. If I search with more than one Japanese character (こんにち) as the token, the search fails and no document is found.

Are Japanese characters supported in Lucene? What settings do I need to make it work?

+7

c# asp.net lucene.net

Pranali desai Apr 15 '10 at 7:17

3 answers

Lucene's built-in analyzers do not support Japanese.

You need to install a separate analyzer, for example sen, which is a Java port of mecab, a rather popular Japanese analyzer, and it is fast.

There are two types available:

CJKAnalyzer, which supports Chinese, Japanese and Korean, and uses the bi-gram method.
JapaneseAnalyzer, which supports Japanese only, uses morphological analysis, and should be very fast.
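
To make the bi-gram method mentioned above more concrete, here is a minimal sketch in plain Java. It is not Lucene's CJKAnalyzer code; the class name `BigramSketch` and the method `bigrams` are made up for illustration. It only shows the general idea of splitting text into overlapping two-character tokens, so that a multi-character query can be matched against the same kind of tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates bi-gram tokenization.
final class BigramSketch {

    // Splits text into overlapping two-character tokens (bi-grams).
    // Works on code points so that characters outside the BMP stay intact.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) {
            // A single character cannot form a pair; emit it as its own token.
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [こん, んに, にち, ちは]
        System.out.println(bigrams("こんにちは"));
    }
}
```

With this scheme, a query such as こんにち would be split the same way (こん, んに, にち), and every one of those tokens also appears in the indexed text. A real analyzer does more than this, for example handling mixed scripts and punctuation.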

+4

YOU Apr 15 '10 at 7:23

You should use the new Japanese analyzers recently released in Lucene 3.6.0. They are based on the excellent Kuromoji morphological analyzer, which was recently donated to Lucene in LUCENE-3305.

Documentation is a bit sparse as of this writing, so here are some more links:

If you are using Solr, here is a sample schema that will work on Websolr.
Slides from my April 20, 2012 presentation for Heroku JP on full-text search, with a focus on Japanese language analysis.

(This is all for the Java version of Lucene.)

0

Nick zadrozny Apr 30 '12 at 18:08

Dean harding · Accepted Answer · 2010-04-15T07:43:06+0000

I do not think there can be an analyzer that works for all languages. The problem is that different languages have different rules about word boundaries and stemming (for example, Thai does not use spaces to separate words). Or, if there is one, I certainly would not want to be its maintainer!

What you will need to do is "tag" blocks of text as one language or another and use the right analyzer for that particular language. You can try to detect the language "automatically" by analyzing the characters (i.e., text that consists mainly of Japanese katakana is probably Japanese).
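
The character-analysis idea can be sketched in plain Java with `Character.UnicodeScript` from the standard library. This is only a rough heuristic under stated assumptions: the class name `ScriptGuessSketch`, the method `looksJapanese`, and the "more than half of the letters are kana" threshold are all invented for illustration. Text written mostly in kanji would not be recognized, because Han characters are shared with Chinese.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: guesses whether text is Japanese by counting kana.
final class ScriptGuessSketch {

    // Returns true when more than half of the letters are hiragana or katakana.
    // The 50% threshold is an arbitrary assumption, not a tuned value.
    static boolean looksJapanese(String text) {
        int kana = 0;
        int letters = 0;
        for (int cp : text.codePoints().toArray()) {
            if (!Character.isLetter(cp)) {
                continue; // skip spaces, digits and punctuation
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana++;
            }
        }
        return letters > 0 && kana * 2 > letters;
    }

    public static void main(String[] args) {
        System.out.println(looksJapanese("こんにちは, このバイネイです")); // true
        System.out.println(looksJapanese("Hello, world"));              // false
    }
}
```

A block tagged this way could then be routed to a Japanese analyzer, while other blocks go to whichever analyzer suits their language.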
