Lucene Japanese Character Search

I have implemented Lucene in my application, and it works very well as long as the input does not contain something like Japanese characters.

The problem is that if I have a Japanese string such as こんにちは, この[?]イ[?]イです and I search for こ, which is the first character, it works well, whereas if I search with a token of more than one Japanese character (こんにち), the search fails and no document is found.

Are Japanese characters supported in Lucene? What settings do I need to change to make it work?

+7
3 answers

I do not think there can be an analyzer that will work for all languages. The problem is that different languages have different rules about word boundaries and where they occur (for example, Thai does not use spaces to separate words). Or, if there is one, I certainly would not want to be the one maintaining it!

What you will need to do is "tag" blocks of text as one language or another and use the right analyzer for that particular language. You can try to detect the language "automatically" by analyzing the characters (i.e., text that uses mostly Japanese katakana is probably Japanese).
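The character analysis suggested above could be sketched roughly as follows. This is only an illustration using the JDK's `Character.UnicodeScript`; the class name `ScriptGuess`, the method `looksJapanese`, and the "more than half of the letters are kana" threshold are all assumptions made for the example, not something Lucene provides.

```java
import java.lang.Character.UnicodeScript;

final class ScriptGuess {

    /**
     * Hypothetical heuristic: returns true when more than half of the
     * letters in the text are Hiragana or Katakana.
     */
    static boolean looksJapanese(String text) {
        int kana = 0;
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip spaces, digits, and punctuation
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana++;
            }
        }
        return letters > 0 && kana * 2 > letters;
    }

    public static void main(String[] args) {
        // "\u3053\u3093\u306B\u3061\u306F" is the Hiragana string from the question
        System.out.println(looksJapanese("\u3053\u3093\u306B\u3061\u306F")); // true
        System.out.println(looksJapanese("hello world"));                    // false
    }
}
```

Kanji are reported as `HAN`, the same script as Chinese, so text written mostly in kanji would not pass this check. A real tagger would need a better rule than a simple kana majority.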

+3

The built-in Lucene analyzer does not support Japanese.

You need to install a separate analyzer, for example sen, which is the Java port of mecab, a rather popular Japanese analyzer, and it is fast.

There are two options:

  • CJKAnalyzer, which also supports Chinese and Korean and uses the bi-gram method.
  • JapaneseAnalyzer, which supports Japanese using a morphological analyzer and should be very fast.
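To show what the bi-gram method means, here is a simplified sketch in plain Java. It is not Lucene's CJKAnalyzer code. The class `BigramSketch` and the method `bigrams` are hypothetical names, and the sketch ignores whitespace, punctuation, and non-CJK text, which a real analyzer has to handle.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {

    /** Splits text into overlapping two-character tokens (by code point). */
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(text); // a lone character becomes its own token
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // code points i and i + 1
        }
        return out;
    }

    public static void main(String[] args) {
        // Indexed text: the five Hiragana characters from the question
        List<String> indexed = bigrams("\u3053\u3093\u306B\u3061\u306F");
        // Query: the first four of those characters
        List<String> query = bigrams("\u3053\u3093\u306B\u3061");
        System.out.println(indexed.size());             // 4
        System.out.println(indexed.containsAll(query)); // true
    }
}
```

Because the index and the query are split in the same way, every bi-gram of a multi-character query also appears among the bi-grams of the indexed text, so the match does not depend on knowing where the word boundaries are.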
+4

You should use the new Japanese analyzers recently released in Lucene 3.6.0. They are based on the excellent Kuromoji morphological analyzer, which was recently donated to Lucene in LUCENE-3305.

Documentation is a bit sparse as of this writing, so here are some more links...

  • If you are using Solr, here is a sample schema that will work on Websolr.
  • Slides from my presentation on April 20, 2012 at herokip, on full-text search with a focus on Japanese-language analysis.

(This is all for the Java version of Lucene.)

0