Indexing multilingual content with Lucene.net

Question

I use Lucene.net to index content, documents, etc. on websites. The index is very simple and has this format:

  LuceneId - unique id for Lucene (TypeId + ItemId)
  TypeId - the type of text (e.g. page content, product, public doc, etc.)
  ItemId - the web page id, document id, etc.
  Text - the text indexed
  Title - web page title, document name, etc. to display with the search results

I have these options to adapt it to serve multilingual content:

1. Create a separate index for each language, for instance Lucene-enGB, Lucene-frFR, etc.
2. Keep a single index and add an additional "language" field to it to filter the results.

Which option is better - or is there another? I have not used multiple indexes before, so I'm leaning towards the second.
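
To make the second option concrete, here is a small library-independent sketch in Python (a toy model, not Lucene.net code). The `Doc` class, the `language` field values, and the `search` helper are hypothetical names used only for illustration. The ids here also include the language so that translations of the same item do not collide, which is an assumption rather than part of the schema above. In Lucene terms, the filter would roughly correspond to a required term clause on the language field.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    lucene_id: str  # TypeId + ItemId (+ language, an assumption made here)
    language: str   # hypothetical extra field, e.g. "en-GB"
    title: str
    text: str


def search(index, term, language):
    """Return titles of documents in `language` whose text contains `term`."""
    term = term.lower()
    return [
        d.title
        for d in index
        if d.language == language and term in d.text.lower().split()
    ]


# One list plays the role of the single shared index.
index = [
    Doc("page-1-en", "en-GB", "Our catalogue", "Browse our catalogue"),
    Doc("page-1-fr", "fr-FR", "Notre catalogue", "Parcourez notre catalogue"),
]

print(search(index, "catalogue", "fr-FR"))  # only the French page matches
```

The same word exists in both documents, but the language field keeps the English page out of the French results.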

+4

search localization lucene.net multilingual

Nick Feb 16 '09 at 14:06

2 answers

You can rule out options 1 and 2.
You can use a single index and, for each field that may contain Arabic words, create two entries. For example, if you have a Text field that may contain Arabic or English content:

Create two fields for "Text": one field, "Text", indexed as usual with your standard analyzer, and the other, "Text_AR", indexed with the Arabic analyzer. To achieve this, you can use PerFieldAnalyzerWrapper.
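
The idea behind a per-field analyzer wrapper can be sketched without any library, as a toy Python model (this is not the Lucene.net API). The `PerFieldAnalyzer` class and both analyzer functions are hypothetical stand-ins: the "Arabic" one only strips a leading definite article, whereas a real Arabic analyzer does considerably more.

```python
import re


def standard_analyzer(text):
    # Lowercase and split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def arabic_analyzer(text):
    # Stand-in for an Arabic analyzer: strip the definite article "ال".
    tokens = standard_analyzer(text)
    return [t[2:] if t.startswith("ال") and len(t) > 3 else t for t in tokens]


class PerFieldAnalyzer:
    """Pick an analyzer by field name, falling back to a default."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = per_field

    def analyze(self, field, text):
        return self.per_field.get(field, self.default)(text)


analyzer = PerFieldAnalyzer(standard_analyzer, {"Text_AR": arabic_analyzer})

# The same content is stored under both fields, analyzed differently.
content = "الكتاب Lucene"
doc = {field: analyzer.analyze(field, content) for field in ("Text", "Text_AR")}
print(doc)
```

At query time the same wrapper would be used, so a search against "Text_AR" is analyzed the same way as the indexed text.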

+1

Roxanne Mar 26 '13 at 20:39

cherouvim · Accepted Answer · 2009-03-03T17:02:23+0000

I do [2], but one problem is that I cannot use different analyzers depending on the language. I combined the stop words of the languages I want, but I lose the more advanced features that a language-specific analyzer would offer, such as stemming.
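
The trade-off can be pictured with a few lines of Python (a toy model, not Lucene.net code). The stop-word lists are tiny illustrative samples and `analyze` is a hypothetical helper.

```python
# Tiny illustrative stop-word lists, not the ones any real analyzer ships with.
ENGLISH_STOP = {"the", "and", "of"}
FRENCH_STOP = {"le", "la", "et", "de"}
COMBINED_STOP = ENGLISH_STOP | FRENCH_STOP


def analyze(text, stop_words=COMBINED_STOP):
    """Lowercase, split on whitespace, and drop stop words from any language."""
    return [t for t in text.lower().split() if t not in stop_words]


print(analyze("the cat and the cats"))  # ['cat', 'cats']
print(analyze("le chat et la souris"))  # ['chat', 'souris']
```

Stop words from both languages are removed, but "cat" and "cats" stay separate terms because nothing language-specific, such as a stemmer, runs over the tokens.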
