Indexing multilingual content with Lucene.net

I use Lucene.net to index page content, documents and similar items on websites. The index is very simple and has this format:

  LuceneId - unique id for Lucene (TypeId + ItemId)
  TypeId - the type of text (e.g. page content, product, public doc, etc.)
  ItemId - the web page id, document id, etc.
  Text - the text indexed
  Title - web page title, document name, etc. to display with the search results
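
To make the layout concrete, here is a minimal, library-independent sketch of one index record in Python. It is not Lucene.net code. The class name `IndexEntry` and the "-" separator in the id are assumptions for illustration; the question only says the id is built from TypeId + ItemId.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One record in the search index, mirroring the fields listed above."""
    type_id: int   # kind of content (page, product, public doc, ...)
    item_id: int   # id of the page or document within its type
    text: str      # body text that gets indexed
    title: str     # shown alongside the search results

    @property
    def lucene_id(self) -> str:
        # Assumption: join with a separator so that (1, 23) and (12, 3)
        # do not both collapse to the same string "123".
        return f"{self.type_id}-{self.item_id}"


entry = IndexEntry(type_id=1, item_id=23, text="Welcome to our shop", title="Home")
print(entry.lucene_id)
```

Plain concatenation without a separator would let two different (TypeId, ItemId) pairs produce the same id, which is why the sketch adds one.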

I have these options to adapt it to serve multilingual content:

  • Create a separate index for each language, for instance Lucene-enGB, Lucene-frFR, etc.
  • Keep a single index and add an additional "language" field to it to filter the results.
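
The second option can be pictured with a toy in-memory index in Python. This is only a sketch of the idea, not Lucene.net code: `TinyIndex`, its methods and the language codes are hypothetical, and putting the language into the id is an assumption made so that two translations of the same item do not overwrite each other.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory index: one shared index, each entry tagged with a language."""

    def __init__(self):
        self.entries = {}                  # lucene_id -> stored fields
        self.postings = defaultdict(set)   # token -> set of lucene_ids

    def add(self, type_id, item_id, language, title, text):
        # Assumption: the language is part of the key, so the English and
        # French versions of the same item are stored side by side.
        lucene_id = f"{type_id}-{item_id}-{language}"
        self.entries[lucene_id] = {"language": language, "title": title}
        for token in text.lower().split():
            self.postings[token].add(lucene_id)
        return lucene_id

    def search(self, term, language):
        # Match on the text first, then keep only hits in the requested language.
        hits = self.postings.get(term.lower(), set())
        return sorted(i for i in hits if self.entries[i]["language"] == language)


index = TinyIndex()
index.add(1, 10, "en-GB", "Contact", "contact information and address")
index.add(1, 10, "fr-FR", "Contact", "information de contact et adresse")

print(index.search("information", "en-GB"))
print(index.search("information", "fr-FR"))
```

In Lucene.net the same filtering is typically expressed by combining the user's query with a required term clause on the language field, so only entries for the current language are returned.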

Which option is better - or is there another? I have not used multiple indexes before, so I'm leaning towards the second.

+4
2 answers

I do [2], but one problem is that I cannot use a different analyzer for each language. I combined the stop words of the languages I need, but I lose the more advanced features that language-specific analyzers offer, such as stemming.

+2

You can rule out both option 1 and option 2.
You can use a single index and, for each field that may contain Arabic words, create two fields. For example, if you have a Text field that may hold either Arabic or English content:

  • Create two fields for "Text": one field, "Text", indexed as normal with your standard analyzer, and the other, "Text_AR", indexed with the Arabic analyzer. To achieve this, you can use PerFieldAnalyzerWrapper.
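
The two-field idea can be sketched in plain Python as follows. This is a simplified model of per-field analysis, not the Lucene.net API: `make_analyzer`, `PerFieldAnalyzer` and the tiny stop-word sets are hypothetical stand-ins, and a real Arabic analyzer does much more than stop-word removal.

```python
import re

ENGLISH_STOP_WORDS = {"the", "a", "of", "and"}   # illustrative subset only
ARABIC_STOP_WORDS = {"في", "من", "على"}           # illustrative subset only


def make_analyzer(stop_words):
    """Return a function that tokenizes text and drops the given stop words."""
    def analyze(text):
        tokens = re.findall(r"\w+", text.lower())
        return [t for t in tokens if t not in stop_words]
    return analyze


class PerFieldAnalyzer:
    """Chooses an analyzer by field name, falling back to a default."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = per_field

    def analyze(self, field_name, text):
        return self.per_field.get(field_name, self.default)(text)


analyzer = PerFieldAnalyzer(
    default=make_analyzer(ENGLISH_STOP_WORDS),
    per_field={"Text_AR": make_analyzer(ARABIC_STOP_WORDS)},
)

content = "the history of Cairo"
# The same content is indexed twice, once per field, each with its own analyzer.
document = {
    "Text": analyzer.analyze("Text", content),
    "Text_AR": analyzer.analyze("Text_AR", content),
}
print(document)
```

At search time the query would be run against both fields, so English text matches through "Text" and Arabic text through "Text_AR". In Lucene.net, PerFieldAnalyzerWrapper plays the role of the `PerFieldAnalyzer` class here.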
+1
