Multilingual search using Lucene

I am implementing multilingual search and plan to use Lucene for it.

The content is already translated; each document exists in 3 or 4 languages.

I can see 4 strategies for indexing and searching. For each document / piece of content:

  • each language is indexed in its own index / directory.
  • each language is indexed in its own document, but in the same index.
  • each language is indexed in its own field, but in the same document.
  • all languages are indexed in one field of the same document.

I have not tested any of them yet. Can anyone tell me which one works best for multilingual search?

Thanks!

2 answers

In short, it depends on your needs, but I would go with option 3 or 1.

1) is probably the best choice if the languages share no common fields at all.

3) makes sense if several fields are shared between languages, since it saves disk space and lets more of the index fit in the file system cache.

I would not recommend 2): it makes your search queries more complicated and makes Lucene consider more documents.

4) makes querying very difficult unless you want users to be able to search across all languages without selecting one first.


Although this was asked a couple of years ago, it is still a good question.

Several aspects need to be considered when choosing an approach:

  • Are language-specific analyzers used during indexing?
  • Is the query language always known (for example, selected by the user)?
  • Does the query language always match one of the content languages?
  • Should only content that matches the query language be returned?
  • Is relevance important?

If the first and last points apply to your project (language-specific analyzers are used and relevance is important), you should not consider any strategy that reuses the same field for several languages in the same inverted index, because the term frequencies of the different languages get mixed together (regardless of whether you index multilingual content as one document or as several). Note that adding n language fields does not make the index n times larger, although it obviously comes with some overhead.
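To make the point about mixed term statistics concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class name `MixedFieldStats`, the whitespace tokenizer, and the sample sentences are all made up for illustration. It counts how many documents contain the token "die" (an English verb, a German article), once per language and once for a single shared field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of per-field document frequency; not Lucene code. */
final class MixedFieldStats {

    // Made-up sample values: one field value per document.
    static final List<String> ENGLISH = Arrays.asList(
            "the disks die early", "the server restarts", "the index is small");
    static final List<String> GERMAN = Arrays.asList(
            "die festplatte ist voll", "die suche ist schnell", "die anfrage ist leer");

    /** Lowercases and splits on whitespace; a stand-in for a real analyzer. */
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(
                text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Number of field values that contain the term at least once. */
    static int docFreq(List<String> fieldValues, String term) {
        int count = 0;
        for (String value : fieldValues) {
            if (terms(value).contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Both languages poured into one shared field (strategies 2 and 4). */
    static List<String> mixed() {
        List<String> all = new ArrayList<>(ENGLISH);
        all.addAll(GERMAN);
        return all;
    }

    public static void main(String[] args) {
        System.out.println("en field:     " + docFreq(ENGLISH, "die") + " of " + ENGLISH.size());
        System.out.println("de field:     " + docFreq(GERMAN, "die") + " of " + GERMAN.size());
        System.out.println("shared field: " + docFreq(mixed(), "die") + " of " + mixed().size());
    }
}
```

With these samples, "die" occurs in 1 of 3 English values and in 3 of 3 German values, but in 4 of 6 values of the shared field. Any frequency-based weighting computed on the shared field therefore sees a number that is right for neither language.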


One field (Strategies 2 and 4)

  + only one field to query
  + scales well for additional languages
  + can distinguish/filter languages (if multiple documents, and extra language field)
  - cannot distinguish/filter languages (if single document)
  - cannot just display the queried language (if single document)
  - "wrong" term frequencies (as all languages mixed up)

Multiple Fields (Strategy 3)

  + correct term frequencies
  + can easily restrict/filter queries for particular language(s)
  + facilitates Auto-Complete & Spellcheck / Did-You-Mean
  - more fields to index
  - more fields to query
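As a rough illustration of the multiple-fields idea, here is a toy inverted index in plain Java with one field per language and one analyzer per field. Nothing here is Lucene API: the field names `body_en` / `body_de`, the stop word lists, and the class `PerLanguageFields` are assumptions made up for the sketch. Lucene's `PerFieldAnalyzerWrapper` serves a similar purpose in a real setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy per-field inverted index; a stand-in for Lucene fields, not Lucene code. */
final class PerLanguageFields {

    // Hypothetical stop word lists; real analyzers also stem, normalize, and more.
    static final Set<String> EN_STOP = new HashSet<>(Arrays.asList("the", "is", "a"));
    static final Set<String> DE_STOP = new HashSet<>(Arrays.asList("die", "der", "das", "ist"));

    /** Builds a lowercase, whitespace-splitting analyzer that drops stop words. */
    static Function<String, List<String>> analyzer(Set<String> stopWords) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!stopWords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    // One analyzer per field name.
    final Map<String, Function<String, List<String>>> analyzers = new HashMap<>();
    // field name -> term -> ids of documents containing the term in that field
    final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    PerLanguageFields() {
        analyzers.put("body_en", analyzer(EN_STOP));
        analyzers.put("body_de", analyzer(DE_STOP));
    }

    /** Indexes every language field of one document with that field's analyzer. */
    void add(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Function<String, List<String>> analyze = analyzers.get(field.getKey());
            if (analyze == null) {
                throw new IllegalArgumentException("no analyzer for " + field.getKey());
            }
            Map<String, Set<Integer>> terms =
                    postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String term : analyze.apply(field.getValue())) {
                terms.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Searches only the field of the given language; any query term may match. */
    Set<Integer> search(String language, String query) {
        String field = "body_" + language;
        Function<String, List<String>> analyze = analyzers.get(field);
        if (analyze == null) {
            throw new IllegalArgumentException("unsupported language " + language);
        }
        Set<Integer> hits = new TreeSet<>();
        Map<String, Set<Integer>> terms = postings.get(field);
        if (terms == null) {
            return hits;
        }
        for (String term : analyze.apply(query)) {
            Set<Integer> ids = terms.get(term);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        return hits;
    }

    /** Two made-up documents, each with an English and a German field. */
    static PerLanguageFields demo() {
        PerLanguageFields index = new PerLanguageFields();
        Map<String, String> doc1 = new HashMap<>();
        doc1.put("body_en", "the disks die early");
        doc1.put("body_de", "die festplatten fallen aus");
        index.add(1, doc1);
        Map<String, String> doc2 = new HashMap<>();
        doc2.put("body_en", "the index is small");
        doc2.put("body_de", "der index ist klein");
        index.add(2, doc2);
        return index;
    }

    public static void main(String[] args) {
        PerLanguageFields index = demo();
        System.out.println("en 'die':   " + index.search("en", "die"));
        System.out.println("de 'die':   " + index.search("de", "die"));
        System.out.println("de 'index': " + index.search("de", "index"));
    }
}
```

In this sketch the English query "die" finds document 1, while the same query against the German field finds nothing, because the German analyzer treats "die" as a stop word. Each field keeps its own postings, so the languages never share statistics.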

Multiple Indices (Strategy 1)

  + correct term frequencies
  + can easily restrict/filter queries for particular language(s)
  + facilitates Auto-Complete & Spellcheck / Did-You-Mean
  - each additional language requires its own index

Regardless of whether you use one field or several, if you index your content as multiple documents, your solution may need to collapse or filter out matches in the "wrong" language. One approach is to add a language field and filter on it.
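The language-field idea can be sketched in plain Java as follows. This is a toy model, not Lucene code, and the names `LanguageFilterDemo`, `Doc`, `contentId`, and `lang` are hypothetical. In Lucene itself I would expect the language to live in an un-analyzed field and to be applied as a non-scoring filter clause, but that mapping is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of one document per (content, language) pair; not Lucene code. */
final class LanguageFilterDemo {

    static final class Doc {
        final int contentId;   // shared by all translations of the same content
        final String lang;     // the extra language field
        final String text;

        Doc(int contentId, String lang, String text) {
            this.contentId = contentId;
            this.lang = lang;
            this.text = text;
        }
    }

    // Made-up sample index: "information" is spelled the same in English and French.
    static final List<Doc> INDEX = Arrays.asList(
            new Doc(1, "en", "product information page"),
            new Doc(1, "fr", "page information produit"),
            new Doc(2, "en", "contact information"),
            new Doc(2, "de", "kontakt informationen"));

    /** True if the whitespace-tokenized, lowercased text contains the term. */
    static boolean matches(Doc doc, String term) {
        List<String> tokens = Arrays.asList(doc.text.toLowerCase(Locale.ROOT).split("\\s+"));
        return tokens.contains(term.toLowerCase(Locale.ROOT));
    }

    /** Documents containing the term; pass null as lang to search every language. */
    static List<Doc> search(List<Doc> index, String term, String lang) {
        List<Doc> hits = new ArrayList<>();
        for (Doc doc : index) {
            boolean langOk = (lang == null) || lang.equals(doc.lang);
            if (langOk && matches(doc, term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Collapses hits to one entry per content id, keeping first-seen order. */
    static Set<Integer> collapse(List<Doc> hits) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (Doc doc : hits) {
            ids.add(doc.contentId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> unfiltered = search(INDEX, "information", null);
        System.out.println("unfiltered hits: " + unfiltered.size());
        System.out.println("collapsed ids:   " + collapse(unfiltered));
        System.out.println("en-only hits:    " + search(INDEX, "information", "en").size());
    }
}
```

Without a filter, the query "information" matches three documents, two of which are translations of the same content. Filtering on the language field leaves the two English documents, and collapsing by content id leaves one entry per piece of content.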

Recommendation: The approach / strategy you choose depends on the project requirements. Whenever possible, I would choose an approach with multiple fields or multiple indexes.

