Although it was asked a couple of years ago, this is still a great question.
There are several aspects to consider when choosing between the different approaches:
1. Are language-specific analyzers used during indexing?
2. Is the query language always known (for example, selectable by the user)?
3. Does the query language always match one of the content languages?
4. Do you need to return only content that matches the query language?
5. Is relevance important?
If (1.) and (5.) apply to your project, you should not consider any strategy that (re)uses the same field for several languages in the same inverted index, since the term frequencies of the different languages all get mixed up (regardless of whether you index multilingual content as one document or as several documents). It may be interesting to know that adding "n" language fields does not result in an "n"-times larger index, although for obvious reasons it does come with some overhead.
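To make the mixed term frequency problem concrete, here is a minimal sketch in plain Python. It is not tied to any search engine: the tiny corpus, the whitespace tokenizer, and the plain `log(N / df)` formula are all assumptions chosen for illustration. It compares the inverse document frequency of the word "die" (a rare verb in the English documents, a very common article in the German ones) when both languages share one field versus when English has its own field.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (language, text) pairs.
DOCS = [
    ("en", "the actors die in the final scene"),
    ("en", "a quiet film about a small town"),
    ("en", "the town hosts a film festival"),
    ("de", "die stadt und die menschen"),
    ("de", "die kinder sehen den film"),
    ("de", "die schule in der stadt"),
]


def document_frequencies(texts):
    """Count in how many texts each term occurs (whitespace tokenization)."""
    df = Counter()
    for text in texts:
        df.update(set(text.split()))
    return df


def idf(term, df, n_docs):
    """Plain inverse document frequency; real engines use smoothed variants."""
    return math.log(n_docs / df[term])


# One shared field: statistics are computed over all languages together.
shared_df = document_frequencies(text for _, text in DOCS)
shared_idf = idf("die", shared_df, len(DOCS))

# One field per language: statistics are computed over English texts only.
english_texts = [text for lang, text in DOCS if lang == "en"]
english_df = document_frequencies(english_texts)
english_idf = idf("die", english_df, len(english_texts))

print(f"idf('die') in shared field:  {shared_idf:.2f}")
print(f"idf('die') in English field: {english_idf:.2f}")
```

In the shared field the German documents make "die" look like a common term, so an English query for it is weighted lower than it would be against English-only statistics.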
One field (Strategies 2 and 4)
+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)
Multiple Fields (Strategy 3)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
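As a rough illustration of the "more fields to query" trade-off, the sketch below picks which per-language fields a query should target. The `<base>_<language>` naming scheme, the language list, and the function name are hypothetical and not taken from any particular engine.

```python
# Hypothetical set of content languages, each with its own analyzed field.
LANGUAGES = ("en", "de", "fr")


def fields_to_query(base_field, query_language=None, languages=LANGUAGES):
    """Return the per-language field names a query should be run against.

    If the query language is known, only that language's field is searched;
    otherwise every language field has to be queried.
    """
    if query_language is not None:
        if query_language not in languages:
            raise ValueError(f"unsupported language: {query_language}")
        return [f"{base_field}_{query_language}"]
    return [f"{base_field}_{lang}" for lang in languages]


print(fields_to_query("title", "de"))
print(fields_to_query("title"))
```

When the query language is known, restricting the search is just a matter of choosing one field; when it is not, the query fans out over all of them.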
Multiple Indices (Strategy 1)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
Regardless of whether you use one field or several, if you index your content as multiple documents your solution may need to collapse results so that matches in the "wrong" language are handled properly. One approach is to add a language field and filter on it.
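A minimal sketch of both ideas, filtering on a language field and collapsing translations of the same content into one result, could look like this. The `Hit` structure, the `content_id` key shared by all translations of one item, and the preference rule are assumptions for illustration; a real search engine would typically do this on the server side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    content_id: str  # hypothetical key shared by all translations of one item
    language: str    # value of the extra language field
    score: float


def filter_by_language(hits, language):
    """Keep only hits whose language field matches the query language."""
    return [hit for hit in hits if hit.language == language]


def collapse_by_content(hits, preferred_language):
    """Keep one hit per content item, preferring the query language.

    Among translations of the same item, a hit in the preferred language
    wins; otherwise the highest-scoring translation is kept.
    """
    def rank(hit):
        return (hit.language == preferred_language, hit.score)

    best = {}
    for hit in hits:
        current = best.get(hit.content_id)
        if current is None or rank(hit) > rank(current):
            best[hit.content_id] = hit
    return sorted(best.values(), key=lambda hit: hit.score, reverse=True)


hits = [
    Hit("a", "en", 1.0),
    Hit("a", "de", 2.0),
    Hit("b", "de", 0.5),
]
print(filter_by_language(hits, "en"))
print(collapse_by_content(hits, "en"))
```

Filtering drops item "b" entirely because it has no English version, while collapsing still returns it once, in the only language available.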
Recommendation: The approach/strategy to choose depends on the project requirements. Whenever possible, I would choose an approach with multiple fields or multiple indices.
Daniel Schneiter