A multilingual free text search in an application with normalized data?

Question

Our application's database contains enumerations, free-text fields, links, and so on.

Each enumeration has its own translations, and free text can be in any language. We would like to run efficient large-scale free-text searches, as well as searches based on enumeration values.

I know of solutions like Solr, which are good, but using one would mean indexing every denormalized record with all of its text in every language in the system. That seems a bit excessive.

What are the recommended approaches to searching multilingual normalized data? Has anyone dealt with this before?

web-applications search full-text-search normalization multilingual

sym3tri Apr 22 '11 at 10:25

1 answer

Michael Dillon · Answer 1 · 2011-08-04T04:50:34+0000

ETL: Extract, Transform, Load. In other words, extract the data from your existing databases, transform it (which is more than just denormalizing it), and load it into Solr. The Solr index will be much smaller than the existing databases because there is no relational overhead, and searching in Solr will take most of the search load off your existing database servers.
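As a rough illustration of the three steps, here is a minimal Python sketch. The table names (`listing`, `status_translation`), the column names, and the sample rows are all invented for the example, and an in-memory SQLite database stands in for the real source. The load step only builds a JSON array of documents; sending it to Solr (for instance by POSTing it to a core's update handler) is left out and would depend on your setup.

```python
import json
import sqlite3

def extract(conn):
    """Extract: join each listing with the translation of its enumeration value."""
    sql = """
        SELECT l.id, l.lang, l.body, t.label
        FROM listing AS l
        JOIN status_translation AS t
          ON t.status_id = l.status_id AND t.lang = l.lang
        ORDER BY l.id
    """
    return conn.execute(sql).fetchall()

def transform(rows):
    """Transform: flatten each joined row into one search document."""
    return [
        {"id": str(row_id), "lang": lang, "body": body, "status": label}
        for row_id, lang, body, label in rows
    ]

def load_payload(docs):
    """Load: serialize the documents as a JSON array for the search index."""
    return json.dumps(docs, ensure_ascii=False)

def demo():
    """Run extract and transform against a tiny, made-up normalized schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE listing (id INTEGER PRIMARY KEY, lang TEXT, body TEXT, status_id INTEGER);
        CREATE TABLE status_translation (status_id INTEGER, lang TEXT, label TEXT);
        INSERT INTO listing VALUES (1, 'en', 'Sunny flat near the park', 10);
        INSERT INTO listing VALUES (2, 'de', 'Helle Wohnung am Park', 10);
        INSERT INTO status_translation VALUES (10, 'en', 'available');
        INSERT INTO status_translation VALUES (10, 'de', 'verfügbar');
    """)
    docs = transform(extract(conn))
    conn.close()
    return docs
```

Calling `load_payload(demo())` yields one flat document per listing, with the enumeration label already resolved into the listing's own language.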

Take a look at how to configure and use Solr, and learn about Solr cores. You may want to put some languages in separate cores, because that lets you use Solr's different stemming algorithms more effectively. But even with multilingual data in a single index, you can still use bigrams (as is done, for example, when analyzing Chinese).
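To make the bigram idea concrete, here is a small Python sketch of overlapping character bigrams. It is only an illustration of the concept; Solr's own CJK analysis is more involved, and the function name is made up for this example.

```python
def char_bigrams(text):
    """Return overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Too short to form a pair: keep the single character, if any.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

Because no word boundaries are needed, the same tokenization works for text whose language is unknown or has no spaces between words, at the cost of a larger and less precise index than language-specific stemming would give.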

Having multiple cores makes searching more complex, because a query can go either to a single-language index or to an all-languages index. But it is much more effective to group data by language and apply language-specific stop words, protected words, stemmers, and other language analysis tools.
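One way to handle that choice on the client side is sketched below in Python. The core names, the catch-all core, and the base URL are assumptions for the example, not part of any real deployment; only the general `/select` handler with `q` and `wt` parameters is standard Solr.

```python
from urllib.parse import urlencode

# Hypothetical setup: one core per language plus one catch-all core.
LANGUAGE_CORES = {"en": "listings_en", "de": "listings_de", "zh": "listings_zh"}
ALL_LANGUAGES_CORE = "listings_all"
BASE_URL = "http://localhost:8983/solr"  # assumed local Solr address

def select_url(query, lang=None):
    """Build a select URL for one language's core, or the catch-all core."""
    # Unknown or missing languages fall back to the all-languages core.
    core = LANGUAGE_CORES.get(lang, ALL_LANGUAGES_CORE)
    params = urlencode({"q": query, "wt": "json"})
    return f"{BASE_URL}/{core}/select?{params}"
```

When the user's language is known, the query goes to the core with the matching analysis chain; otherwise it falls back to the broader, less precisely analyzed index.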

Usually you include some key data in the index so that when you find a record through a Solr search, you can go straight to the source row in the database. In addition, you can keep normalized and unnormalized data side by side: for example, an enumeration can be written to a normalized field in English as well as to an unnormalized field in the same language as the free text. A field can also be duplicated so that two different analysis and filtering chains are applied to it.
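A possible document layout along these lines is sketched below in Python. Every field name and the `listing-` id prefix are hypothetical. In Solr itself, the field duplication is often declared in the schema with a copyField rule rather than done in the loader; it is shown here in the loader only to keep the example self-contained.

```python
def to_index_doc(row):
    """Map one source row to index fields (all field names are hypothetical)."""
    return {
        "id": f"listing-{row['id']}",         # key pointing back to the source row
        "status_en": row["status_en"],        # enumeration in one fixed language
        "status_local": row["status_local"],  # enumeration in the free text's language
        "body": row["body"],                  # free text for language-specific analysis
        "body_exact": row["body"],            # same text, for a second analysis chain
    }

def source_id(doc_id):
    """Recover the database primary key from an index document id."""
    prefix, _, raw = doc_id.partition("-")
    if prefix != "listing" or not raw.isdigit():
        raise ValueError(f"unexpected document id: {doc_id!r}")
    return int(raw)
```

After a search, `source_id(hit["id"])` gives the primary key to look up in the original database, so the index never has to hold the full record.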

It is worth trying this out on a subset of your data to learn how Solr behaves and how best to configure it.
