Solr / Sunspot: determine the indexing language at runtime and select analyzers dynamically

I would like to use Solr + Sunspot to index a bilingual FR-EN site. The problem: a Post model can be written in either French or English. At run time I can determine which language a given post is in, but I also need Solr to index it accordingly.

For example, for French posts I need a French stemmer:

<filter class="solr.SnowballPorterFilterFactory" language="French"/> 

What are my options? Can I modify Solr analyzers at runtime? Can I make a set of analyzers for each language?

+7
2 answers

This is a great question, and a feature that has been discussed for inclusion in Sunspot.

Sunspot relies on dynamic field naming conventions to configure its schema. For example, here are two existing definitions for text fields:

 <dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
 <dynamicField name="*_texts" stored="true" type="text" multiValued="true" indexed="true"/>

Both refer to the fieldType named "text", defined earlier in the schema:

 <fieldType name="text" class="solr.TextField" omitNorms="false">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StandardFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

You can add a similar fieldType for each language you want to index (as Mauricio mentions), and then set up new dynamicField definitions that use them.

1. A fieldType for a French text field

 <fieldType name="text_fr" class="solr.TextField" omitNorms="false">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StandardFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- Stem after lowercasing: the Snowball stemmers expect lowercase input -->
     <filter class="solr.SnowballPorterFilterFactory" language="French"/>
   </analyzer>
 </fieldType>

2. A dynamicField for a French text field

 <dynamicField name="*_text_fr" stored="false" type="text_fr" multiValued="true" indexed="true"/>
 <dynamicField name="*_texts_fr" stored="true" type="text_fr" multiValued="true" indexed="true"/>

3. Using a French text field in Sunspot

The latest Sunspot 1.2 (not yet fully released; use 1.2.rc4) supports the :as option, which lets you specify the exact Solr field name.

 searchable do
   text :description, :as => 'description_text_fr'
 end
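Since the language is only known per record, one possible approach is to derive the Solr field name from the language code and declare one text field per language. The sketch below is illustrative only: `TEXT_SUFFIXES`, `solr_text_field`, and the `language` attribute on the model are hypothetical names, and the "en" entry assumes English content goes into the default `*_text` field.

```ruby
# Hypothetical mapping from language code to the dynamicField suffix
# declared in schema.xml. "en" assumes the default *_text field is used
# for English content.
TEXT_SUFFIXES = {
  "fr" => "_text_fr",
  "en" => "_text"
}.freeze

# Returns the Solr field name for a base attribute and a language code.
# Raises ArgumentError when no field is configured for that language.
def solr_text_field(base, lang)
  suffix = TEXT_SUFFIXES.fetch(lang.to_s) do
    raise ArgumentError, "no Solr text field configured for #{lang.inspect}"
  end
  "#{base}#{suffix}"
end

# Possible use inside a model (requires Sunspot; not executed here).
# Assumes the model has a `language` attribute and that a field whose
# block returns nil is simply left unindexed for that record:
#
#   searchable do
#     text :description_fr, :as => solr_text_field(:description, :fr) do
#       description if language == "fr"
#     end
#     text :description_en, :as => solr_text_field(:description, :en) do
#       description if language == "en"
#     end
#   end
```

With this layout each post is analyzed by exactly one language-specific field, and searches need to target the field that matches the query language.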

As I said, this is something I would like to add in Sunspot 1.3 or 1.4. Personally, I would like to see something like :lang => :en in the text field definition to select the appropriate field type. Feel free to share your thoughts on the Sunspot mailing list!

+10

I can't say anything about Sunspot, but in pure Solr I would create separate field types in your Solr schema (one fieldType for French, another for English), then create one field for English content (using the English fieldType) and another field for French content (using the French fieldType).

Since you know which language to use at run time, you simply pick the matching field to run the search against and get the results.
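As a rough illustration of picking the field at query time, the sketch below builds a query string for Solr's select handler and sets the default search field (`df`) according to the language. The field names `content_fr` and `content_en` and the helper `solr_query_string` are hypothetical; substitute whatever fields your schema defines.

```ruby
require "uri"

# Hypothetical field names: one Solr field per language, each bound to
# its own fieldType in the schema.
SEARCH_FIELDS = {
  "fr" => "content_fr",
  "en" => "content_en"
}.freeze

# Builds the query string for a Solr select request that searches only
# the field matching the language known at run time.
# Raises KeyError for a language with no configured field.
def solr_query_string(terms, lang)
  field = SEARCH_FIELDS.fetch(lang.to_s)
  URI.encode_www_form("q" => terms, "df" => field, "wt" => "json")
end

puts solr_query_string("chevaux", :fr)
```

Because the query is analyzed with the same fieldType as the targeted field, French queries are stemmed by the French analyzer and English queries by the English one.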

+2
