How can I make the Solr spellchecker correct both Latin and Cyrillic words?

I allow users to type Russian words in Latin letters. If the user misspells a Russian word typed in Latin letters, I want the Solr spellchecker to suggest the correct word in Cyrillic (Russian words in the index are in Cyrillic). However, if the user mistypes a non-Russian word (for example, a company name), it should be corrected in Latin letters (non-Russian words in the index are in Latin).

For example, tilevizor smasung should be corrected to телевизор samsung.

I currently use the following configuration:

    <fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove"/>
      </analyzer>
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256"/>
      </analyzer>
    </fieldType>

This converts the query into Cyrillic letters, so correction of Russian words works, but correction of Latin words does not (tilevizor to телевизор works, but smasung to samsung does not).

Any ideas how I can get the spell checker to correct both Cyrillic and Latin words?

1 answer

I think one solution that might help here is Beider-Morse Phonetic Matching (BMPM).

Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a newer phonetic matching system.

So, for example, the words "tilevizor" and "televizor" will sound similar, and we will get a match. One thing that can be tweaked is the phonetic matching algorithm. Solr supports many of them, and I'm not sure which one works best here: DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0), ColognePhonetic or Nysiis.
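To try Beider-Morse itself rather than one of the encoders listed above, the phonetic step could be swapped for Solr's `BeiderMorseFilterFactory`. The following is only a sketch: the field type name `spell_ru_bm` is hypothetical, the attribute values are untuned starting points, and I have not verified that it handles this data better than the other encoders.

```xml
<!-- Hypothetical field type name; not part of the original configuration. -->
<fieldType name="spell_ru_bm" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <!-- A single analyzer without a type attribute is used at both index and query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Transliterate Cyrillic to Latin so both scripts reach the encoder in one alphabet. -->
    <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
    <!-- Assumed settings: generic name rules, approximate matching, automatic language guessing. -->
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>
```

Switching `ruleType` to `EXACT` should make matching stricter, which may help if the approximate rules return too many unrelated documents.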

In addition, I would change solr.ICUTransformFilterFactory to use id="Russian-Latin/BGN", which transliterates Russian characters to Latin characters much better.

    <fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
        <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
      </analyzer>
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
        <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
      </analyzer>
    </fieldType>

The field type above does the trick in many cases, for example:

    q=title:tilevizor
    SolrDocument{title= samsung, _version_=1583123812650582016}
    SolrDocument{title=televizor , _version_=1583123812667359232}

    q=title:
    SolrDocument{title= samsung, _version_=1583123812650582016}
    SolrDocument{title=televizor , _version_=1583123812667359232}

    q=title:smasung
    SolrDocument{title= samsung, _version_=1583123812650582016}
    SolrDocument{title=televizor , _version_=1583123812667359232}
    SolrDocument{title= samsung, _version_=1583123812684136448}
    SolrDocument{title=galaxy , _version_=1583123812684136449}
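The queries above search a `title` field, so that field presumably uses the `spell_ru` type. A minimal sketch of how the schema might declare it follows. The `title_exact` field and the `string` type are my assumptions, added only to illustrate keeping a non-phonetic copy for strict matching.

```xml
<!-- Assumption: title is analyzed with the phonetic spell_ru field type. -->
<field name="title" type="spell_ru" indexed="true" stored="true"/>

<!-- Hypothetical companion field holding the raw text, for exact matches. -->
<field name="title_exact" type="string" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>
```

Because phonetic encoders match loosely, querying both fields and boosting `title_exact` is one way to rank exact hits above sound-alike hits.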

I created a test class here; feel free to play with it.
