Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - analyzer results are inconsistent with query results

Question

I am currently using what I (mistakenly) thought would be a fairly simple implementation of Solr's NGramTokenizerFactory, but I am getting strange results that are inconsistent between the admin analysis page and the actual query results, and I am hoping for some guidance.

I am trying to get user input to match my NGram index (minGramSize = 2, maxGramSize = 2). My index-time and query-time schema is below, in which I:

Strip all non-alphanumeric characters using PatternReplaceCharFilter.
Tokenize using NGramTokenizerFactory.
Lowercase using LowerCaseFilterFactory (which leaves non-letter tokens in place, so my numbers will stay).

Using the schema below, I would expect a search for “PCB-1260” (with a properly escaped dash) to match the indexed, n-grammed and lowercased value of “Arochlor-1260” (i.e. the bigrams for 1260, “12 26 60”, appear in both the indexed value and the query value).

Unfortunately, I do not get any results unless I remove the dash. [EDIT - even if I escape the dash correctly and leave it in the query, I still do not get any results]. This seems odd because I am doing a complete pattern replacement of all non-alphanumeric characters with PatternReplaceCharFilter, which I assume removes all spaces and dashes.

The query analyzer on the admin page shows the correct matching with the schema below, so I am a bit lost. Is there something fundamental about PatternReplaceCharFilter or NGramTokenizerFactory that I'm missing here?

I have checked the code and other posts, but I can't seem to figure it out. After a week of banging my head against the wall, I am turning to Stack Overflow ....
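The expected overlap can be checked outside Solr. The following is a minimal Python sketch that only approximates the three stages of the field type (strip, bigram, lowercase); the function name `analyze` is hypothetical, and the sketch does not reproduce Solr's query parser or its token positions.

```python
import re

def analyze(text, n=2):
    """Approximate the field type's chain: strip, n-gram, lowercase."""
    # PatternReplaceCharFilter step: drop everything outside A-Z, a-z, 0-9.
    stripped = re.sub(r"[^A-Za-z0-9]", "", text)
    # NGramTokenizer step with minGramSize = maxGramSize = n.
    grams = [stripped[i:i + n] for i in range(len(stripped) - n + 1)]
    # LowerCaseFilter step: digits pass through unchanged.
    return [g.lower() for g in grams]

query_grams = analyze("PCB-1260")
index_grams = analyze("Arochlor-1260")
# Bigrams of the query that also occur in the indexed value.
shared = [g for g in query_grams if g in index_grams]

print(query_grams)  # ['pc', 'cb', 'b1', '12', '26', '60']
print(shared)       # ['12', '26', '60']
```

Under these assumptions the two values share exactly the bigrams “12 26 60”, which is the overlap the question expects to produce a match.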

 <fieldType name="tokentext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

regex solr n-gram

Josh Jun 23 '11 at 19:40

1 answer

Josh · Accepted Answer · 2011-07-08T18:38:56+0000

So, something is definitely odd about PatternReplaceCharFilter not removing the dash at query time. In the end I just preprocessed the user input in PHP with preg_replace before posting it to Solr and, voilà, it worked like a charm with the expected results. It's surprising that PatternReplaceCharFilter didn't behave ...

Here is the PHP preprocessing code I used to get rid of dashes, in case anyone needs it.

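 // Replace every dash in the raw user input with a space.
 $pattern = '/([-])/';
 $replacement = ' ';
 $usrpar = preg_replace($pattern, $replacement, $raw_user_search_contents);
 // Encode HTML special characters (including both quote styles) as UTF-8.
 $res = htmlentities($usrpar, ENT_QUOTES, 'utf-8');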

After that, I just passed $res to Solr ...
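For readers not using PHP, here is a rough Python sketch of the same preprocessing. The function names are hypothetical, and `html.escape` only approximates `htmlentities` (it escapes `&`, `<`, `>` and both quote styles, not every character that has a named entity). The per-chunk bigram view at the end is only an illustration: the thread does not say how Solr's query parser handles the space, so treating the two chunks separately is an assumption.

```python
import html
import re

def preprocess(raw_user_search_contents):
    """Replace dashes with spaces, then escape HTML special characters."""
    usrpar = re.sub(r"[-]", " ", raw_user_search_contents)
    # Approximation of PHP's htmlentities($usrpar, ENT_QUOTES, 'utf-8').
    return html.escape(usrpar, quote=True)

def bigrams_per_chunk(text):
    """Lowercased bigrams of each whitespace-separated chunk (assumed split)."""
    result = []
    for chunk in text.split():
        # Same character stripping as the field type's char filter.
        stripped = re.sub(r"[^A-Za-z0-9]", "", chunk)
        result.append([stripped[i:i + 2].lower() for i in range(len(stripped) - 1)])
    return result

res = preprocess("PCB-1260")
print(res)                     # PCB 1260
print(bigrams_per_chunk(res))  # [['pc', 'cb'], ['12', '26', '60']]
```

If the chunks are indeed analyzed separately, the second one yields exactly the bigrams “12 26 60” from the question.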
