I am currently using what I (mistakenly) thought would be a fairly simple implementation of Solr's NGramTokenizerFactory, but I am getting strange results that are inconsistent between the admin analyzer and the actual query results, and I am hoping for some guidance.
I am trying to match user input against my NGram index (minGramSize = 2, maxGramSize = 2). My index-time and query-time schema is below, in which:
- I strip out all non-alphanumeric characters using PatternReplaceCharFilter.
- I tokenize using NGramTokenizerFactory.
- I lowercase using LowerCaseFilterFactory (which leaves non-letter tokens in place, so my numbers stay).
Using the schema below, I would expect a search for "PCB-1260" (with a properly escaped dash) to match an indexed, lowercased NGram value of "Arochlor-1260" (i.e. the bigrams of 1260, "12 26 60", appear in both the indexed value and the query value).
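To make that expectation concrete, here is a rough pure-Python simulation of the chain described above (strip non-alphanumerics, cut into bigrams, lowercase). It is only a sketch of what I expect the analyzers to emit, not Solr itself: the `analyze` helper is a made-up name, and it ignores token positions and offsets, which the real tokenizer also records.

```python
import re

def analyze(text, min_gram=2, max_gram=2):
    """Approximate the analyzer chain: char filter -> n-gram tokenizer -> lowercase."""
    # PatternReplaceCharFilter step: drop everything that is not A-Z, a-z or 0-9.
    stripped = re.sub(r"[^A-Za-z0-9]", "", text)
    # NGramTokenizer step: every substring of length min_gram..max_gram.
    grams = [
        stripped[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(stripped) - n + 1)
    ]
    # LowerCaseFilter step: digits are unaffected, letters are lowercased.
    return [g.lower() for g in grams]

indexed = analyze("Arochlor-1260")
query = analyze("PCB-1260")
shared = sorted(set(indexed) & set(query))

print(indexed)  # ['ar', 'ro', 'oc', 'ch', 'hl', 'lo', 'or', 'r1', '12', '26', '60']
print(query)    # ['pc', 'cb', 'b1', '12', '26', '60']
print(shared)   # ['12', '26', '60']
```

So on paper the two values share the three bigrams of 1260, which is why I expected at least a partial match.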
Unfortunately, I do not get any results unless I remove the dash. [EDIT - even if I escape the dash correctly and leave it in the query, I still get no results]. This seems odd because I am doing a complete pattern replacement of all non-alphanumeric characters with PatternReplaceCharFilter - which I assume removes all spaces and dashes.
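For clarity, this is a minimal sketch of what I mean by "properly escaped". The field name `name_ngram` is hypothetical, and `escape_lucene` and `build_query_string` are illustrative helpers rather than part of any Solr client. The character set is the list of special characters in the standard Lucene query syntax, and `debugQuery=true` is the standard Solr parameter that echoes the parsed query back in the response.

```python
from urllib.parse import urlencode

# Characters with special meaning in the standard Lucene/Solr query syntax.
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\/&|')

def escape_lucene(term):
    """Backslash-escape every query-syntax special character in a term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)

def build_query_string(field, term):
    """Build URL-encoded request parameters for a single-field query."""
    params = {
        "q": f"{field}:{escape_lucene(term)}",
        "debugQuery": "true",  # ask Solr to include the parsed query in the response
    }
    return urlencode(params)

print(escape_lucene("PCB-1260"))                    # PCB\-1260
print(build_query_string("name_ngram", "PCB-1260"))
```

Comparing the parsed query in the debug output with what the admin analysis page shows should reveal whether the two paths really see the same tokens.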
The query analyzer on the admin page shows the expected match using the schema below - so I am a bit lost. Is there something fundamental about PatternReplaceCharFilter or NGramTokenizerFactory that I'm missing here?
I have checked the code and other posts, but I can't seem to figure this out. After a week of banging my head against the wall, I am handing this over to the stack....
<fieldType name="tokentext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>