Sunspot / Solr: Non Alphabetical Characters

Question


I am using Solr with Sunspot / dismax. Is it possible to search for non-alphabetic characters? I.e.:

~ ! @ # $ % ^ & * ( ) _ + - = [ ] { } | \

I know that + and - need to be escaped, since they are the dismax include / exclude operators. But I get no matches when searching for any of these characters:

 Foo.search { fulltext '=' }.results.length  # => 0
 Foo.search { fulltext '\=' }.results.length # => 0

However, searching for a letter works:

 Foo.search { fulltext 'a'}.results.length # => 30

Here is the tokenizer configuration I'm using:

  <fieldType name="text" class="solr.TextField" omitNorms="false">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>


ruby lucene solr sunspot dismax

George Armhold Jul 11 '12 at 17:47


1 answer

Artur Nowak · Accepted Answer · 2012-07-12T08:29:18+0000

Solr's StandardTokenizer discards all special characters, because it is optimized for plain text. So, for example, '=' will never be found: it is removed from the text at indexing time, and from the query as well when the same analyzer is applied to it.

One tokenizer that preserves all characters is the WhitespaceTokenizer, which splits the input only on whitespace. You need to evaluate whether this is a good solution to your problem, because punctuation stays attached to the adjacent word and the emitted tokens look like this:

A 20 year old fox jumps over a lazy dog. → "A", "20", "year", "old", "fox", "jumps", "over", "a", "lazy", "dog."
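The difference between the two tokenizers can be approximated in plain Ruby. This is only a rough sketch, not Solr's implementation: `standard_like_tokens` and `whitespace_tokens` are hypothetical helper names, and the simple regex stands in for the much more elaborate word-break rules that StandardTokenizer really applies.

```ruby
# Rough stand-in for StandardTokenizer: keep only runs of letters and digits.
# (Assumption: the real tokenizer uses Unicode word-break rules, not this regex.)
def standard_like_tokens(text)
  text.scan(/[[:alnum:]]+/)
end

# Rough stand-in for WhitespaceTokenizer: split on whitespace, keep everything else.
def whitespace_tokens(text)
  text.split(/\s+/).reject(&:empty?)
end

p standard_like_tokens('a = b') # => ["a", "b"]
p whitespace_tokens('a = b')    # => ["a", "=", "b"]
```

With the first approach the '=' never becomes a token, so nothing can match it; with the second it is indexed as a token of its own, but only when it is surrounded by whitespace.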

You may need to provide your own tokenizer (alternatively, you can define a suitable regular expression for the separator characters and use a PatternTokenizer) or add a filter such as WordDelimiterFilter or PatternReplaceFilter.
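To get a feel for what a separator regular expression does before putting it into a PatternTokenizer, the idea can be tried out in plain Ruby. This is a sketch under assumptions: `SEPARATORS` and `pattern_tokens` are hypothetical names, and the chosen separator set (whitespace, commas, periods, semicolons) is only an example that would need adapting to the actual data.

```ruby
# Hypothetical separator pattern: split on whitespace, commas, periods and
# semicolons, but leave symbols such as '=' or '@' inside the tokens.
SEPARATORS = /[\s,.;]+/

# Split the text wherever the separator pattern matches and drop empty pieces.
def pattern_tokens(text, pattern = SEPARATORS)
  text.split(pattern).reject(&:empty?)
end

p pattern_tokens('x = 1, y @ home.')
# => ["x", "=", "1", "y", "@", "home"]
```

Unlike pure whitespace splitting, the trailing period and the comma are dropped here, while '=' and '@' survive as searchable tokens.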
