Sunspot / Solr: Non Alphabetical Characters

I am using Solr with Sunspot / dismax. Is it possible to search for non-alphabetic characters? For example:

~ ! @ # $ % ^ & * ( ) _ + - = [ ] { } | \

I know that + and - have to be escaped, since they act as inclusion / exclusion operators. But I get no matches when searching for any of these characters:

    Foo.search { fulltext '=' }.results.length  # => 0
    Foo.search { fulltext '\=' }.results.length # => 0

An alphabetic query, however, does match:

 Foo.search { fulltext 'a'}.results.length # => 30 

Here is the tokenizer configuration I'm using:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
1 answer

Solr's StandardTokenizer discards special characters because it is designed for plain text. So, for example, '=' will never be found: it is stripped from the text at index time, so it never reaches the index.

One tokenizer that preserves all characters is the WhitespaceTokenizer, which splits the input only on whitespace. You need to evaluate whether this is a good solution to your problem, because it emits tokens like this:

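A 20 year old fox jumps over a lazy dog. → "A", "20", "year", "old", "fox", "jumps", "over", "a", "lazy", "dog."

Note that punctuation stays attached to the neighbouring word: the last token is "dog." with the period, so a search for "dog" will not match it.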
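As a minimal sketch, assuming the `text` field type from the question is the one Sunspot searches, swapping in the whitespace tokenizer might look roughly like this. Dropping `StandardFilterFactory` is an assumption here (it is normally paired with the standard tokenizer). Documents indexed before the change would need to be reindexed for it to take effect.

```xml
<!-- Sketch only: the question's "text" field type with the tokenizer swapped -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- Splits on whitespace only, so tokens such as "=" or "dog." are kept -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because there is a single `<analyzer>` element, the same analysis applies at index and query time, so a query for `=` is tokenized the same way as the indexed text. The dismax parser still treats `+` and `-` as operators, so those remain a separate concern.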

You may need to provide your own tokenizer (alternatively, you can write a regular expression that matches the characters to split on and use a PatternTokenizer), or use a filter such as WordDelimiterFilter or PatternReplaceFilter.
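The PatternTokenizer option could be sketched as follows. The field type name `text_symbols` and the separator pattern are hypothetical, chosen only for illustration: here whitespace plus `,`, `.`, `;` and `:` are treated as separators, so symbols such as `=` or `@` survive as tokens. Adjust the pattern to whichever characters should not be searchable.

```xml
<!-- Sketch only: "text_symbols" and the pattern below are illustrative assumptions -->
<fieldType name="text_symbols" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- group="-1" makes the pattern describe separators, similar to a regex split -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,.;:]+" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Compared with the plain whitespace tokenizer, this strips sentence punctuation from tokens ("dog." becomes "dog"), but every separator has to be listed explicitly.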
