Searching in Solr with special characters

I have a problem searching for special characters in Solr. My documents have a "title" field, and its value can be something like "Titanic - 1999" (it contains a "-" symbol). When I search in Solr with a "-", I get a 400 error. I tried escaping the character, using something like "\-" and "\\-". With this change Solr no longer responds with an error, but it returns 0 results.

How can I search in the solr-admin using this special character (something like "-" or ""


UPDATE: Here you can see my current Solr schema: https://gist.github.com/cpalomaresbazuca/6269375

My search relates to the "Title" field.

Excerpt from schema.xml:

    ...
    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
         removes stop words from case-insensitive "stopwords.txt"
         (empty by default), and down cases. At query time only, it
         also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    ...
    <field name="Title" type="text_general" indexed="true" stored="true"/>
3 answers

You are using the standard text_general field type for the title field. This may not be a good choice: text_general is intended for large fragments of text (or at least whole sentences), not for exact matching of names or titles.

The problem is that text_general uses StandardTokenizerFactory.

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The Solr documentation describes StandardTokenizerFactory as follows:

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types.

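This means that the "-" character is never indexed: the tokenizer treats it only as a boundary between tokens and then discards it.

"kong-fu" is indexed as the two tokens "kong" and "fu". The "-" disappears.

This also explains why select?q=title:\- returns nothing here: the index contains no "-" token to match, and the query analyzer strips the character as well.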

Choose a more suitable field type:

Instead of StandardTokenizerFactory you can use solr.WhitespaceTokenizerFactory, which splits only on whitespace and so allows exact word matching. Creating your own field type for the title field is therefore one solution.

Solr's example schema also includes a minimal field type called text_ws. Depending on your requirements, this may be sufficient.
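To make the difference concrete, here is a small Python sketch that imitates the two tokenizers. It is only an approximation for illustration: the function names are made up, the regular expression is a crude stand-in for StandardTokenizer followed by the lowercase filter, and real Solr analysis handles many more cases.

```python
import re


def standard_like_tokens(text):
    """Crude stand-in for StandardTokenizer + LowerCaseFilter (an assumption,
    not Solr's actual algorithm): keep runs of word characters, drop the rest."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def whitespace_tokens(text):
    """Crude stand-in for WhitespaceTokenizer: split on whitespace only."""
    return text.split()


if __name__ == "__main__":
    for sample in ("Titanic - 1999", "kong-fu"):
        print(sample)
        print("  standard-like:", standard_like_tokens(sample))
        print("  whitespace:   ", whitespace_tokens(sample))
```

In this sketch the whitespace variant keeps "-" as a token of its own for "Titanic - 1999", which is what makes a query for the escaped character able to match at all.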


I spent a lot of time on this. Below are step-by-step instructions for querying special characters in Solr. Hope this helps someone.

  • Edit the schema.xml file and find the solr.TextField field type that you are using.
  • In both analyzers ("index" and "query"), modify the WordDelimiterFilterFactory filter and add types="characters.txt". Something like:

     <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
       <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
       </analyzer>
     </fieldType>
  • Make sure you use WhitespaceTokenizerFactory as the tokenizer, as shown above.

  • The characters.txt file may contain entries like these:

      \# => ALPHA
      @ => ALPHA
      \u0023 => ALPHA

    i.e., each character is mapped to ALPHA only.
  • Clear the existing data, reindex, and query with the characters you mapped. They should now match.


To search for your exact phrase, put it in quotation marks:

 select?q=title:"Titanic - 1999" 

If you just want to find this special character on its own, you will need to escape it:

 select?q=title:\- 
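When the request is built in code rather than typed into the admin UI, both the query-syntax escaping and the URL encoding have to be handled. The Python sketch below shows one way to do that. The helper names, the base URL, and the "Title" field are assumptions for illustration; the character list follows the characters that the Lucene query parser syntax treats as special. Sending the query without URL encoding may be one cause of an HTTP 400, although the question does not confirm that.

```python
import re
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr query syntax.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_solr_term(text):
    """Put a backslash in front of every query-syntax character."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)


def phrase_query(field, phrase):
    """Build field:"phrase"; inside quotes only backslash and quote need escaping."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def select_url(base_url, query):
    """URL-encode the query string so quotes and backslashes survive the request."""
    params = urlencode({"q": query})
    return f"{base_url}/select?{params}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr/collection1"  # assumed Solr location
    print(select_url(base, phrase_query("Title", "Titanic - 1999")))
    print(select_url(base, "Title:" + escape_solr_term("-")))
```

Escaping only gets the query past the parser. Whether the escaped "-" then matches anything still depends on the field's analyzer keeping that character in the index.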

See also the related question: Special characters (- & +, etc.) do not work in the SOLR request

If you know exactly which special characters you want to keep out of your queries, you can add a rule like this to the regex-normalize.xml file:

    <regex>
      <pattern>&#x2D;</pattern>
      <substitution>%2D</substitution>
    </regex>

This replaces every "-" with %2D, so if you search for %2D instead of "-", the search will work.
