Solr Association

For the last couple of days, we have been thinking of using Solr as our search engine. Most of the features we need are out of the box or can be easily configured. However, there is one feature that we absolutely need, which seems to be well hidden (or missing) in Solr.

I will try to explain with an example. We have many documents that are actually enterprises:

    <document>
      <name>Apache</name>
      <cat>1</cat>
      ...
    </document>
    <document>
      <name>McDonalds</name>
      <cat>2</cat>
      ...
    </document>

In addition, we have another xml file with all categories and synonyms:

    <cat id="1">
      <name>software</name>
      <synonym>IT</synonym>
    </cat>
    <cat id="2">
      <name>fast food</name>
      <synonym>restaurant</synonym>
    </cat>

We want to link companies and categories so that we can search by the name and/or synonyms of a category. But we do not want to merge these files at indexing time, because we must be able to update the categories (adding or correcting synonyms, for example) without re-indexing all enterprises.

Is there anything in Solr that handles this kind of association, or do we need to develop something custom?

All feedback and suggestions are welcome.

Thanks in advance,
Tom

search-engine lucene solr
4 answers

Basically, this is a design question. The usual approach with Solr indices is to denormalize them, that is, to expand the category definition into each business document. Since you do not want to do this, I suggest storing two types of documents: one for enterprises and one for categories. You can store both in the same index, since Solr does not require all documents to have the same fields. The business documents are straightforward, but you must make them searchable by company name and by category identifier. For the categories, I suggest creating one category document per synonym, so that you can search by synonym and find the identifier (and the category name).

To search using synonyms you will need two searches:

  • Search for category identifier using name text.
  • Search for businesses using the category identifier.
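
As a rough sketch of this layout, the two document types might look like the following in Solr's XML update format. The field names `id`, `type`, `name`, `cat`, and `synonym`, and the `id` values, are assumptions made for illustration; they are not taken from the question and would have to be declared in your schema.

```xml
<add>
  <!-- Enterprise document: stores only the category identifier -->
  <doc>
    <field name="id">ent-2</field>
    <field name="type">enterprise</field>
    <field name="name">McDonalds</field>
    <field name="cat">2</field>
  </doc>
  <!-- One category document per synonym (hypothetical layout) -->
  <doc>
    <field name="id">cat-2-fast-food</field>
    <field name="type">category</field>
    <field name="cat">2</field>
    <field name="name">fast food</field>
    <field name="synonym">fast food</field>
  </doc>
  <doc>
    <field name="id">cat-2-restaurant</field>
    <field name="type">category</field>
    <field name="cat">2</field>
    <field name="name">fast food</field>
    <field name="synonym">restaurant</field>
  </doc>
</add>
```

With documents shaped like this, the first search could be something like `q=synonym:restaurant&fq=type:category`, and the second `q=cat:2&fq=type:enterprise`. Changing a synonym would then only require re-indexing the small category documents.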

Actually, there is a filter class for this: solr.SynonymFilterFactory.

It should allow you to map the category names and synonyms to the category numbers if you use it only in the query analyzer, something like the following:

    <fieldType name="category" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="category_Synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This way you can index ONLY the category identifier, which means you will not have to send all companies to Solr again when a synonym changes. Also, if someone searches for "software" or "IT", the query is mapped to the matching category identifier.

Your category_Synonyms.txt should have lines like:

1, software, IT

The only catch I see here is that you have to come up with a way to edit the text file whenever names or synonyms change. So I think this will work well if you change category names infrequently, unless someone else knows an easier way to do this.

I added the above to my own Solr instance and ran the Analysis tool. Here is the result:

[Image: Analysis tool output]

As you can see, this has turned "software" into 1.

Please note that you MUST set expand to false.

Hope this helps.

Dave


You cannot find information that was never indexed unless you implement some kind of query translation/expansion that maps query terms to their indexed equivalents before the request is sent.

So, if the user searches for "restaurant", your query is rewritten to include a filter on cat=2.

As far as I know, Solr does not include this feature, so you would have to implement it yourself or adapt a suitable module (for example, http://lucene-qe.sourceforge.net/ ).


In addition to some of the excellent ideas suggested earlier, you could also look at multi-valued fields. That way your category field can contain any number of values (and be updated if necessary), and a search on the field is matched against all of its values.
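
A minimal sketch of what this could look like, assuming a hypothetical field named `cat_terms` and an existing text field type called `text` in your schema (neither name comes from the question):

```xml
<!-- schema.xml: a multi-valued field holding category terms (hypothetical name) -->
<field name="cat_terms" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- Update message: repeat the field once per value -->
<add>
  <doc>
    <field name="id">ent-2</field>
    <field name="name">McDonalds</field>
    <field name="cat_terms">fast food</field>
    <field name="cat_terms">restaurant</field>
  </doc>
</add>
```

A query such as `q=cat_terms:restaurant` would then match this document. Note that this approach stores the category terms on each enterprise document, so a changed synonym would typically mean updating every enterprise in that category.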
