Solr does not search specific fields

I just installed Solr, edited schema.xml, and am now trying to index some test data and search it.

In the XML file that I submit to Solr, one of my fields looks like this:

 <field name="PageContent"><![CDATA[<p>some text in a paragrah tag</p>]]></field> 

There is HTML there, so I wrapped it in CDATA.

In my Solr schema.xml, the definition for this field is as follows:

 <field name="PageContent" type="text" indexed="true" stored="true"/> 

When I ran the post tool, everything went fine, but when I search for content that I know is inside the PageContent field, I get no results.

However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it does not search in PageContent.

Am I doing something wrong? What is the problem?


To clarify the problem:

I posted a "doc" with the following data:

 <field name="PageID">928</field>
 <field name="PageName">some name</field>
 <field name="PageContent"><![CDATA[<p>html content</p>]]></field>

In my schema, I defined the fields as such:

 <field name="PageID" type="integer" indexed="true" stored="true" required="true"/>
 <field name="PageName" type="text" indexed="true" stored="true"/>
 <field name="PageContent" type="text" indexed="true" stored="true"/>

and

 <uniqueKey>PageID</uniqueKey>
 <defaultSearchField>PageName</defaultSearchField>

Now when I use the Solr administration tool and search for "some name", I get the result. But if I search for "html content", "html", "content" or "928", I get no results.

Why?

5 answers

Since you mentioned that the default search field is set to PageName, I would not expect a search for "content" to return anything.

You probably want to put "PageContent:content" in the search box to find data in that field. If you want to search across multiple fields, take a look at the DisMax request handler: http://wiki.apache.org/solr/DisMaxRequestHandler . The Solr admin console does not expose all of the DisMax search options, so you will need to edit the URL directly for this.
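As a rough illustration of the fielded query above, here is a small sketch that builds the select URL. The host, port and /solr/select path are assumptions based on a default local install, and fielded_query_url is just a hypothetical helper name:

```python
from urllib.parse import urlencode

# Assumed location of a default local Solr install; adjust for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def fielded_query_url(field, term, base_url=SOLR_SELECT_URL):
    """Build a select URL that searches one named field instead of the default one."""
    # "field:term" is the Lucene query syntax for restricting a term to a field.
    params = {"q": f"{field}:{term}"}
    return f"{base_url}?{urlencode(params)}"


print(fielded_query_url("PageContent", "content"))
```

Note that there is no space after the colon; with a space, the term would be looked up in the default search field again.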

While I agree with the previous poster, if your analysis chain is not configured to deal with HTML, you are likely to get all kinds of unexpected search results. Strip the HTML and index only the text.
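If you would rather strip the markup in your application before posting to Solr, something along these lines may be enough. This is only a sketch using Python's standard library; TextExtractor and strip_html are made-up names, and it keeps text only, with no special handling for script or style elements:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together,
    # then collapse any runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>html content</p>"))
```

Joining the pieces with a space means a word that is split by inline tags would become several tokens, which is usually an acceptable trade-off for indexing.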

If you want your standard request handler to search all your fields, you can change it in your solrconfig.xml file (I always add a second request handler instead of changing the "standard" one). The qf parameter is a space-separated list of the fields you want to search.

 <requestHandler name="standard" class="solr.DisMaxRequestHandler">
   <lst name="defaults">
     <str name="echoParams">all</str>
     <str name="hl">true</str>
     <str name="fl">*</str>
     <str name="qf">PageName PageContent</str>
   </lst>
 </requestHandler>
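The same qf list can also be passed on a single request instead of being set as a handler default. The sketch below assumes a Solr version that accepts the defType=dismax parameter (older setups may select a dismax handler with qt instead) and a default local URL; dismax_query_url is a hypothetical helper:

```python
from urllib.parse import urlencode

# Assumed location of a default local Solr install; adjust for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def dismax_query_url(text, fields, base_url=SOLR_SELECT_URL):
    """Build a dismax select URL that searches the given fields for the text."""
    params = {
        "q": text,
        "defType": "dismax",      # assumption: this Solr version supports defType
        "qf": " ".join(fields),   # qf is a space-separated list of fields
    }
    return f"{base_url}?{urlencode(params)}"


print(dismax_query_url("content", ["PageName", "PageContent"]))
```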

Are you sure your data was committed before you tried to search it?

Also, if you are indexing raw HTML, it is probably best to strip the HTML tags. You can do this in your application or with Solr's solr.HTMLStripWhitespaceTokenizerFactory, for example:
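For what it's worth, a commit can be issued by posting a <commit/> message to the update handler. The sketch below only builds the request; the URL is an assumption for a default local install, and build_commit_request is a hypothetical helper:

```python
import urllib.request

# Assumed location of the update handler on a default local install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying an XML commit message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


request = build_commit_request()
print(request.get_method(), request.full_url)

# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Until a commit happens, newly posted documents are not visible to searches, so this is worth ruling out first.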

 <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/> 

which you declare in the definition of your text field. You might want to create a new field type just for your HTML, maybe something like text_html, and you can use it like this:

 <fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>

I'm not sure what you mean by this:

"However, when I set the <defaultSearchField> node to PageContent, it works. But if I set it to any other field, it does not search in PageContent."

Could you clarify?


fl is the list of fields returned in the response. qf is the list of fields you want to search, and it does not support wildcards.

The only way to search all fields without listing them is to have a copyField destination that catches all values (indexed only, not stored), and then simulate a search across all fields by searching that one field.


In my schema.xml, I have something like the following, which copies the value of each field ending in _t into the "text" field.

 <defaultSearchField>text</defaultSearchField>
 <copyField source="*_t" dest="text" maxChars="3000"/>

The fl parameter does not specify the fields to query; it specifies the fields returned in the response.

You can simply add to schema.xml:

 <field name="fieldContainingEverything" type="text" indexed="true" stored="true" multiValued="true" />
 <defaultSearchField>fieldContainingEverything</defaultSearchField>
 <copyField source="*" dest="fieldContainingEverything" maxChars="3000"/>

Now, when indexing, each field is copied to fieldContainingEverything. The problem here is that you lose the information about which field the content came from, in case you want to evaluate that in more detail. I would be glad if someone had an idea about this.


I found a somewhat functional solution:

To describe the scenario in more detail: I have a MySQL database table with a large number of fields to index, and I do this by simply importing every field without naming each one (SELECT * FROM ...). I want to query the index across every field in the table and find out which field matched the query. This is not possible out of the box, because the highlighter simply tells you that the matching field is fieldContainingEverything. Using the dismax request handler, I found that, although it is said to search every field, it does not seem to search fields that are not listed in the qf parameter. The idea now is to additionally index each field by adding:

 <dynamicField name="*" type="string" indexed="true" stored="true"/> 

to your schema.xml. Now, when you query Solr via dismax with hl=true&hl.fl=*, you add qf=fieldContainingEverything^1 to the parameter list. Solr now searches the content of every indexed field, and also highlights each field that contains the query. The obvious disadvantage of this method is the increased index size, which I assume should not be significant in most cases.
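To make the parameter list concrete, here is a sketch of how such a request URL could be assembled. The base URL and the defType=dismax parameter are assumptions about the setup (a dismax handler could also be selected some other way), and highlight_all_fields_url is a hypothetical helper:

```python
from urllib.parse import urlencode

# Assumed location of a default local Solr install; adjust for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def highlight_all_fields_url(text, base_url=SOLR_SELECT_URL):
    """Search the catch-all field and request highlighting on every field."""
    params = {
        "q": text,
        "defType": "dismax",                    # assumption: dismax selected via defType
        "qf": "fieldContainingEverything^1",    # search only the catch-all field
        "hl": "true",                           # turn highlighting on
        "hl.fl": "*",                           # ask for highlights from every field
    }
    return f"{base_url}?{urlencode(params)}"


print(highlight_all_fields_url("html"))
```

The highlighting section of the response should then list the individual fields whose content matched, which is what recovers the "which field did this come from" information.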
