Indexing and Querying URLs in Solr

I have a database of URLs that I would like to search. Since URLs are not always written the same way (they may or may not have www), I'm looking for the right way to index and query URLs. I have tried several things and I think I'm close, but I'm not sure why this is not working:

Here is my custom field type:

<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

When indexed, http://www.twitter.com/AndersonCooper produces the following tokens at separate positions: http, www, twitter, com, andersoncooper

If I search for just twitter.com/andersoncooper, I would like the query to match the indexed document, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the parsed query ends up like this:

myfield:("twitter com andersoncooper") when I really want it to match all documents that contain all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer that I should use?

+7
3 answers

This should be the easiest solution:

 <field name="iconUrl" type="string" indexed="true" stored="true" /> 

But for your needs you will need to make it multivalued and index each URL three ways: 1. unchanged 2. without http 3. without www

or make the URL searchable with a leading wildcard (which is slower, I think).
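
The multivalued variant could look roughly like the sketch below. It assumes the schema has an id field as its unique key and that the client generates the three variants itself before sending the document; the document id shown is hypothetical.

```xml
<!-- schema.xml: the same string field, now allowed to hold several values -->
<field name="iconUrl" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Update message: one document carrying all three variants of the URL -->
<add>
  <doc>
    <field name="id">1</field> <!-- hypothetical unique key value -->
    <field name="iconUrl">http://www.twitter.com/AndersonCooper</field> <!-- 1. unchanged -->
    <field name="iconUrl">www.twitter.com/AndersonCooper</field>        <!-- 2. without http -->
    <field name="iconUrl">twitter.com/AndersonCooper</field>            <!-- 3. without www -->
  </doc>
</add>
```

Because string fields are matched exactly and are case-sensitive, a query for twitter.com/andersoncooper would still miss this document unless the variants are lowercased before indexing.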

0

If I understand this statement from your question:

myfield:("twitter com andersoncooper") when I really want it to match all documents that contain all of the following separate words: twitter com andersoncooper

You are trying to write a query that matches both:

 http://www.twitter.com/AndersonCooper 

and

 http://www.andersoncooper.com/socialmedia/twitter 

(both URLs contain all of the tokens), but does not match either

 http://www.facebook.com/AndersonCooper 

or

 http://www.twitter.com/AliceCooper 

If this is correct, your existing configuration should work fine. Assuming you are using the standard query parser and querying through curl or some other URL-based mechanism, the query parameter needs to look like this:

 &q=myField:andersoncooper AND myField:twitter AND myField:com 

One of the gotchas that may have been tripping you up is that the default query operator (applied between the terms in the query) is "OR", which is why the ANDs have to be stated explicitly above. Alternatively, to save some space, you can change the default query operator to AND like this:

 &q.op=AND&q=myField:(andersoncooper twitter com) 
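
If every query against this index should use AND, the default can also be set once in schema.xml instead of on each request. The fragment below is a sketch that assumes an older Solr release; the element was deprecated in later versions and removed in Solr 7, where q.op on the request is the way to do it.

```xml
<!-- schema.xml (older Solr releases only): make AND the default operator
     between query terms, so q=myField:(andersoncooper twitter com)
     requires all three terms without passing q.op -->
<solrQueryParser defaultOperator="AND"/>
```
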
0

You can try the keyword tokenizer.

From the Solr 1.4 Enterprise Search Server book published by Packt:

KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that, due to sorting or faceting requirements, you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.
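
As a rough illustration of what the passage describes, a field type built on the keyword tokenizer with lowercasing might be declared as follows. The type name urlKeywordType is hypothetical and not taken from the book or the question.

```xml
<!-- Sketch: the whole URL is kept as a single term, then lowercased -->
<fieldType name="urlKeywordType" class="solr.TextField">
  <analyzer>
    <!-- Emits the entire input as one token, with no splitting -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Makes matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this type the indexed term is the full lowercased URL, so a query for twitter.com/andersoncooper would only match http://www.twitter.com/AndersonCooper through a wildcard or after the prefix has been stripped some other way.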

-1
