I have a database of URLs that I would like to search. Since URLs are not always written the same way (they may or may not include www), I'm looking for the right way to analyze both the indexed URLs and the queries. I have tried several things and I think I'm close, but I'm not sure why this is not working.
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
For example:
When indexed, http://www.twitter.com/AndersonCooper produces the following tokens, each at its own position: http, www, twitter, com, andersoncooper
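To make the expected splitting concrete, here is a minimal Python sketch that roughly imitates the chain above (whitespace tokenizing, splitting on delimiters, lowercasing). It is only an approximation for plain ASCII input, not Solr's actual implementation: the function name `analyze` is made up for illustration, and it ignores `preserveOriginal` and token positions.

```python
import re

def analyze(text):
    """Rough stand-in for whitespace tokenizer + word delimiter filter + lowercase filter."""
    tokens = []
    for raw in text.split():  # whitespace tokenizing
        # Split on anything that is not a letter or digit, and between letters and digits.
        # Case changes do not split, mirroring splitOnCaseChange="0".
        parts = re.findall(r"[A-Za-z]+|[0-9]+", raw)
        tokens.extend(part.lower() for part in parts)  # lowercasing
    return tokens

print(analyze("http://www.twitter.com/AndersonCooper"))
print(analyze("twitter.com/andersoncooper"))
```

Under these assumptions the second call yields a subset of the tokens produced by the first, which is why I expect the shorter URL to be able to match the indexed one.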
If I search for just twitter.com/andersoncooper, I would like the query to match the indexed entry, so I also run the query through WDF to split it. However, the parsed query ends up like this:
myfield:("twitter com andersoncooper")
when I really want it to match all records that contain all of the following separate words: twitter com andersoncooper
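The difference between the two behaviours can be sketched in Python as follows. This is a simplified model, not how Solr evaluates queries internally: `phrase_match` and `all_terms_match` are hypothetical helper names, the token lists are written out by hand, and the reordered query is an invented example.

```python
def phrase_match(doc_tokens, query_tokens):
    """True if the query tokens occur in the document consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

def all_terms_match(doc_tokens, query_tokens):
    """True if every query token occurs somewhere in the document, in any order."""
    return set(query_tokens) <= set(doc_tokens)

doc = ["http", "www", "twitter", "com", "andersoncooper"]
query = ["twitter", "com", "andersoncooper"]
reordered = ["andersoncooper", "twitter"]  # hypothetical query, for contrast only

print(phrase_match(doc, query), all_terms_match(doc, query))
print(phrase_match(doc, reordered), all_terms_match(doc, reordered))
```

The phrase form requires adjacency and order, while the all-terms form only requires that each word is present, which is the behaviour I am after.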
Is there a different query filter or tokenizer that I should use?
KidA78