Find terms in the index that are a prefix of the search query, or vice versa

I would like Lucene to find a document containing the term "bahnhofstr" if I search for "bahnhofstrasse". That is, I not only want to find documents containing terms of which my search term is a prefix, but also documents containing terms that are themselves a prefix of my search query.

How can I do this?

2 answers

If I understood correctly, and your search string is an exact string, you can call queryParser.setAllowLeadingWildcard(true) in Lucene to allow leading-wildcard searches. These may or may not be slow; I have seen them run fast enough, but only on an index of 60,000+ Lucene documents.

An example query syntax might look something like this:

 *bahnhofstr bahnhofstr* 

or perhaps (did not check this):

 *bahnhofstr* 

I think a fuzzy query may be the most useful thing for you. It scores terms based on their Levenshtein distance from your query. Without a minimum similarity specified, it will effectively match every available term. That may make it less efficient, but it does what you are looking for.

A fuzzy query is indicated by the ~ character, for example:

 firstname:bahnhofstr~ 

Or with a minimum similarity (a number between 0 and 1, where 0 is the loosest, i.e. no minimum):

 firstname:bahnhofstr~0.4 

Or, if you build your queries programmatically, use FuzzyQuery.

This is not exactly what you specified, but it is the easiest way to get close.
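
To get a feel for how far apart the two terms from the question are, here is a minimal plain-Java sketch of the Levenshtein distance that fuzzy matching is based on. EditDistance is a hypothetical helper written for illustration, not a Lucene class, and Lucene's own scoring is more involved than this.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class EditDistance {

    /** Classic dynamic-programming Levenshtein distance between two strings. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The four trailing characters "asse" must be inserted, so this prints 4.
        System.out.println(levenshtein("bahnhofstr", "bahnhofstrasse"));
    }
}
```

A distance of 4 is fairly large for a fuzzy match. As far as I know, newer Lucene versions cap FuzzyQuery at two edits, so whether this pair matches will depend on the Lucene version and the similarity setting you use.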

As for doing exactly what you are looking for, I don't know of a simple Lucene call that does it. I would probably just split the term into a series of term queries, which you could represent in a query string something like:

 firstname:b firstname:ba firstname:bah firstname:bahn firstname:bahnh firstname:bahnho firstname:bahnhof firstname:bahnhofs firstname:bahnhofst firstname:bahnhofstr* 

I would not actually create a query string for it. I would just construct the TermQuery and PrefixQuery objects myself.
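
The expansion above can be generated mechanically. The following is a rough plain-Java sketch that only builds the clause strings, so it runs without Lucene on the classpath. PrefixExpansion, clauses and matches are hypothetical names; in real code each exact clause would become a TermQuery and the last one a PrefixQuery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class PrefixExpansion {

    /**
     * One exact clause per proper prefix of the term, plus a trailing
     * wildcard clause for the full term.
     */
    static List<String> clauses(String field, String term) {
        List<String> out = new ArrayList<>();
        for (int len = 1; len < term.length(); len++) {
            out.add(field + ":" + term.substring(0, len));
        }
        if (!term.isEmpty()) {
            out.add(field + ":" + term + "*");
        }
        return out;
    }

    /** The matching rule from the question: either string is a prefix of the other. */
    static boolean matches(String indexedTerm, String queryTerm) {
        return indexedTerm.startsWith(queryTerm) || queryTerm.startsWith(indexedTerm);
    }

    public static void main(String[] args) {
        // Prints the same clause list as the query string shown above.
        System.out.println(String.join(" ", clauses("firstname", "bahnhofstr")));
        // true: the indexed term is a prefix of the query term.
        System.out.println(matches("bahnhofstr", "bahnhofstrasse"));
    }
}
```

Expanding "bahnhofstrasse" this way produces an exact clause for "bahnhofstr", which is what lets the shorter indexed term match. In practice you would probably skip the very short prefixes, since a clause like firstname:b matches very little that is useful.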

The scoring will be a bit skewed, and I would probably boost the longer queries to get a better ordering, but this is the approach that comes to mind for achieving exactly what you are looking for, and it is fairly easy. A DisjunctionMaxQuery would help you combine something like this with other terms and get more reasonable scores.

I hope the fuzzy query works well for you. It seems like a much nicer solution.

Another option, if you have a heavy need for queries of this nature, might be to tokenize the fields into n-grams at index time (see NGramTokenizer), which would let you use NGramPhraseQuery effectively to get the results you are after.
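
To illustrate what n-gram tokenization does to a term, here is a small plain-Java sketch that lists the character n-grams of a string. CharNGrams is a hypothetical helper, not Lucene's NGramTokenizer, and the order in which Lucene emits its grams may differ from this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not Lucene's NGramTokenizer.
final class CharNGrams {

    /** All substrings of text whose length is between minGram and maxGram inclusive. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [ba, bah, ah, ahn, hn]
        System.out.println(ngrams("bahn", 2, 3));
    }
}
```

Because "bahnhofstr" and "bahnhofstrasse" share every gram of the shorter term, an index built this way can match them in either direction, at the cost of a larger index.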
