Lucene query behavior analysis - combining query parts with AND

Say we have a Lucene index with several documents indexed using StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:

  • foo:bar
  • baz:"there is"

Suppose the first query yields some results because there are documents matching this query.

The second query yields 0 results. The reason is that when baz:"there is" is analyzed, it ends up as an empty query, since there and is are stop words (technically, it is converted to an empty BooleanQuery with no clauses). So far so good.
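To make this concrete, here is a minimal sketch (the class name and the default field "foo" are placeholders; the comments reflect the behavior described above) of parsing both queries with a stop-word analyzer in Lucene 3.0:

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class StopWordQueryDemo {
        public static void main(String[] args) throws Exception {
            // StopAnalyzer(Version) uses StopAnalyzer.ENGLISH_STOP_WORDS_SET by default
            QueryParser parser = new QueryParser(Version.LUCENE_30, "foo",
                    new StopAnalyzer(Version.LUCENE_30));

            Query q1 = parser.parse("foo:bar");
            Query q2 = parser.parse("baz:\"there is\"");

            System.out.println(q1);   // foo:bar
            System.out.println(q2);   // prints an empty string: all phrase terms were stop words
            // q2 is a BooleanQuery with no clauses, so it can never match anything
            System.out.println(((BooleanQuery) q2).clauses().isEmpty());   // true
        }
    }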

However, any of the following combined queries

  • +foo:bar +baz:"there is"
  • foo:bar AND baz:"there is"

behaves exactly the same as the query +foo:bar, i.e. it returns results, despite the second part of the AND producing no results on its own.
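Continuing the hypothetical parser from the sketch above, the combined query illustrates the collapse:

    // Continuing the parser from the earlier sketch:
    Query combined = parser.parse("+foo:bar +baz:\"there is\"");
    // The phrase clause is dropped after analysis, so only the first clause remains
    System.out.println(combined);   // +foo:bar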

One would expect that with AND both conditions must be met, but here they are not.

This seems inconsistent, because the same atomic query component has a different effect on the overall query depending on the context. Is there a logical explanation for this? Can it be solved in some way, preferably without writing a custom QueryParser? Could this be classified as a Lucene bug?

If that matters, the observed behavior occurs in Lucene v3.0.2.

This question was also posted on the Lucene Java user mailing list; there have been no answers yet.

3 answers

Erick Erickson answered part of this question on the Lucene mailing list:

But consider the effect on the query. If all the stop words get removed, then the query would never match anything, which would be very counterintuitive IMO. Your users have no clue that you removed the stop words, so they will sit there saying "Look, I KNOW 'bar' was in foo, and I KNOW 'there is' was in the document, so why the heck didn't the darn system find my doc?"

So it seems the only sensible way out is to stop using stop words, or to reduce the stop word set.
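As a sketch of that suggestion (assuming you reindex and search with the same analyzer; the class name here is made up), StandardAnalyzer in 3.0 accepts an explicit stop set, so passing an empty set keeps words like "there" and "is":

    import java.util.Collections;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class NoStopWordsDemo {
        public static void main(String[] args) throws Exception {
            // An empty stop set: "there" and "is" survive analysis.
            // The same analyzer must also be used at indexing time.
            StandardAnalyzer analyzer =
                    new StandardAnalyzer(Version.LUCENE_30, Collections.emptySet());
            QueryParser parser = new QueryParser(Version.LUCENE_30, "foo", analyzer);

            Query q = parser.parse("baz:\"there is\"");
            System.out.println(q);   // baz:"there is" -- now a real PhraseQuery, not empty
        }
    }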


I would suggest not using StopAnalyzer if you want to be able to find phrases made up of stop words, like "there is". StopAnalyzer is essentially a lossy optimization, and unless you are indexing huge amounts of text, it is probably not worth it.


I think this is fine. You could define the result of an empty query to be the entire collection of documents, but for practical reasons that result is omitted. So basically you are ANDing with a superset, not with an empty set.

Edit: You can think of it as additional keywords refining the result set. This makes the most sense if you consider prefix search: the shorter your prefix, the more matches. The most extreme case is an empty query matching the entire collection of documents.
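As an illustration of that argument (not something this answer itself proposes), Lucene even exposes the "entire collection" result explicitly as MatchAllDocsQuery, and ANDing it with another query leaves that query's result set unchanged:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TermQuery;

    public class SupersetAndDemo {
        public static void main(String[] args) {
            // ANDing foo:bar with the "match everything" superset is equivalent to foo:bar alone,
            // which mirrors how the combined query behaves once the stop-word clause is dropped.
            BooleanQuery combined = new BooleanQuery();
            combined.add(new TermQuery(new Term("foo", "bar")), BooleanClause.Occur.MUST);
            combined.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
            System.out.println(combined);   // +foo:bar +*:*
        }
    }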

