Elasticsearch: filter order for better performance

The Elasticsearch manual says:

"Each filter is computed and cached independently, regardless of where it is used. If two different requests use the same filter, the same set of filter bits will be reused. Similarly, if one request uses the same filter in several places, only one bit set is computed and then reused." (https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html)

On another page, it also says:

"The order of filters in a bool clause is important for performance. More specific filters should be placed before less specific filters, in order to exclude as many documents as possible, as early as possible. If clause A could match 10 million documents, and clause B could match only 100 documents, then clause B should be placed before clause A." (https://www.elastic.co/guide/en/elasticsearch/guide/current/_filter_order.html)

I don’t quite understand how the order of the filters in the bool clause can matter for performance when each filter is computed and cached independently.

I would expect that clause B is executed or retrieved from the cache, clause A is executed or retrieved from the cache, and then the two sets of filter bits are intersected. Why should the order matter?
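For concreteness, here is the kind of 1.x-style filtered query the question is about, built as a Python dict (field names and clause contents are made up for illustration; the structure follows the quoted guide pages):

```python
# Clause A is broad (say ~10M matches), clause B is selective (~100 matches).
# Per the guide's advice, B is listed before A in the bool filter.
clause_a = {"range": {"timestamp": {"gt": "2015-01-01"}}}  # broad
clause_b = {"term": {"tag": "elasticsearch"}}              # selective

query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"bool": {"must": [clause_b, clause_a]}},
        }
    }
}

# The puzzle: since each leaf filter's bitset is cached independently,
# why would [clause_b, clause_a] perform differently from [clause_a, clause_b]?
```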

+6
4 answers

That part of the guide is a little misleading. It is more complicated than that, and it is very difficult to write one set of rules that fits all situations. As the data changes, the rules change. As the mix of queries and filters changes, the rules change. A narrow, specific filter may execute more slowly than a broad one; the rules change. On one segment, the relative sizes of two filters' results may be the opposite of what they are on another segment; it is not always predictable. So first you need to understand more of the internals, and then you need to let go of trying to control all this once you move to modern Elasticsearch 2.x.

NOTE: the second quote (on filter ordering) links to a page that is considered outdated for Elasticsearch 2.x and will be updated later. So that advice may or may not apply going forward.

Looking back at Elasticsearch 1.x, and the reasoning behind the advice:

First, consider how filters are represented in memory. They are either an iterable list of matching documents, or a random-access "is this document a match" model. Which representation is more efficient depends on the type of filter. If everything is cached, you are just intersecting the results, and the cost depends only on their sizes and types.
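A rough sketch of those two representations and how each is consumed (simplified to plain Python lists and sets; real implementations use postings iterators and bitsets):

```python
# Two ways a filter result can be represented in memory (simplified):
# 1) an iterator over a sorted list of matching doc IDs, and
# 2) a random-access "does doc X match?" structure.

matching_ids = [3, 17, 42, 99]    # iterator-style: sorted doc IDs
matching_set = set(matching_ids)  # random-access-style: O(1) membership test

# Iterating is natural when this filter drives the evaluation:
driven = [doc for doc in matching_ids]

# Random access is natural when some OTHER clause drives the iteration
# and this filter is only consulted per candidate document:
driver = [10, 17, 40, 42]  # doc IDs produced by another clause
survivors = [doc for doc in driver if doc in matching_set]
```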

If a filter is cached, it was already executed independently, and the filters before it affect only the total cost of the intersection.

If a filter is NOT cached, then it MAY be guided by previous results. Imagine a Query plus a Filter. If you run the query and only afterwards apply the filter, you do a lot of extra work whenever the filter restricts the result to a very small set of records: you spent time querying, collecting, and scoring, and generally built up a large result set for nothing. But if you convert this to a FilteredQuery and execute both at the same time, the Query ignores all records already eliminated by the Filter. It only has to consider documents that are still in play. This is called "skipping". Not all filter types support skipping, but some do. And that is why a smaller "leading" filter makes the others run faster.

Unless you know every filter type, the heuristics of your data, and how each specific filter is affected by all of the above, you simply do not have enough information to do anything other than "put the most restrictive filters first and the larger ones second" and hope it works out. A bool filter does not cache its overall result by default, so you also need to pay attention to its repeated execution cost (and/or cache it). Intersections are cheaper when one side is small; starting with a small set makes every further intersection fast, because the result can only shrink. If this were a bool query instead of a filter, which also does scoring, it would be even more important to avoid scoring more documents than necessary.

Another important note: the "most specific filter first" can sometimes itself be slow to execute (a script filter, for example), so the rule should really read: "cheapest, most specific filters first".
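Put together, the 1.x-era ordering heuristic looks like this (a sketch with hypothetical fields; the point is the relative placement, not the exact clauses):

```python
# "Cheapest, most specific filters first": cheap cacheable filters go
# before an expensive script filter, so the script is only evaluated
# against documents that survived the cheap clauses.
filters_in_order = [
    {"term": {"status": "published"}},                      # cheap, selective
    {"range": {"views": {"gte": 1000}}},                    # cheap, broader
    {"script": {"script": "doc['views'].value % 7 == 0"}},  # expensive: last
]
bool_filter = {"bool": {"must": filters_in_order}}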

With Elasticsearch 2.0, everything will change:

It's time to forget everything you knew about queries and filters: Elasticsearch 2.0 will make much better decisions by itself, rather than relying on users to formulate an optimized query.

In 2.x, you should try to game the system less and let the engine make the best choices. Under the hood, the engine may do something completely different: rewrite a filter, or change the internal structures and data entirely. And you can no longer control caching directly. So it helps to know more about what it is doing.

The previous filter API could be used in two ways: either by iterating over matching documents, or by using an optional random-access API that let you check whether a particular document matched the filter or not. All fine, except that the best way to consume a filter depended on which filter you had: for example, the script filter was more efficient through the random-access API, while the bool filter was more efficient through the iterator API. This was quite a nightmare to optimize, and it was the main reason why the bool filter on the one hand, and the and and or filters on the other hand, performed differently.

Now the engine decides what is best on its own, taking more factors into account: scoring, estimated result sizes, the best way to intersect the related filters, possibly even on a per-segment basis, and so on.

The same article also makes it clear that even caching can be misleading: it does not always speed things up. Sometimes a filter's own internal data structure is better to use directly than the bitset that used to be cached unconditionally. So 2.x changes to avoid caching things that run faster from their own data structure than from a cache at all.

The Roaring Bitmaps blog post details:

Obviously, the most important requirement is to have something fast: if your cached filter is slower than re-executing the filter, it would not only consume memory but also make your queries slower. The more sophisticated the encoding, the more likely it is to slow down encoding and decoding because of increased CPU usage.

That post gives a lot of information about the internal data structures, caching, intersections and the rest of the internal changes in 2.x, which will help you better understand your filters' characteristics.

Although it may surprise you if you are new to search engines, one of the most important building blocks of a search engine is the ability to efficiently compress, and quickly decode, sorted lists of integers.
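As a rough illustration of that last point, here is a delta-plus-varint encoding of a sorted doc-ID list, the classic scheme for posting lists (a simplified sketch, not Elasticsearch's or Lucene's actual codec):

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids):
    """Delta-encode a sorted doc-ID list, then varint-encode the gaps."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        out += varint_encode(doc - prev)  # gaps are small, so few bytes
        prev = doc
    return bytes(out)

ids = [3, 17, 42, 99, 100000]
blob = compress_postings(ids)  # 7 bytes vs. 20 for four-byte raw IDs
```

Because the gaps between sorted IDs are small, most encode as a single byte, which is why sorted order is so valuable to a search engine.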

These last few 2.x blog links contain a lot of information bearing on your question; they discuss exactly the problems you are trying to work around with filter ordering. The information and details are all there, and you can better understand 1.x vs 2.x and how queries + filters are resolved. So remember:

There is no single implementation that is consistently better than all the others.

See also these 1.x resources for historical reference:

  • Search Engine Optimization for Elasticsearch describes filter ordering in more detail. It says:

    However, you still need to think about the order in which you filter. You want the more selective filters to run first. Say you filter on type: book and tag: elasticsearch. If you have 30 million documents, 10 million of type book, and only 10 tagged elasticsearch, you will want to apply the tag filter first. It reduces the number of documents much more than the book filter does.

  • All About Elasticsearch Filter Bitsets is considered outdated for modern versions, but gives more information on filter ordering.

  • A forum response from Martijn van Groningen seems to suggest the opposite regarding bool vs. and/or filters and which of them use iteration vs. random access, but the idea is the same for all of them: to be safe, restrict the document set as early as possible in the filter list, regardless of how the model differs from one filter type to another.

+11

Not all filters are cached. For example, a date range filter that uses the now variable is not cached, because it changes all the time. If you look a little further down the first link you provided, you will see a section called "Controlling Caching" that says this:

Some leaf filters, however, are not cached by default, because it does not make sense to do this: script filters, geo filters, date range filters.

To illustrate this, say we have the following date range filter (let's call it filter A), which matches all documents from the last month:

 "range" : { "timestamp" : { "gt" : "now-1M" } } 

and another term filter (let's call it filter B), which matches documents of type XYZ:

 "term" : { "type" : "XYZ" } 

It makes a big difference (performance-wise) whether you place

  • filter A before filter B or
  • filter B before filter A

In case 1, execution will be slower, because every document must first go through filter A, which is not cached.

In case 2, you first filter out all documents that are not of type XYZ, which is fast because filter B is cached. Then the documents that made it through filter B go through filter A. So even though filter A is not cached, execution is still faster, because far fewer documents are left in the filter pipeline.

This was a very simple example, but it should show why filter order matters: mainly because some filters are not cached. You can change that default behavior by forcing caching, but sometimes that is not a good idea. The best practice is to apply the most restrictive filters first, so that as few documents as possible go through to the next filter.
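The answer's two filters, assembled in the recommended order (a sketch; note that in Elasticsearch date math a capital M means months, while a lowercase m means minutes):

```python
# Filter B (cacheable, selective term filter) placed before
# filter A (uncached date range filter using "now").
filter_a = {"range": {"timestamp": {"gt": "now-1M"}}}  # not cached: uses now
filter_b = {"term": {"type": "XYZ"}}                   # cacheable, selective

recommended = {"bool": {"must": [filter_b, filter_a]}}
```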

I personally call this the "bulldozer approach": make sure to clear away as much material as possible early in the filter pipeline, so that you end up with smaller, more chewable chunks of data that can be processed faster.

+1

This blog post on the elastic website, published in May 2017, states:

Q: Does the order in which I put my queries/filters in the query DSL matter?

A: No, because they will be automatically reordered anyway, based on their respective costs and match costs.

0

I believe it is either that intersecting a smaller set of matching documents with a larger set is more efficient, or that working in that order gives a better chance of hitting zero matches early, so the evaluation can be short-circuited. You would need to check the source code (Elasticsearch and Lucene) to know for sure.

Correct me if I am wrong ...

-1
