Elasticsearch - single index versus multiple indexes?

Question

I am working on a solution for storing application logs in Elasticsearch for many applications across many development teams. The structure of each log entry is identical, with an "application" field to specify the application.

Goal #1 is to keep queries within a single “application” efficient. Querying across all applications, although important, is secondary.

I am trying to determine which is better:

EDIT: in both cases I will use time-based indexes.

Multiple index series

Each “application” gets its own series of time-based indexes (app1-2017-04-01, app1-2017-04-02, etc.). Users search directly against these smaller indexes. The idea is that because the indexes are smaller, queries might be faster.

Single index series

Use one large series of indexes to hold all application logs (e.g. logs-2017-04-01, logs-2017-04-02, etc.). Users filter on the “application” field to narrow their search results.

Which is faster in this case? I'm curious about the overhead of the additional indexes.
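
To make the two layouts concrete, here is a minimal sketch of how a search for one application might be addressed in each case. The index patterns and the "application" field come from the question; the helper names, the `message` field, and the exact query shape are assumptions for illustration only. The term filter also assumes "application" is mapped as an exact-value (keyword) field.

```python
def per_app_search(app, text):
    """Option 1: the index pattern itself selects the application."""
    path = f"/{app}-*/_search"
    # `message` is a hypothetical field holding the log line text.
    body = {"query": {"match": {"message": text}}}
    return path, body


def shared_series_search(app, text):
    """Option 2: one shared series, narrowed by a filter on `application`."""
    path = "/logs-*/_search"
    body = {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                # Filter clauses do not affect scoring.
                "filter": [{"term": {"application": app}}],
            }
        }
    }
    return path, body
```

Either pair can be sent as the path and JSON body of a search request. The only difference is where the application is selected: in the index name or in the query.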

elasticsearch

bradforj287 Jun 22 '17 at 12:56

4 answers

Random · Answer 1 · 2017-06-27T20:44:00+0000

In most cases it is better to use several indexes:

Searching a smaller data set is faster.
You are less constrained by the mapping. If you need to change it for new data, you can keep the old data as it is and simply put a new mapping on the new index.
It is more scalable and flexible. You can keep old indexes on a different disk or on a different machine.
If necessary, you can still search across multiple indexes.
The per-index overhead is small. If an index holds many documents, the documents take far more space than the index metadata. If it does not, you can split the log indexes less often (for example weekly instead of daily).
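
As a rough illustration of searching several time-based indexes at once, the sketch below builds a search path from a date range, using the app1-YYYY-MM-DD naming from the question. Elasticsearch accepts a comma-separated list of index names in the search path; the function names are hypothetical.

```python
from datetime import date, timedelta


def daily_indexes(prefix, start, end):
    """List the daily index names from start to end, inclusive."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]


def search_path(prefix, start, end):
    """Join the index names into one multi-index search path."""
    return "/" + ",".join(daily_indexes(prefix, start, end)) + "/_search"
```

For example, `search_path("app1", date(2017, 4, 1), date(2017, 4, 3))` addresses exactly three daily indexes instead of the whole series.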

federicojasson · Answer 2 · 2017-06-23T03:27:15+0000

Keeping separate indexes for different applications gives you flexibility and can ultimately help performance, because you can tune the number of shards and replicas per application. In any case, you can still search across applications by defining aliases or simply using wildcards.

Given that several teams will access the data, keeping separate indexes per application is also clearer. Finally, if you eventually want to add some kind of access control (using Shield / X-Pack), having separate indexes will definitely make things easier.
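
The cross-application search via aliases mentioned above could look roughly like this. The sketch only builds the request body for the `_aliases` endpoint; the alias name `all-logs` and the function name are assumptions, not something from the question.

```python
def alias_actions(alias, index_patterns):
    """Build an _aliases body that adds each index pattern to one alias."""
    return {
        "actions": [
            {"add": {"index": pattern, "alias": alias}}
            for pattern in index_patterns
        ]
    }


# Hypothetical usage: one alias spanning two per-application series.
body = alias_actions("all-logs", ["app1-*", "app2-*"])
```

Posting `body` to `/_aliases` and then searching `/all-logs/_search` would query both series. Without an alias, a wildcard path such as `/app*-2017-04-01/_search` gives a similar result.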

Bruno dos santos · Answer 3 · 2017-06-29T15:11:18+0000

In terms of performance, one large index is better than several small indexes, as explained in the article Index vs. Type by Adrien Grand:

An index is stored in a set of shards, which are themselves Lucene indexes. This already gives you an idea of the limits of creating new indexes all the time: Lucene indexes have a small but fixed overhead in terms of disk space, memory usage and file descriptors. For this reason, one large index is more efficient than several small indexes: the fixed cost of the Lucene index is better amortized across many documents.

My suggestion is to use one time-based index for all applications, where each application is a different type within the index. This makes it easy to search the logs of a single application and just as easy to search all applications at once.

For example:

If you want to search in only one application, you can use:

http://yourserver:9200/logs-2017-04-01/app1/_search

And for all applications:

http://yourserver:9200/logs-2017-04-01/_search

Another point to consider is that each application can produce a different number of log entries. If you have a separate index for each application, it becomes difficult to choose the right shard sizing for each of them. Using a single index makes it easier to size your cluster. If the index grows too large, just split it into more shards.

Andrei Stefan · Answer 4 · 2017-07-01T14:34:49+0000

I will offer some hypothetical guidance, since you chose not to answer my questions.

For the logging use case (time-based indexes), you need some data about future plans at hand: how long you want to keep the log data around (retention period), what the usage pattern for the collected data will be (query frequency, indexing frequency), and how much data there will be each day (meaning data on disk, and also shard size). Before thinking about the "per-app index" versus "single index" question, consider the tips below. Once you have done the math on shard sizes and on how many shards there will be over the chosen retention period, you can decide between per-application indexes and a single index.

Depending on shard sizes and especially on the retention period, you next need to decide whether the time-based indexes should be daily, weekly or monthly. A good rule of thumb for shard size is a maximum of 30-50 GB; above that, recovery, shard relocation and searching become potentially slower and can affect cluster stability.
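
The sizing rule above can be turned into quick arithmetic. This is a back-of-the-envelope sketch, not an official formula: the 50 GB default is the upper end of the range quoted above, and the daily volumes in the usage note are made-up inputs.

```python
import math


def index_size_gb(daily_gb, days_per_index):
    """On-disk size of one time-based index covering days_per_index days."""
    return daily_gb * days_per_index


def primaries_needed(size_gb, max_shard_gb=50):
    """Smallest primary shard count that keeps each shard under the cap."""
    return max(1, math.ceil(size_gb / max_shard_gb))
```

With a hypothetical 120 GB per day and daily indexes, `primaries_needed(index_size_gb(120, 1))` gives 3. With 2 GB per day, a daily index needs only 1 primary and would be tiny, while a monthly index of about 60 GB needs 2, which suggests weekly or monthly indexes for low-volume series.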

If your applications can generate daily data volumes that exceed the figure mentioned above, do not choose per-application indexes. If the volume is smaller, then again it depends. A huge number of shards on one node consumes resources and makes searching slow. Each shard has a fixed memory overhead that is consumed simply because the shard exists. In addition, a search runs on a single thread per shard, and one thread basically occupies one CPU core. The wider the time range used in search queries (and so the greater the number of indexes), the more shard-level searches run concurrently and the more context switching the OS does between threads competing for CPU cores. In general, do not try to squeeze hundreds of shards onto one node unless only some of them will be in use at any given time. If you plan to query all the data in your cluster most of the time, the number of shards you can afford on each node drops drastically. Otherwise, your cluster will not be able to handle the load.
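
To see how quickly shard counts grow under each layout, here is a small sketch of the arithmetic. All inputs in the usage note (50 applications, 30 days of retention, 6 data nodes) are hypothetical, and the function names are illustrative.

```python
def total_shards(series_count, retained_indexes, primaries, replicas):
    """Total shards, counting primary and replica copies, across all series."""
    return series_count * retained_indexes * primaries * (1 + replicas)


def shards_per_node(total, data_nodes):
    """Average shards per data node, assuming an even spread."""
    return total / data_nodes
```

With 50 per-application series, 30 retained daily indexes, 1 primary and 1 replica, `total_shards(50, 30, 1, 1)` is 3000 shards, or 500 per node on 6 nodes. A single shared series with 3 primaries and 1 replica gives `total_shards(1, 30, 3, 1)`, which is 180 shards, or 30 per node.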

If your logging use case is one where activity is concentrated on the most recent data (the last few days to a week), consider a hot-warm architecture: https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x

Setting up and sizing a cluster always involves testing. So please test the performance of your queries on data that is as close as possible to the real data. Also, do this on a single node that has the same hardware specifications as the nodes in the production cluster.
