MongoDB 2.6 index tuning: query using $or and $in with limit and sort

I have a somewhat complicated query that is critical for my application.

    $cur = $col->find(array('$or' => array(
        array('owner' => $my_id),
        array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
        array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
    )))->limit(10)->skip(0)->sort(array('ca' => -1));

The goal is to find the first 10 posts, sorted by creation time in descending order, that were:

a) made by me, or b) made by my friends with permission type "P" (public) or "F" (friends), or c) made by my friends with me specifically designated as a viewer in the permission list.

The $friends variable is an array of user IDs of my friends. perm.type has 4 possible values: "P", "F", "S", and "C". perm.list is an array of user IDs that have permission to view the message.

The above query filters the correct results as intended, but I ran into problems creating effective indexes for it.

The indexes I created for this query are:

    $col->ensureIndex(array('owner' => 1, 'ca' => -1));
    $col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1));
    $col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));

The first index is for the first $or clause, the second for the second clause, and the third for the third clause (a multikey index, since perm.list is an array).
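
To see which index each clause picks, here is a minimal sketch of how the plan can be inspected with the legacy PHP driver's explain() (the 'clauses' field shown here comes from 2.6-era explain output for $or queries; its exact layout may vary between versions):

    // Sketch: inspect the plan per $or clause; field layout may differ by version.
    $criteria = array('$or' => array(
        array('owner' => $my_id),
        array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
        array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
    ));
    $plan = $col->find($criteria)->limit(10)->sort(array('ca' => -1))->explain();
    if (isset($plan['clauses'])) {
        foreach ($plan['clauses'] as $i => $clause) {
            echo "clause $i: {$clause['cursor']} nscanned={$clause['nscanned']}\n";
        }
    }
    echo "total: n={$plan['n']} nscanned={$plan['nscanned']}\n";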

A typical post would look like this:

 { "_id": "...", "owner": "001", "perm": { "type": "P", "list": [] }, "msg": "Nice dress!", "ca": 1390459269 } 

Another example:

 { "_id": "...", "owner": "007", "perm": { "type": "C", "list": ["001", "005"] }, "msg": "Nice day!", "ca": 1390837209 } 

I know about the limitation that existed before MongoDB 2.6 which prevented the use of indexes when combining $or with sort(). According to http://jira.mongodb.org/browse/SERVER-1205 , this should be fixed in 2.6.

And indeed, explain() now shows my indexes being used, where in 2.4 it did not. But when I run the query, it is now much slower than when it used no indexes at all. explain() shows that nscanned is much higher than expected. After some searching, I found https://jira.mongodb.org/browse/SERVER-3310 , which seems to describe the problem I am having. But according to that ticket, this should have been fixed in 2.5.5, so what is causing my problem here?

I tried setting up different indexes, arranging their fields in different orders, even splitting them apart to check whether the new index intersection feature would help. But none of them worked.

Does anyone know what my problem is here?

Edit: After more testing, observation, and thinking, I narrowed the problem down: it is using $in, limit(), and sort() all together in one query that causes it. Adding a top-level $or simply multiplies the problem by the number of $or clauses. I will explain my reasoning below.

I updated my indexes as follows:

    $col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1));
    $col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1));

The rationale for the first index is this: with millions of records, the query should first narrow down by the given set of user IDs (friends), then walk the entries in reverse chronological order, checking the permission type of each. The problem with this index is that the query optimizer has no idea how many records it must scan to satisfy limit(10). It does not know where the 10 most recent entries will come from, so it has to fetch up to 10 entries for each ID in the $in clause, and then repeat the same thing for each $or clause. So if I have two $or clauses, each with an $in of 100 user IDs, it has to scan enough records to match 10 entries for each of the users in the first clause's $in, and then 10 entries for each of the users in the second clause's $in, returning up to 2000 entries (this is the n reported by explain(); nscanned will be much higher, depending on how many records must be scanned to find those 2000 matches). These 2000 entries are then ordered chronologically and the top 10 are returned.

So what if I create the index in the order 'ca' => -1, 'owner._id' => 1, 'perm.type' => 1? Well, I can't, because once I have hundreds of thousands of users with millions of records, most records will be irrelevant to the viewer. If the index starts with 'ca' => -1, the scan walks through a lot of irrelevant records before it hits one that matches the criteria. Even though every match it finds counts directly against limit(10), so it only needs to scan as many records as necessary to find 10 matching ones, that scan can still cover tens of thousands of records or more. Worst of all, if 10 matching records cannot be found at all, it has to walk the entire index to find that out.
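
For reference, the 'ca'-first variant discussed above would be created like this (shown purely for illustration; I did not keep this index):

    // 'ca'-first index considered above; rejected because a mostly-irrelevant
    // timeline forces a long scan before 10 matching records are found.
    $col->ensureIndex(array('ca' => -1, 'owner._id' => 1, 'perm.type' => 1));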

The rationale for the second index is to look at each entry whose perm.list designates me, walk them in reverse chronological order, and check whether the owner is one of my friends. This is pretty straightforward, but the problem is the same combination: $in, limit(), and sort() all together in one query.

At this point, I am considering merging results on the application side. Splitting the $or clauses apart on the application side is easy, but how do I split the $in inside a clause like array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F')))?
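
For illustration, the app-side split of the $or clauses might look like this rough sketch (my naive throwaway code; de-duplication and pagination are not handled properly):

    // Sketch: run each $or clause as its own query, merge, de-duplicate,
    // re-sort, and cut to 10. Note the second and third clauses still combine
    // $in with sort()/limit(), which is exactly the slow pattern.
    $clauses = array(
        array('owner' => $my_id),
        array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
        array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
    );
    $merged = array();
    foreach ($clauses as $clause) {
        $cur = $col->find($clause)->limit(10)->sort(array('ca' => -1));
        foreach ($cur as $doc) {
            $merged[(string)$doc['_id']] = $doc; // key by _id to de-duplicate
        }
    }
    $merged = array_values($merged);
    usort($merged, function ($a, $b) { return $b['ca'] - $a['ca']; }); // newest first
    $top10 = array_slice($merged, 0, 10);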

3 answers

After 3 days of testing and research, the cause of the inefficient queries is now clear: MongoDB, as of the current version (2.6.1), still cannot optimize queries that use $or, $in, limit() and sort() all at the same time. https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 are indeed fixed, and each improved performance for queries that use 3 of the 4 operations above. But as soon as the 4th operation is introduced into the query, the optimization goes out the window. The behavior shows up as full index scans and document fetches within each $or clause, even when limit(10) is specified.

Attempting to solve this by splitting the $or clauses apart and merging the results on the application side is possible, but it ran into serious obstacles when I tried to implement pagination.

So my current solution is to come up with a query equivalent to the original while using only 3 of the 4 operations. I decided to "flatten" the $in operator, turning each element of the $friends array into a separate $or clause with the exact owner value to be queried. So instead of the 3 $or clauses in my original query, I now have as many $or clauses as there are elements in my $friends array, plus the 2 other original $or clauses.
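
In outline, the flattening looks roughly like this (a sketch of the idea only; the exact number and layout of clauses in my real query differs slightly):

    // Sketch: expand every $in away, so the query uses only $or + sort() +
    // limit(). Each friend gets exact-valued clauses instead of $in ranges.
    $or = array(array('owner' => $my_id)); // my own posts
    foreach ($friends as $friend) {
        $or[] = array('owner' => $friend, 'perm.type' => 'P');    // public
        $or[] = array('owner' => $friend, 'perm.type' => 'F');    // friends
        $or[] = array('owner' => $friend, 'perm.list' => $my_id); // lists me
    }
    $cur = $col->find(array('$or' => $or))
               ->limit(10)->skip(0)->sort(array('ca' => -1));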

Now the query is optimized. When run with explain(), nscannedObjects and nscanned are down to the values they should be. This matches the documentation for $or, which states:

When using indexes with $or queries, each clause of an $or will execute in parallel. These clauses can each use their own index.

This may indeed be an acceptable solution performance-wise. I hope this helps anyone who is facing the same issues as me.


I'm not sure if this is a bug in MongoDB 2.6, but you can take a look at this article on index creation.

The order of the fields in the index should be:

    1. First, fields on which you will query for exact values.
    2. Second, fields on which you will sort.
    3. Finally, fields on which you will query for a range of values.

So, following these tips, you can try using these indices:

    $col->ensureIndex(array('owner' => 1, 'ca' => -1));
    $col->ensureIndex(array('ca' => -1, 'owner' => 1, 'perm.type' => 1));
    $col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner' => 1));

Edit:

From your explanation: if you are testing on small datasets, a full scan is fast because MongoDB does not need to go through many documents. You should run a test on something like 10,000 documents to see a real difference. The values of the indexed fields must also be varied enough to give your queries good index selectivity (for example, not all documents belonging to the same owner).
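
For example, a quick way to seed such a test set and look at a plan (a rough sketch with the legacy PHP driver; all values are made up):

    // Seed ~10,000 documents across 500 distinct owners with spread-out
    // timestamps, then check the plan of a representative query.
    for ($i = 0; $i < 10000; $i++) {
        $col->insert(array(
            'owner' => (string)rand(1, 500),
            'perm'  => array('type' => 'P', 'list' => array()),
            'msg'   => "test message $i",
            'ca'    => time() - rand(0, 86400 * 30),
        ));
    }
    $plan = $col->find(array('owner' => '42'))
                ->sort(array('ca' => -1))->limit(10)->explain();
    echo $plan['cursor'] . ' nscanned=' . $plan['nscanned'] . "\n";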


TL;DR: I believe you are using the wrong algorithm/data structure for the tool, or vice versa. I would suggest a fan-out approach, as discussed in this SO question on the topic, or in my blog post. Sorry for the shameless plug of my own posts, but there is no point in repeating all that information here.


The MongoDB philosophy, contrary to the typical SQL philosophy, is rather write-heavy. Essentially, you are trying to implement a ranking algorithm in a MongoDB query, but MongoDB's query philosophy is "query by example". The two do not fit together well.

Granted, the aggregation pipeline no longer strictly follows that philosophy, and things may change. There are optimizations on the way that allow more complex queries, such as index intersection.

However, what you are doing here is very hard to keep under control. Not only do you want MongoDB to use index intersection (new in version 2.6, and it only works with two indexes at the moment), but you are also combining it with $in queries and compound indexes. That is a lot to ask, and if the number of friends in the $in grows too large, you are still out of luck. The same goes if a post is shared with too many people; in the worst case the document grows toward the 16 MB limit. Growing documents are expensive, complex queries are expensive, and large documents are expensive, too.

I suggest you use the fan-out approach for news feeds, where you can implement a very sophisticated ranking algorithm in code rather than in MongoDB.
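
To illustrate, a minimal fan-out-on-write sketch (collection and field names are invented for this example):

    // When a user posts, push a lightweight reference into each eligible
    // viewer's feed; reading a feed then becomes one indexed query with
    // sort + limit and no $or/$in acrobatics.
    function publishPost(MongoDB $db, array $post, array $audience) {
        $db->posts->insert($post); // legacy driver adds '_id' to $post
        foreach ($audience as $viewerId) {
            $db->feeds->insert(array(
                'viewer' => $viewerId,
                'post'   => $post['_id'],
                'ca'     => $post['ca'],
            ));
        }
    }
    // Supporting index and read query:
    // $db->feeds->ensureIndex(array('viewer' => 1, 'ca' => -1));
    // $db->feeds->find(array('viewer' => $my_id))->sort(array('ca' => -1))->limit(10);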

I'm not saying that it is impossible to optimize your query, but with explain() output this hard to read and with so many effects involved (typical array sizes, typical match ratios, selectivity of the indexes, etc.), it will be very difficult to find a good solution to this problem, even for someone with full access to the data (i.e. you).

Even if you get it working, you may run into critical problems when your access patterns or your data change, so you would be stuck with a fragile design.

