I have a somewhat complicated request that is very critical for my application.
$cur = $col->find( array ( '$or' => array( array('owner' => $my_id), array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))), array('owner' => array('$in' => $friends), 'perm.list' => $my_id) ) ) )->limit(10)->skip(0)->sort(array('ca' => -1));
The goal is to find the first 10 posts sorted by the time they were created in desc order, which:
a). made by me or b). made by my friends with permissions like “P” for the public, or “F” for friends, or c). made by my friends, the permission list specifically designated me as a viewer.
The $ friends variable is an array of user IDs who are friends with me. Perm.type has a total of 4 values, which are "P", "F", "S", "C". perm.list is an array of user IDs that have permission to view this message.
The above query works as intended to filter the correct results. But I ran into the problem of creating effective indexes on them.
The indexes I created for this query are:
$col->ensureIndex(array('owner' => 1, 'ca' => -1)); $col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1)); $col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));
The first index is for the first part of the query criteria, the second index is for the 2nd criterion, and the third is for the 3rd criterion and is a multi-index.
A typical post would look like this:
{ "_id": "...", "owner": "001", "perm": { "type": "P", "list": [] }, "msg": "Nice dress!", "ca": 1390459269 }
Another example:
{ "_id": "...", "owner": "007", "perm": { "type": "C", "list": ["001", "005"] }, "msg": "Nice day!", "ca": 1390837209 }
I know the limitation that existed before MongoDB 2.6, which prevents the use of indexes when combining $ or with sort (). The question according to this http://jira.mongodb.org/browse/SERVER-1205 should be fixed in 2.6.
And, of course, explain () now shows the use of my indexes, where before it was not in 2.4. But when I ran the query, it is now much slower than when it did not use any indexes. Explanation () showed that nscanned is much higher than expected. After some searching, I found this question https://jira.mongodb.org/browse/SERVER-3310 , which seems to explain the problem I am having. But, as stated in the statement, this question should have been fixed in 2.5.5, so what causes my problem here?
I tried to set up different indexes, exacerbating them in different orders, even separating them, checking if the new index intersection function would help. But no one worked.
Does anyone know my problem is here?
Edit After more testing, observation, and thinking, I narrowed down the problem, and it really uses $ in, limit (), and sort () all together in one request, which causes the problem. Adding a top level of '$ or' simply doubles this problem for every $ or sentence. I will explain my logic below:
I updated my indexes as follows:
$col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1)); $col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1))
The rationale for the first index is when I have millions of records, first the query should start with a given set of user identifiers (friends) in order to narrow down the choice. He then goes through it in reverse chronological order of entries to check whether each type of rights has. The problem with this index is that the query optimizer has no idea how many records need to be scanned to satisfy the constraint condition (10). He does not know where the 10 most recent entries will eventually appear, so he must return up to 10 entries from each identifier specified in the $ in section, and then repeat the same thing for each $ or. Therefore, if I have two sentences "$ or", each of which has a "$ in" consisting of 100 user IDs, then he will have to scan enough entries to match 10 entries from each of the users in "$ in" first " $ or ", and then 10 entries from each of the users in the" $ in "of the second" $ or ", giving a return of 2000 entries (this n is returned in the explanation, and nscanned will be much higher depending on how many entries you need to scan to find matches 2000), and of these 2000 entries, all are chronologically ordered e already, the top 10 returns.
So what if I create the index in the following order: "'ca' => -1, 'owner._id' => 1, 'perm.type' => 1"? Well, I can’t do this, because when I have hundreds of thousands of users, with millions of records, most of the records will be irrelevant to the viewer. Therefore, if I first start with "ca" => -1, it scans a lot of irrelevant records before it hits one that matches the criteria, even if every hit that it detects counts directly against the limit (10), and it will only need to scan as many records as necessary to match 10 records that meet the criteria. But this scan can be tens of thousands of records or even more. Worst of all, if 10 records cannot be found, he will have to go through the entire index to find out.
The second pointer is to look at each entry that is indicated for me, go through it in reverse chronological order and check if my friends have it. This is pretty straight forward, and the problem here is that it has to do with a combination of using this: with '$ in', limit () and sort () on top, all together in one request.
At this point, I am considering solutions for combining results from the application side, but splitting "$" or "on the application side" is easy, but how to split "$ in" into an array of criteria ('owner' => array ('$ in' => $ friends), 'perm.type' => array ('$ in' => array ('P', 'F')))?