Find records in which array A contains at least X values from array B

I have a collection of objects, each of which has a field called fingerprint that contains 20 hashes:

{ title: 'The Chronicles of Narnia', authors: ['CS Lewis'], fingerprint: ['50e...', 'ae2...', ...] } 

Then I have a query fingerprint of another 20 hashes. What I would like to do is find all the records whose fingerprint contains at least X of those hashes. In other words, the intersection of the two arrays must be at least a certain size.
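To make the condition concrete, here is a minimal plain-JavaScript sketch of the predicate being asked for. The function name `sharesAtLeast` and the sample hashes are made up for illustration; it only restates the requirement and is not how the database would evaluate it.

```javascript
// Hypothetical helper: true when `fingerprint` and `query` share at least
// `minMatches` distinct hashes.
function sharesAtLeast(fingerprint, query, minMatches) {
  const wanted = new Set(query);
  const shared = new Set();
  for (const hash of fingerprint) {
    if (wanted.has(hash)) shared.add(hash);
  }
  return shared.size >= minMatches;
}

// Made-up sample data (real fingerprints would hold 20 hashes).
const doc = { title: 'The Chronicles of Narnia', fingerprint: ['a1', 'b2', 'c3', 'd4'] };
console.log(sharesAtLeast(doc.fingerprint, ['b2', 'd4', 'zz'], 2)); // true
console.log(sharesAtLeast(doc.fingerprint, ['b2', 'd4', 'zz'], 3)); // false
```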

I have an old implementation of a similar system that uses MySQL. There the query looks something like this:

 SELECT * FROM Document d INNER JOIN Fingerprint f ON d.id = f.document_id WHERE f.whorl IN (:hashes) GROUP BY d.id HAVING COUNT(d.id) >= X 

Each row in the Fingerprint table contains a document identifier and one fingerprint whorl. The table holds 20 rows for each document.

As far as I understand, this query duplicates the document row every time a whorl matches and then groups the rows by unique document. It seems a little wasteful, but it works.
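The join-then-group behaviour described above can be mimicked in a few lines of plain JavaScript. This is only a sketch of what the query computes, with made-up rows and a hypothetical helper name (`matchingDocumentIds`); it says nothing about how MySQL actually executes it.

```javascript
// Made-up rows standing in for the Fingerprint table: one row per (document, whorl).
const fingerprintRows = [
  { document_id: 1, whorl: 'a1' },
  { document_id: 1, whorl: 'b2' },
  { document_id: 1, whorl: 'c3' },
  { document_id: 2, whorl: 'b2' },
  { document_id: 2, whorl: 'x9' },
];

// Hypothetical helper mirroring WHERE ... IN, GROUP BY and HAVING COUNT >= X.
function matchingDocumentIds(rows, hashes, minMatches) {
  const wanted = new Set(hashes);
  const counts = new Map();
  for (const row of rows) {
    if (!wanted.has(row.whorl)) continue; // WHERE f.whorl IN (:hashes)
    // GROUP BY d.id with COUNT: one increment per matching row
    counts.set(row.document_id, (counts.get(row.document_id) || 0) + 1);
  }
  // HAVING COUNT(d.id) >= X
  return [...counts].filter(([, n]) => n >= minMatches).map(([id]) => id);
}

console.log(matchingDocumentIds(fingerprintRows, ['a1', 'b2', 'zz'], 2)); // [ 1 ]
```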

I am trying to re-implement this system in MongoDB but am having no luck. I can get a list of all entries that contain at least one, or all, of the whorls:

 // at least one:
 db.objects.find({ fingerprint: {$in: [hashes]} })
 // all:
 db.objects.find({ fingerprint: {$all: [hashes]} })

I understand that I could scan this list at the application level to find the matches I need. Since I expect millions of objects (currently about 1.5 million), that seems like a bad idea.

I looked at aggregate(), but could not improve on what I already have:

 db.objects.aggregate({$match: {fingerprint: {$in: [hashes]}}}) 

From here, I thought I could group and filter:

 db.objects.aggregate({$match: {fingerprint: {$in: [hashes]}}}, {$group: {_id: "$_id", matches: {$sum: 1}}}) 

Here I tried to replicate what MySQL did: for each match it emits a document and then counts the documents. Of course, here each document is emitted only once, no matter how many matches there are.

Then I thought about using $unwind on the matched list, but that produces 20 documents every time.

Ideally, there would be a $some operator that I could use as follows:

 db.objects.find({ fingerprint: {$some: {from: [hashes], count: X}} })

Is something like this possible and efficient? I would like to be able to run these queries in response to a user search, so I think MapReduce is out of the question?

thanks

1 answer

It’s actually quite simple to do what you want using the aggregation framework. I am sure you can refine the following to do exactly what you need:

 db.objects.aggregate([
   {$unwind: "$fingerprint"},
   {$match: {fingerprint: {$in: [hashes]}}},
   {$group: {_id: "$title", numMatches: {$sum: 1}}},
   {$match: {numMatches: {$gte: X}}}
 ])
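One possible refinement, offered as a sketch and not as part of the original answer: on MongoDB 2.6 or later, the `$setIntersection` and `$size` aggregation operators can count the shared hashes per document without unwinding, and a leading `$match` with `$in` can discard non-overlapping documents early (it can use an index on `fingerprint` if one exists). The helper name `buildFingerprintPipeline` is hypothetical; `hashes` is assumed to be a plain array of strings and `minMatches` plays the role of X.

```javascript
// Hypothetical helper: builds an aggregation pipeline that keeps documents whose
// `fingerprint` array shares at least `minMatches` values with `hashes`.
// Assumes MongoDB 2.6+ ($setIntersection, $size and $literal are available).
function buildFingerprintPipeline(hashes, minMatches) {
  return [
    // Drop documents with no overlap at all before doing any per-document work.
    { $match: { fingerprint: { $in: hashes } } },
    // Count the distinct shared hashes for each remaining document.
    { $project: {
        title: 1,
        numMatches: {
          $size: { $setIntersection: ['$fingerprint', { $literal: hashes }] }
        }
    } },
    // "At least X" means greater than or equal to.
    { $match: { numMatches: { $gte: minMatches } } }
  ];
}

// In the mongo shell, assuming `hashes` is an array and X is a number:
//   db.objects.aggregate(buildFingerprintPipeline(hashes, X))
```

Because nothing is unwound, each document stays a single document through the pipeline, and `_id` is kept by `$project` by default, so there is no need to group on `title`.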