I have a collection of objects, each of which has a field called fingerprint that contains 20 hashes:
{ title: 'The Chronicles of Narnia', authors: ['CS Lewis'], fingerprint: ['50e...', 'ae2...', ...] }
Then I have a fingerprint request for another 20 hashes. What I would like to do is find all the records that contain at least X hashes. In other words, the intersection of two arrays must be of a certain size.
I have an old implementation of a similar system that uses MySQL. There the query looks something like this:
SELECT * FROM Document d INNER JOIN Fingerprint f ON d.id = f.document_id WHERE f.whorl IN (:hashes) GROUP BY d.id HAVING COUNT(d.id) >= X
Each entry in the Fingerprint table contains a document identifier and one fingerprint screw. Fingerprint will have 20 entries for each document.
As far as I understand, this is what this query does - this is a duplication of the document every time the screw matches and then is grouped by unique documents. It seems a little wasteful, but it works.
I am trying to re-implement this system in MongoDB but I have no luck. I can get a list of all entries that contain at least one or all curls:
at least one: db.objects.find({ fingerprint: {$in: [hashes]}) all: db.objects.find({ fingerprint: {$all: [hashes]})
And I understand that I could scan this list at the application level to find the matches that I need. If I expect millions of objects (currently about 1.5 million), then this seems bad.
I looked at the functionality of aggregate() , but cannot improve what I already have:
db.objects.aggregate({$match: {fingerprint: {$in: [hashes]}}})
From here, I thought I could group and filter:
db.objects.aggregate({$match: {fingerprint: {$in: [hashes]}}}, {$group: {_id: "$_id", matches: {$sum: 1}}})
Here I tried to replicate what MySQL did: for each match it emits a document and then counts the documents. Of course, here we publish the document only once, no matter how many matches there are.
Then I thought that $unwind matched list, but each time produces 20 documents.
Ideally, there is a $some operator that I could use as follows:
db.objects.find(fingerprint: {$some: {from: [hashes], count: X}})
Is something like this possible and effective? I would like to be able to run these queries in response to a user search, so I think MapReduce is out of the question?
thanks