How to get a single random document from 1 billion documents in MongoDB using Python?

I want to get one random document from a MongoDB collection. My collection currently contains over 1 billion documents. How can I get a single random document from it?

+8
python mongodb pymongo
5 answers

Add an extra field named random to every document in your collection, holding a value between 0 and 1. You can generate such values with random.random(); for example, [random.random() for _ in range(0, 10)] produces ten of them.

Then: -

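    import random

    # mongodb is assumed to be an already opened pymongo Database object
    collection = mongodb["collection_name"]
    rand = random.random()  # a float in the range [0, 1)
    # first document whose 'random' field is greater than or equal to rand
    random_record = collection.find_one({'random': {'$gte': rand}})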
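A slightly fuller sketch of the same idea, assuming a pymongo-style collection and an index on the random field. The helper name pick_random and its rng parameter are illustrative and not part of pymongo. The ascending sort returns the stored value closest to the draw, and the fallback covers a draw that is larger than every stored value.

```python
import random


def pick_random(collection, field="random", rng=random.random):
    """Return one document whose `field` is the closest value >= a random draw."""
    r = rng()
    doc = collection.find_one({field: {"$gte": r}}, sort=[(field, 1)])
    if doc is None:
        # The draw exceeded every stored value; take the closest one below it.
        doc = collection.find_one({field: {"$lt": r}}, sort=[(field, -1)])
    return doc
```

With pymongo this would be called as pick_random(mongodb["collection_name"]). It returns None only when the collection holds no document with that field.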

MongoDB may eventually provide its own implementation. The feature request is tracked here: https://jira.mongodb.org/browse/SERVER-533

Not yet implemented at the time of writing.

+6

I have never used MongoDB from Python, but there is a general solution to your problem. Here is a MongoDB shell script that retrieves a single random document:

    N = db.collection.count(condition)
    db.collection.find(condition).limit(1).skip(Math.floor(Math.random() * N))

Here condition is a MongoDB query. If you want to query the entire collection, use condition = null.

This is a general solution, so it works with any MongoDB driver.
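Since the question asks about Python, here is a minimal sketch of the same skip-based approach, assuming a pymongo-style collection that offers count_documents and find. The function name random_by_skip is illustrative.

```python
import random


def random_by_skip(collection, condition=None):
    """Pick one matching document by skipping a random number of results."""
    query = condition or {}  # an empty filter matches the whole collection
    n = collection.count_documents(query)
    if n == 0:
        return None
    offset = random.randrange(n)  # integer in the range [0, n)
    cursor = collection.find(query).skip(offset).limit(1)
    return next(iter(cursor), None)
```

The server still has to walk past every skipped document, so the cost grows with the offset.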


Update

I ran a benchmark to compare several implementations. First, I created a test collection of 5,567,249 documents with an indexed random field rnd.

I chose three methods to compare with each other:

First method:

 db.collection.find().limit(1).skip(Math.floor(Math.random()*N)) 

Second method:

 db.collection.find({rnd: {$gte: Math.random()}}).sort({rnd:1}).limit(1) 

Third method:

 db.collection.findOne({rnd: {$gte: Math.random()}}) 

I ran each method 10 times and got its average computational time:

    method 1: 882.1 msec
    method 2:   1.2 msec
    method 3:   0.6 msec

This test shows that my solution is not the fastest.

But the third method is not very good either, because it returns the first document in the database (in natural order) that satisfies rnd >= random(). Thus, its result is not truly random.
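The bias can be illustrated with a small pure-Python simulation. It assumes, as stated above, that the unsorted query returns the first match in storage order; a real server may choose a different plan. The three rnd values and the function names are made up for the illustration.

```python
from collections import Counter
import random


def first_match(rnd_values, r):
    """Index of the first document, in storage order, whose rnd is >= r."""
    for index, value in enumerate(rnd_values):
        if value >= r:
            return index
    return None


def smallest_match(rnd_values, r):
    """Index of the document with the smallest rnd that is still >= r."""
    hits = [(value, index) for index, value in enumerate(rnd_values) if value >= r]
    return min(hits)[1] if hits else None


# Three documents in storage order; the first one happens to hold a large rnd.
rnd_values = [0.9, 0.1, 0.5]
rng = random.Random(0)
draws = [rng.random() for _ in range(10_000)]

unsorted_counts = Counter(first_match(rnd_values, r) for r in draws)
sorted_counts = Counter(smallest_match(rnd_values, r) for r in draws)
```

Here unsorted_counts never contains documents 1 and 2, because document 0 shadows them, while sorted_counts reaches all three. With so few documents the sorted variant is still uneven; it evens out as the rnd values become dense.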

I think the second method is the best for frequent use. But it has one drawback: it requires modifying every document in the collection and maintaining an additional index.

+21

Since MongoDB 3.2, this can be done using the aggregate function with the $sample operator, as described in the docs. It is very fast. The following code randomly selects 20 documents from the collection.

 db.collection.aggregate( [ { $sample: {size: 20} } ] ) 

If you need to select random documents that meet specific criteria, you can combine it with the $match operator:

 db.collection.aggregate([ { $sample: {size: 20} }, { $match:{"yourField": value} } ]) 

Beware of the order! Searching about 100 thousand documents in my small database, the command above takes 15 ms; with the two stages swapped it takes 1750 ms (more than 100 times slower). The reason is fairly obvious: when $sample comes first, $match only has to examine 20 documents. Note, however, that with this order you get only the subset of those 20 random documents that match, so the result may contain fewer than 20 documents.
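A minimal Python sketch of both stage orders, assuming a pymongo-style collection with an aggregate method. The helper names sample_pipeline and sample_documents and the match_first flag are illustrative.

```python
def sample_pipeline(size, match=None, match_first=True):
    """Build an aggregation pipeline that samples `size` documents."""
    sample_stage = {"$sample": {"size": size}}
    if match is None:
        return [sample_stage]
    match_stage = {"$match": match}
    if match_first:
        # Filter first: up to `size` documents, all satisfying `match`.
        return [match_stage, sample_stage]
    # Sample first: only the sampled documents that happen to match survive.
    return [sample_stage, match_stage]


def sample_documents(collection, size, match=None, match_first=True):
    """Run the pipeline on a pymongo-style collection and return a list."""
    return list(collection.aggregate(sample_pipeline(size, match, match_first)))
```

For a single random document from the whole collection, sample_documents(collection, 1) is enough. Which order is appropriate depends on whether speed or a full-size result matters more.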

+5

In terms of performance? It is hard, to say the least, without changing your data.

Imagine rand() returns 1,000,000 and you have to skip to that position among 1 billion documents. It will be slow, very slow. This is because MongoDB does not use indexes effectively when skipping.

As @Calvin said, MongoDB has a feature request for retrieving random documents, but it has not been implemented yet.

If you need to do this regularly, the most efficient way is to add an auto-incrementing identifier to your documents (http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field) and run rand() against that field.

Edit

To clarify: when using an auto-incrementing identifier, you must first run one query (unless you track the value some other way) to get the maximum value of the field. You can query either the counters collection or your own collection, sorted in descending order ( sort({field:-1}) ) with limit(1), to get the maximum value for rand() .

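You also need to account for changes in the data (for example, deleted documents leave gaps in the sequence), which means that you really want to query with $gte on this random position rather than with an exact match.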
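The steps above can be sketched in Python as follows, assuming a pymongo-style collection whose documents carry an indexed integer field. The field name seq and the function name random_by_counter are hypothetical, and the lookup of the lowest value is an addition so the counter does not have to start at a known number.

```python
import random


def random_by_counter(collection, field="seq"):
    """Pick a document using an auto-incrementing integer field."""
    newest = collection.find_one({}, sort=[(field, -1)])
    if newest is None:
        return None  # empty collection
    oldest = collection.find_one({}, sort=[(field, 1)])
    target = random.randint(oldest[field], newest[field])  # inclusive bounds
    # $gte plus an ascending sort lands on the next existing id when
    # `target` falls into a gap left by a deleted document.
    return collection.find_one({field: {"$gte": target}}, sort=[(field, 1)])
```

A document that follows a gap is chosen more often than its neighbours, so the result is only approximately uniform once documents have been deleted.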

My idea is explained in more detail here: php mongodb find the nth element in the collection

+2

If your objects have an int id, you can do something like

 findOne({id: {$gte: rand()}}) 
+1
