How to read a collection in chunks of 1000?

I need to read a whole collection from MongoDB (the collection name is "test") in Python code. I tried this:

    self.__connection__ = Connection('localhost', 27017)
    dbh = self.__connection__['test_db']
    collection = dbh['test']

How can I read the collection in chunks of 1000 (to avoid running out of memory, because the collection can be very large)?

+9
6 answers

I agree with Remon, but you asked for batches of 1000, which his answer doesn't really cover. You can set the batch size on the cursor:

    cursor.batch_size(1000)

You can also skip entries, for example:

    cursor.skip(4000)

Is this what you are looking for? This is essentially a pagination pattern. However, if you are just trying to avoid running out of memory, you do not need to set the batch size or skip at all.
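
For reference, here is a minimal pymongo sketch of both approaches (driver-level batching via batch_size, and explicit skip/limit pagination); the connection details and page size are placeholders, not anything from the original answer:

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    collection = client['test_db']['test']

    # Let the driver fetch documents from the server in batches of 1000.
    # Iteration stays a plain loop and only one batch is held in memory.
    for doc in collection.find().batch_size(1000):
        pass  # process doc

    # Explicit skip/limit pagination (the pagination pattern mentioned above).
    page_size = 1000
    page = 0
    while True:
        docs = list(collection.find().skip(page * page_size).limit(page_size))
        if not docs:
            break
        for doc in docs:
            pass  # process doc
        page += 1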

+6

Use cursors. Cursors have a "batchSize" setting that determines how many documents are actually sent to the client in each batch after the initial query. You do not need to touch this setting, though, since the default is fine and the complexity of issuing the "getmore" commands is hidden from you in most drivers. I am not familiar with pymongo, but it works something like this:

    cursor = db.col.find() // Get everything!
    while (cursor.hasNext()) {
        /* This uses the documents already fetched, and if it runs out of
           documents in its local batch it fetches another X of them from
           the server (where X is batchSize). */
        document = cursor.next();
        // Do your magic here
    }
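
Since the snippet above is shell-style pseudocode, here is a rough pymongo equivalent; this is only a sketch, with the connection details assumed from the question:

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    collection = client['test_db']['test']

    cursor = collection.find()  # get everything
    for document in cursor:
        # pymongo fetches documents from the server batch by batch behind the
        # scenes, so only the current batch is held in client memory.
        pass  # do your magic here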
+5

Inspired by @Rafael Valero's answer, fixing the last-chunk bug and making it more general, I created a generator function to iterate over a Mongo collection with a query and projection:

    def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
        chunks = range(start_from, collection.find(query).count(), int(chunksize))
        num_chunks = len(chunks)
        for i in range(1, num_chunks + 1):
            if i < num_chunks:
                yield collection.find(query, projection=projection)[chunks[i-1]:chunks[i]]
            else:
                yield collection.find(query, projection=projection)[chunks[i-1]:chunks.stop]

so for example, you first create an iterator like this:

 mess_chunk_iter = iterate_by_chunks(db_local.conversation_messages, 200, 0, query={}, projection=projection) 

and then iterate over it chunk by chunk:

    chunk_n = 0
    total_docs = 0
    for docs in mess_chunk_iter:
        chunk_n = chunk_n + 1
        chunk_len = 0
        for d in docs:
            chunk_len = chunk_len + 1
            total_docs = total_docs + 1
        print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
    print("total docs iterated: ", total_docs)

    chunk #: 1, chunk_len: 400
    chunk #: 2, chunk_len: 400
    chunk #: 3, chunk_len: 400
    chunk #: 4, chunk_len: 400
    chunk #: 5, chunk_len: 400
    chunk #: 6, chunk_len: 400
    chunk #: 7, chunk_len: 281
    total docs iterated:  2681
+3

Here is a general solution for iterating over any iterator or generator in batches:

    def _as_batch(cursor, batch_size=50):
        # Iterate over something (pymongo cursor, generator, ...) by batch.
        # Note: the last batch may contain fewer than batch_size elements.
        batch = []
        try:
            while True:
                for _ in range(batch_size):
                    batch.append(next(cursor))
                yield batch
                batch = []
        except StopIteration:
            if len(batch):
                yield batch

This will work as long as the cursor defines a __next__ method (that is, as long as we can call next(cursor)), so we can use it both on a raw pymongo cursor and on a generator of transformed records.

Examples

Simple use:

    for batch in _as_batch(db['coll_name'].find()):
        # do stuff

More complicated use (e.g. useful for bulk updates):

    def update_func(doc):
        # dummy transform function
        doc['y'] = doc['x'] + 1
        return doc

    query = (update_func(doc) for doc in db['coll_name'].find())
    for batch in _as_batch(query):
        # do stuff

Reimplementing the count() function:

 sum(map(len, _as_batch( db['coll_name'].find() ))) 
+1

To create the initial connection (currently in Python 2, using pymongo):

    host = 'localhost'
    port = 27017
    db_name = 'test_db'
    collection_name = 'test'

Connect using MongoClient:

    # Connect to MongoDB
    client = MongoClient(host=host, port=port)

    # Get handles to the specific DB and collection
    dbh = client[db_name]
    collection = dbh[collection_name]

Now, the actual answer: I want to read in chunks (in this case, of size 1000).

 chunksize = 1000 

For example, we can work out how many chunks of size chunksize we need:

    # Some variables to create the chunks
    skips_variable = range(0, dbh[collection_name].find(query).count(), int(chunksize))
    if len(skips_variable) <= 1:
        skips_variable = [0, len(skips_variable)]

Then we can retrieve each chunk:

    for i in range(1, len(skips_variable)):
        # Expand the cursor and retrieve data
        data_from_chunk = dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]]

In this case the query is query = {} (i.e. all documents).

Elsewhere I use similar ideas to create dataframes from MongoDB, and something similar to write to MongoDB in chunks.
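
As an illustration of the dataframe idea (a sketch only, assuming pandas is installed and reusing the chunksize/skips_variable setup from the code above):

    import pandas as pd

    frames = []
    for i in range(1, len(skips_variable)):
        # Materialise one chunk of documents and turn it into a DataFrame.
        docs = list(dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]])
        frames.append(pd.DataFrame(docs))

    # Combine the per-chunk frames into a single DataFrame.
    df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()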

Hope this helps.

0

Sorry that this is only a URL, but I believe the problem is elegantly solved by this recipe: http://code.activestate.com/recipes/137270-use-generators-for-fetching-large-db-record-sets/
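
The recipe's idea is a generator that fetches rows in fixed-size batches and yields them one at a time; a minimal sketch of the same pattern applied to a pymongo cursor might look like this (the collection name is a placeholder):

    from itertools import islice

    def result_iter(cursor, arraysize=1000):
        # Analogue of the ActiveState recipe: pull `arraysize` documents at a
        # time and yield them one by one, so only one batch sits in memory.
        while True:
            batch = list(islice(cursor, arraysize))
            if not batch:
                break
            for doc in batch:
                yield doc

    for doc in result_iter(db['test'].find(), arraysize=1000):
        pass  # process each document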

-2
