TL;DR: mongoengine spends ages converting all the returned data to dicts
To test this, I created a document collection with a DictField holding a large nested dict. The document is roughly in your range of 5-10 MB.
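As a rough sanity check on that size, the nested dict can be built and measured with the stdlib alone (this snippet is my own illustration, not part of the benchmark; the JSON length is only a crude proxy that underestimates the BSON size):

```python
import itertools
import json
import random
from collections import defaultdict

# Autovivifying nested dict, same trick as in the full listing below.
tree = lambda: defaultdict(tree)
data = tree()
for d1, d2, d3 in itertools.product(['foo', 'bar'],
                                    ['spam', 'eggs', 'ham'],
                                    ["subf{}".format(f) for f in range(5)]):
    data[d1][d2][d3] = random.sample(range(50000), 20000)

# JSON length as a crude stand-in for the BSON document size.
size_mb = len(json.dumps(data)) / 1e6
print("approx. size: {:.1f} MB".format(size_mb))
```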
We can then use timeit.timeit to compare the read times of pymongo and mongoengine.
We can also use pycallgraph and GraphViz to see where mongoengine is spending its time.
Here is the complete code:
    import datetime
    import itertools
    import random
    import timeit
    from collections import defaultdict

    import mongoengine as db
    from pycallgraph.output.graphviz import GraphvizOutput
    from pycallgraph.pycallgraph import PyCallGraph

    db.connect("test-dicts")


    class MyModel(db.Document):
        date = db.DateTimeField(required=True, default=datetime.date.today)
        data_dict_1 = db.DictField(required=False)


    MyModel.drop_collection()
    data_1 = ['foo', 'bar']
    data_2 = ['spam', 'eggs', 'ham']
    data_3 = ["subf{}".format(f) for f in range(5)]
    m = MyModel()
    tree = lambda: defaultdict(tree)  # http://stackoverflow.com/a/19189366/3271558
    data = tree()
    for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
        data[_d1][_d2][_d3] = list(random.sample(range(50000), 20000))
    m.data_dict_1 = data
    m.save()


    def pymongo_doc():
        return db.connection.get_connection()["test-dicts"]['my_model'].find_one()


    def mongoengine_doc():
        return MyModel.objects.first()


    if __name__ == '__main__':
        print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
        print("mongoengine took", timeit.timeit(mongoengine_doc, number=10))
        with PyCallGraph(output=GraphvizOutput()):
            mongoengine_doc()
And the result shows that mongoengine is very slow compared to pymongo:
    pymongo took 0.87s
    mongoengine took 25.81118331072267
The resulting call graph shows quite clearly where the bottleneck is:

Essentially mongoengine calls the to_python method on every piece of data returned from the db. to_python is pretty slow, and in our example it gets called an insane number of times.
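To get a feel for why that hurts, here is a stdlib-only sketch of a generic per-value converter in the spirit of to_python (an illustration of the pattern, not mongoengine's actual code), timed against simply handing back the raw data:

```python
import timeit

def convert(value):
    """Recursively dispatch on every value, roughly mimicking a
    generic to_python-style pass (illustrative, not mongoengine code)."""
    if isinstance(value, dict):
        return {k: convert(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert(v) for v in value]
    return value  # leaf: one Python-level call per element

raw = {"a": {"b": list(range(100000))}}  # 100k leaf values

raw_time = timeit.timeit(lambda: raw, number=10)
convert_time = timeit.timeit(lambda: convert(raw), number=10)
print("per-value conversion is {:.0f}x slower".format(convert_time / raw_time))
```

Even with a do-nothing conversion at each leaf, the sheer number of Python-level calls dominates, which is exactly what the call graph above shows.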
Mongoengine is designed to elegantly map your document structure to Python objects. If you have very large unstructured documents (which mongodb is great for), then mongoengine isn't really the right tool and you should just use pymongo.
However, if you know the structure, you can use EmbeddedDocument fields to get slightly better performance from mongoengine. I've run a similar but not equivalent test in this gist, and the output is:
    pymongo with dict took 0.12s
    pymongo with embed took 0.12s
    mongoengine with dict took 4.3059175412661075
    mongoengine with embed took 1.1639373211854682
So you can make mongoengine faster, but pymongo is much faster still.
UPDATE
A good shortcut to the pymongo interface here is to use the aggregation framework:
    def mongoengine_agg_doc():
        return list(MyModel.objects.aggregate({"$limit": 1}))[0]