MongoDB schema design - Many small documents or fewer large documents?

Background
I am prototyping a conversion from our RDBMS database to MongoDB. While denormalizing, it seems that I have two options: one leads to many (millions of) smaller documents, the other to fewer (hundreds of thousands of) large documents.

If I could distill it to a simple analogue, it would be the difference between a collection with fewer Customer documents like this (in Java):

 class Customer {
     private String name;
     private Address address;
     // each CreditCard has hundreds of Payment instances
     private Set<CreditCard> creditCards;
 }

or a collection with many, many Payment documents, such as:

 class Payment {
     private Customer customer;
     private CreditCard creditCard;
     private Date payDate;
     private float payAmount;
 }

Question
Is MongoDB designed to prefer many, many small documents or fewer large documents? Does the answer depend mainly on which queries I plan to run (for example, how many credit cards does customer X have, versus what was the average amount all customers paid last month)?

I have looked around a lot, but I have not come across any MongoDB schema best practices that would help me answer my question.

+66
mongodb database-design schema
Jun 14 2018-10-06
3 answers

You will definitely need to optimize for the queries you make.

Here is my best guess based on your description.

You probably want to know all the credit cards for each Customer, so keep an array of them inside the Customer object. You probably also want a Customer reference on each Payment. This keeps the Payment document relatively small.

The Payment object will automatically get its own identifier and an index on it. You probably want to add an index on the Customer reference as well.

This will allow you to quickly search for payments by Customer without storing the entire Customer object every time.
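A minimal sketch of that layout in plain Java, with no MongoDB driver involved. Each payment carries only the customer's identifier, and a map keyed by that identifier stands in for the secondary index. The names `PaymentIndexSketch`, `customerId`, `creditCardId`, and `amountCents` are assumptions made for illustration; they are not part of the question's model or of any MongoDB API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PaymentIndexSketch {
    // Hypothetical small payment document: it stores a reference
    // (customerId), not a copy of the whole customer.
    record Payment(String id, String customerId, String creditCardId, long amountCents) {}

    // Builds an in-memory stand-in for an index on the customer reference:
    // customerId -> that customer's payments.
    static Map<String, List<Payment>> indexByCustomer(List<Payment> payments) {
        Map<String, List<Payment>> index = new HashMap<>();
        for (Payment p : payments) {
            index.computeIfAbsent(p.customerId(), k -> new ArrayList<>()).add(p);
        }
        return index;
    }

    // A lookup by customer touches only that customer's entries.
    static List<Payment> findByCustomer(Map<String, List<Payment>> index, String customerId) {
        return index.getOrDefault(customerId, List.of());
    }
}
```

In a real deployment the database maintains the index; the point here is only that the payment needs the customer's identifier and nothing else from the customer.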

If you want to answer questions such as “What is the average total all customers paid last month?”, you will need a map / reduce for any sizable data set. You will not get that answer in real time. You will find that storing a “reference” to the Customer is probably good enough for these map / reduces.
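The shape of such a batch computation can be sketched with Java streams. This is an in-memory illustration, not MongoDB's own map / reduce facility, and the record fields and method names are hypothetical. It groups one month's payments by customer identifier, sums them, and then averages the per-customer totals, which shows that the payment needs nothing from the customer beyond its identifier.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MonthlyAverageSketch {
    // Hypothetical payment shape: amounts are kept in cents to avoid
    // floating-point rounding.
    record Payment(String customerId, LocalDate payDate, long amountCents) {}

    // "Map": keep the payments of the given month, keyed by customerId.
    // "Reduce": sum the amounts for each customer.
    static Map<String, Long> totalsByCustomer(List<Payment> payments, YearMonth month) {
        return payments.stream()
                .filter(p -> YearMonth.from(p.payDate()).equals(month))
                .collect(Collectors.groupingBy(Payment::customerId,
                        Collectors.summingLong(Payment::amountCents)));
    }

    // Final step: average of the per-customer totals, in cents.
    // Returns 0.0 when no customer paid in that month.
    static double averageTotal(Map<String, Long> totals) {
        return totals.values().stream().mapToLong(Long::longValue).average().orElse(0.0);
    }
}
```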

So, to answer your question directly: Is MongoDB designed to prefer many, many small documents or fewer large documents?

MongoDB is designed to find indexed entries very quickly. MongoDB is very good at finding a few needles in a large haystack. MongoDB is not very good at finding most of the needles in the haystack. So build your data around your most common use cases and write map / reduce jobs for the rarer ones.

+68
Jun 22 '10 at 4:10

According to MongoDB's own documentation, it looks like it is intended for many small documents.

From Top Recommendations for MongoDB:

The maximum size of documents in MongoDB is 16 MB. In practice, most documents are a few kilobytes or less. Think of documents as more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, make each record a document.

From 6 Rules of Thumb for MongoDB Schema Design: Part 1:

Modeling One-to-Few

An example of a one-to-few relationship would be the addresses for a person. This is a good use case for embedding - you'd put the addresses in an array inside your Person object.

One-to-Many

An example of a one-to-many relationship might be the parts for a product in a replacement parts ordering system. Each product may have up to several hundred replacement parts, but never more than a couple of thousand or so. This is a good use case for referencing - you'd put the ObjectIDs of the parts in an array in the product document.

One-to-Squillions

An example of one-to-squillions is an event logging system that collects log messages for different machines. Any given host could generate enough messages to overflow the 16 MB document size, even if all you stored in the array was the ObjectID. This is the classic use case for “parent-referencing” - you'd have a document for the host, and then store the ObjectID of the host in the documents for the log messages.
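The three shapes quoted above can be sketched as plain Java records. This is an illustration of the document layouts only; the type and field names are assumptions, identifiers are plain strings rather than ObjectIDs, and the lookups run over in-memory lists instead of collections.

```java
import java.util.List;

final class RelationshipSketch {
    // One-to-few: the addresses are embedded in the person document.
    record Address(String street, String city) {}
    record Person(String id, String name, List<Address> addresses) {}

    // One-to-many: the product holds an array of part identifiers.
    record Part(String id, String name) {}
    record Product(String id, String name, List<String> partIds) {}

    // One-to-squillions: each log message stores its host's identifier.
    record Host(String id, String hostname) {}
    record LogMessage(String id, String hostId, String message) {}

    // Resolving the one-to-many references takes a second lookup over the parts.
    static List<Part> partsOf(Product product, List<Part> allParts) {
        return allParts.stream()
                .filter(p -> product.partIds().contains(p.id()))
                .toList();
    }

    // The host document never grows; its messages are found by hostId.
    static List<LogMessage> messagesFor(Host host, List<LogMessage> allMessages) {
        return allMessages.stream()
                .filter(m -> m.hostId().equals(host.id()))
                .toList();
    }
}
```

Only the first shape makes the parent document larger as related data is added; the third keeps the parent at a fixed size no matter how many children exist.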

+8
May 13 '16 at 18:14

Documents that grow substantially over time can be ticking time bombs. Network bandwidth and RAM usage will probably become measurable bottlenecks, forcing you to start over.

First, consider two collections: Customer and Payment. That keeps the grain fairly small: one document per payment.

Then you have to decide how to model account information, such as credit cards. Consider whether customer documents should contain arrays of account information or whether you need a new Account collection.

If account documents are separate from customer documents, loading all the accounts of one customer into memory requires fetching multiple documents. That can mean extra memory, I/O, bandwidth, and CPU usage. Does this mean that an Account collection is a bad idea?

Your decision affects payment documents. If account information is embedded in a customer document, how would you reference it? Separate account documents have their own _id attribute. With embedded account information, your application would either generate new identifiers for accounts or use account attributes (for example, the account number) as the key.
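A rough sketch of the embedded variant, assuming the application generates an identifier for each embedded account with `java.util.UUID`. All type, field, and method names are hypothetical. A payment then has to carry both the customer's identifier and the application-generated account identifier, and resolving the account means loading the customer first.

```java
import java.util.List;
import java.util.Optional;
import java.util.UUID;

final class EmbeddedAccountSketch {
    // Embedded accounts get no identifier of their own from the database,
    // so the application assigns one (assumption: a random UUID).
    record Account(String accountId, String accountNumber) {}
    record Customer(String id, String name, List<Account> accounts) {}
    record Payment(String id, String customerId, String accountId, long amountCents) {}

    static Account newEmbeddedAccount(String accountNumber) {
        return new Account(UUID.randomUUID().toString(), accountNumber);
    }

    // Resolving the reference requires the owning customer document,
    // then a scan of its embedded accounts.
    static Optional<Account> resolve(Payment payment, Customer customer) {
        if (!payment.customerId().equals(customer.id())) {
            return Optional.empty();
        }
        return customer.accounts().stream()
                .filter(a -> a.accountId().equals(payment.accountId()))
                .findFirst();
    }
}
```

With a separate Account collection the payment would instead hold a single account identifier and look the account up directly.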

Could a payment document actually contain all the payments made in a fixed time frame (for example, a day)? That kind of complexity will affect all the code that reads and writes payment documents. Premature optimization can be deadly for projects.

Like account documents, payments are easy to reference if a payment document contains only one payment. For example, a new document type, such as a credit, could reference a payment. But would you create a credit collection, or would you embed credit information inside the payment information? What happens if you later need to reference a credit?

To summarize, I have had success with lots of small documents and many collections. I implement references with _id and only with _id. That way I do not worry about ever-growing documents destroying my application. The schema is easy to understand and index, because each entity has its own collection. Important entities are not hidden inside other documents.

I would like to hear about your findings. Good luck.

+5
Apr 18 '14 at 19:25


