Updating a large number of entries in a collection

I have a collection called TimeSheet with several thousand entries; it will eventually grow to about 300 million records per year. Into this collection I copy several fields from another collection called Department. In most cases these fields will never be updated, and some records will be updated only rarely. By rarely, I mean once or twice a year, and only for less than 1% of the records in the collection.

Basically, once a department is created there will be no updates, and if an update does happen, it will happen early on (while there are not yet many related records in TimeSheet).

Now suppose someone updates a department a year from now. In the worst case, TimeSheet will contain about 300 million records, of which about 5 million match the department being updated. The update query's condition is on an indexed field.

Since this update is time-consuming and creates locks, I wonder whether there is a better way to do it. One option I am considering is to run the update in batches by adding an extra condition such as UpdatedDateTime > somedate && UpdatedDateTime < somedate.
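For what it is worth, the batching idea can be sketched as follows. This is a minimal sketch in JavaScript (mongo-shell style); the collection and field names (TimeSheet, departmentId, UpdatedDateTime) follow the question, while the window size, the $set payload, and the pause between batches are assumptions.

```javascript
// Build the filter for one batch: the target department plus a
// time window on UpdatedDateTime (field names from the question).
function buildBatchFilter(departmentId, windowStart, windowEnd) {
  return {
    departmentId: departmentId,
    UpdatedDateTime: { $gte: windowStart, $lt: windowEnd }
  };
}

// Split [start, end) into consecutive windows of `days` days each.
function dateWindows(start, end, days) {
  const windows = [];
  const step = days * 24 * 60 * 60 * 1000;
  for (let t = start.getTime(); t < end.getTime(); t += step) {
    windows.push([new Date(t), new Date(Math.min(t + step, end.getTime()))]);
  }
  return windows;
}

// In the mongo shell, each batch then becomes one updateMany call
// (illustrative; run against a live deployment):
//   for (const [s, e] of dateWindows(start, end, 30)) {
//     db.TimeSheet.updateMany(buildBatchFilter(deptId, s, e),
//                             { $set: { departmentName: newName } });
//     sleep(100); // yield between batches to reduce lock pressure
//   }
```

Smaller batches trade total runtime for shorter individual lock hold times, which is usually the right trade on a replica set that must keep serving reads.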

Other information:

A document is about 3-4 KB in size. We have a replica set with three members.

Is there a better way to do this? What do you think of this design? And how would your answer change if the numbers were smaller, as below?

1) 100 million total records and 100,000 matching records for the update query

2) 10 million total records and 10,000 matching records for the update query

3) 1 million total records and 1,000 matching records for the update query

Note: the names Department and TimeSheet and their purpose are fictitious — they are not real collections — but the statistics I gave are accurate.

1 answer

Let me give you a couple of tips based on my general knowledge and experience:

Use shorter field names

MongoDB stores every field name in every document. This repetition increases disk usage, which can cause performance issues in a very large database like yours.

Pros:

  • Smaller documents, and therefore less disk space
  • More documents fit in RAM (better caching)
  • Index sizes will be smaller in some scenarios

Cons:

  • Less readable field names
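As a rough illustration of the saving (a sketch with hypothetical field names, not the question's real schema):

```javascript
// Hypothetical TimeSheet document with verbose vs. abbreviated field names.
const verbose = { employeeName: "Alice", departmentName: "Sales", hoursWorked: 8 };
const short   = { en: "Alice", dn: "Sales", hw: 8 };

// Field names are repeated in every document, so the saving scales
// with the number of documents (here, bytes of JSON per document).
const savedPerDoc = JSON.stringify(verbose).length - JSON.stringify(short).length;
```

At 300 million documents a year, even a few dozen bytes per document adds up to gigabytes; a long-to-short name mapping can live at the application layer to keep the code readable.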

Index Size Optimization

The smaller the index, the more of it fits in RAM and the fewer misses you get during index scans. Consider, for example, git's SHA-1 hashes: a commit can usually be identified by its first 5-6 characters. So store those 5-6 characters instead of the entire hash.
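The git analogy could look like this in practice. This is a sketch — the field names and the prefix length of 6 are assumptions, and a real implementation must handle prefix collisions (e.g. fall back to a longer prefix, or compare against the stored full hash):

```javascript
// Index a short prefix of a long hash instead of the full 40-character value.
function hashPrefix(sha1, n) {
  return sha1.slice(0, n);
}

const fullHash = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12";
const doc = {
  shortHash: hashPrefix(fullHash, 6), // indexed: db.items.createIndex({ shortHash: 1 })
  fullHash: fullHash                  // kept for exact comparison on collisions
};
```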

Understand the padding factor

Updates that grow a document can trigger an expensive document move: the old document is deleted, the document is rewritten into new free space, and every index entry pointing to it is updated. All of this is costly.

We need to make sure the document does not move when it is updated. Each collection has a padding factor, which tells MongoDB at insert time how much extra space to allocate beyond the actual size of the document.

You can see the padding factor of a collection using:

 db.collection.stats().paddingFactor 

Manual padding

In your case, you will probably start with a small document that grows over time, and updating the document later will cause several document moves. It is therefore better to pad the document yourself. Unfortunately, there is no easy way to add padding: we can do it by adding some random bytes under a throwaway key at insert time, and then deleting that key in the next update.

Finally, if you are sure that certain keys will be added to documents in the future, pre-allocate those keys with default values so that later updates do not grow the document and cause it to move.
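A sketch of both techniques together, with illustrative field names and an assumed ~1 KB of filler:

```javascript
// Insert with a throwaway padding field plus pre-allocated future keys,
// then drop the filler on the first real update so the freed space
// absorbs document growth without a move.
const filler = new Array(1024).join("x"); // ~1 KB of throwaway bytes

const insertDoc = {
  employeeId: 42,
  departmentId: "d1",
  _padding: filler,   // reserves extra space inside the document's slot
  approvedBy: null,   // keys expected in future updates, pre-allocated
  approvedAt: null
};

// The first real update sets real data and removes the filler, e.g.:
//   db.TimeSheet.updateOne({ employeeId: 42 }, firstUpdate)
const firstUpdate = {
  $set: { approvedBy: "manager-7" },
  $unset: { _padding: "" }
};
```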

You can find the queries that caused a document to move with:

 db.system.profile.find({ moved: { $exists : true } }) 

A large number of collections vs. a large number of documents in a few collections

The schema depends on the application's requirements. If there is a huge collection in which we only query the last N days of data, we can choose to keep that data in a separate collection and safely archive the old data. This helps ensure that caching in RAM works well.

Each collection you create carries a cost beyond the act of creating it. Every collection has a minimum size of a few kilobytes plus one index (8 KB), and every collection has an associated namespace; by default there are about 24,000 namespaces. For example, one collection per user is a poor choice because it does not scale: after some point, Mongo will not let us create new collections or indexes.

As a rule, though, having a moderately large number of collections carries no significant performance penalty. For example, we could use one collection per month if we know we always query by month.
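A per-month scheme might derive collection names like this (the naming convention is an assumption):

```javascript
// Map a date to a monthly collection name, e.g. "TimeSheet_2020_03".
function monthlyCollection(base, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return base + "_" + y + "_" + m;
}

// A query for a given month then targets only that collection:
//   db[monthlyCollection("TimeSheet", someDate)].find({ ... })
```

This also makes archiving cheap: dropping an old month's collection is far faster than a mass delete from one giant collection.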

Data Denormalization

It is always recommended to keep all the data needed by a query, or a sequence of queries, in the same place on disk. This may require duplicating information across documents. For example, for a blog post you would store its comments inside the post document itself.

Pros:

  • The index will be much smaller, since there are fewer index entries
  • A query that gathers all the required details will be very fast
  • The document size will be comparable to the page size, so when we bring this data into RAM, we rarely drag unrelated data in along with the page
  • When a document does move, it frees a whole page rather than a tiny fragment of a page that cannot be reused for further inserts
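For example, a denormalized blog post might embed its comments directly (an illustrative schema, not from the question):

```javascript
// All data needed to render the post lives in one document,
// so a single read fetches the post and its comments:
//   db.posts.findOne({ _id: "post-1" })
const post = {
  _id: "post-1",
  title: "Schema design notes",
  body: "...",
  comments: [
    { author: "alice", text: "Nice write-up" },
    { author: "bob", text: "Agreed" }
  ]
};
```

The trade-off is that embedded arrays grow the document, which feeds back into the padding concerns above.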

Capped collections

Capped collections behave like circular buffers. They are special fixed-size collections that support high-throughput writes and sequential reads. Because the size is fixed, once the allocated space is full, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document's size (use padding for more flexibility).
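Creating one looks like this (the sizes here are illustrative assumptions):

```javascript
// Options for a capped collection: a hard byte limit plus an optional
// cap on the number of documents. In the mongo shell:
//   db.createCollection("timesheet_log", cappedOptions)
const cappedOptions = {
  capped: true,
  size: 100 * 1024 * 1024, // 100 MB of preallocated space
  max: 500000              // optional: at most 500k documents
};
```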
