How MongoDB stacks up for very large datasets where only some of the data is volatile

I'm working on a project in which we periodically collect a large volume of e-mail via IMAP or POP, perform analysis on it (for example, clustering it into conversations, extracting important sentences, etc.), and then present views of it to the end user via the web.

The main view will be a Facebook-like profile page for each contact, showing the most recent (approximately 20) conversations that contact took part in, drawn from the e-mail we collected.

It's important for us to be able to retrieve the profile page and its most recent 20 items quickly. We will also frequently be inserting new e-mails into this feed. For this, document storage and MongoDB's low-cost atomic writes seem quite attractive. Roughly, the access pattern I have in mind looks like the sketch below.
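
A minimal sketch of that pattern with pymongo (collection and field names here are hypothetical, just to make the idea concrete):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["maildb"]

def add_email_to_feed(contact_id, email_summary):
    # A single atomic $push appends the newest item to the contact's feed.
    db.profiles.update_one(
        {"_id": contact_id},
        {"$push": {"recent_conversations": email_summary}},
    )

def get_profile(contact_id):
    # The $slice projection returns only the last 20 feed items,
    # so the profile-page read stays small and fast.
    return db.profiles.find_one(
        {"_id": contact_id},
        {"name": 1, "recent_conversations": {"$slice": -20}},
    )
```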

However, we will also have a large volume of old e-mails that will not be accessed often (since they won't appear in the most recent 20 items, people will only see them if they search for them, which will be relatively rare). Furthermore, this data will grow faster over time than the contact store.

From what I've read, MongoDB seems to more or less require the entire data set to remain in RAM, and the only way to work around this is to use virtual memory, which can carry significant overhead. In particular, if Mongo can't differentiate between the volatile data (profiles/feeds) and the non-volatile data (old e-mails), this could turn out quite nasty (and since it apparently delegates virtual memory management to the OS, I don't see how Mongo could make that distinction).

It would seem the only choices are to either (a) buy enough RAM to store everything, which is fine for the volatile data but hardly cost-effective for terabytes of e-mail, or (b) use virtual memory and watch reads/writes on our volatile data slow to a crawl.

Is this correct, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?

+8
database mongodb database-design storage
4 answers

MongoDB uses mmap to map documents into virtual memory (not physical RAM). Mongo does not require the entire data set to fit in RAM, but you do want your "working set" to fit in memory (the working set is a subset of your entire data set).

If you want to avoid mapping large volumes of e-mail into virtual memory, you can have your profile document include an array of ObjectIds that refer to e-mails stored in a separate collection.
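
A sketch of that layout (the collection and field names are illustrative, not prescribed): the hot profile document holds only lightweight references, while the bulky e-mail bodies live in their own collection and are only paged in when someone actually opens one.

```python
from pymongo import MongoClient

db = MongoClient()["maildb"]

# Bulky, rarely read e-mail bodies live in their own collection...
email_id = db.emails.insert_one(
    {"subject": "Q3 report", "body": "...", "received": "2011-04-01"}
).inserted_id

# ...while the profile document stores only ObjectId references,
# so reading a profile never touches the big e-mail documents.
db.profiles.update_one(
    {"_id": "contact-42"},
    {"$push": {"recent_email_ids": email_id}},
    upsert=True,
)

# To render the feed, fetch just the referenced e-mails.
profile = db.profiles.find_one({"_id": "contact-42"})
recent = db.emails.find(
    {"_id": {"$in": profile["recent_email_ids"][-20:]}}
)
```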

+2

MongoDB doesn't require the entire data set to be kept in RAM. See http://www.mongodb.org/display/DOCS/Caching for an explanation of why and how it uses virtual memory the way it does.

It should work fine for this application. If your sorting and filtering were more complex, you might, for example, want to use a Map/Reduce operation to build a "ready to display" collection, but for a simple sort by date the existing indexes will work just fine, along the lines of the sketch below.
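
For the simple case, that could be one compound index supporting the feed query (field names are assumed, not from the question):

```python
from pymongo import ASCENDING, DESCENDING, MongoClient

emails = MongoClient()["maildb"]["emails"]

# One compound index covers "the latest messages for this contact".
emails.create_index([("contact_id", ASCENDING), ("received", DESCENDING)])

# The feed query walks the index and stops after 20 documents,
# never touching the old e-mails that make up the bulk of the data.
latest = list(
    emails.find({"contact_id": "contact-42"})
          .sort("received", DESCENDING)
          .limit(20)
)
```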

+3

@Andrew J Typically you need enough RAM to hold your working set; this is true for MongoDB just as it is for an RDBMS. So if you want to serve the last 20 e-mails for all users without ever going to disk, you need that much memory. If this exceeds the memory of a single system, you can use MongoDB's sharding feature to spread the data across multiple machines, thereby aggregating the memory, CPU, and I/O bandwidth of the machines in the cluster.
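
If it came to that, enabling sharding is a couple of admin commands against an already-deployed sharded cluster; a sketch (database, collection, and shard-key names are hypothetical):

```python
from pymongo import MongoClient

# Connect to the mongos router of the sharded cluster.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the e-mail collection
# on contact_id so each contact's mail is grouped on one shard.
client.admin.command("enableSharding", "maildb")
client.admin.command(
    "shardCollection", "maildb.emails", key={"contact_id": 1}
)
```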

@mP MongoDB lets the application developer specify the durability of writes, from acknowledgement by a single node in memory up to replication to multiple nodes on disk. The choice is yours, depending on your needs and how critical the data is; not all data is created equal. In addition, as of MongoDB 1.8 you can start the server with --dur, which writes a journal file of all writes. This further improves write durability and speeds up recovery if there is a crash.
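
In current driver terms, that per-write durability choice surfaces as a write concern; a sketch with pymongo (the specific w/j values are just examples of the spectrum described above, not a recommendation):

```python
from pymongo import MongoClient, WriteConcern

db = MongoClient()["maildb"]

# Cheap: acknowledged by one node, no journal flush required.
fast = db.get_collection(
    "profiles", write_concern=WriteConcern(w=1, j=False)
)

# Durable: wait for a majority of nodes and the on-disk journal.
safe = db.get_collection(
    "emails", write_concern=WriteConcern(w="majority", j=True)
)

fast.update_one({"_id": "contact-42"}, {"$set": {"seen": True}})
safe.insert_one({"subject": "important", "body": "..."})
```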

+1

And what happens if your machine crashes while everything Mongo had is still in memory? I guess it has no journal, so the answer is probably bad luck.

-7
