I'm working on a project in which we periodically collect a large volume of e-mail via IMAP or POP, run analysis on it (for example, clustering it into conversations, extracting important sentences, etc.), and then present the results to the end user over the web.
The main view will be a profile page for each contact, similar to a Facebook profile, showing the last (roughly 20) conversations we have had with them in the collected email.
It is important that we can fetch a profile page and its last 20 items very quickly. We will also frequently be inserting the newest emails into these feeds. For that, MongoDB's document storage and cheap atomic updates look quite attractive.
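For context, here is a minimal sketch of the update pattern I have in mind: a single atomic `$push` with `$each` and `$slice` that appends a new conversation summary and trims the feed to the newest 20 entries. The collection and field names (`contacts`, `recent_conversations`) are hypothetical, and the apply step below only simulates the server-side semantics locally so the snippet runs without a database:

```python
# Sketch: atomic "keep only the newest N items" update for a contact's feed.
# Collection/field names are hypothetical. With pymongo this would be sent as:
#   db.contacts.update_one({"_id": contact_id}, make_feed_update(item))

def make_feed_update(new_item, limit=20):
    """Build a MongoDB update document that pushes one item and trims the array."""
    return {
        "$push": {
            "recent_conversations": {
                "$each": [new_item],
                "$slice": -limit,  # negative $slice keeps only the newest `limit` entries
            }
        }
    }

def apply_feed_update(doc, update):
    """Locally simulate what the server does with the $push/$each/$slice update."""
    push = update["$push"]["recent_conversations"]
    items = doc.setdefault("recent_conversations", [])
    items.extend(push["$each"])
    doc["recent_conversations"] = items[push["$slice"]:]
    return doc

if __name__ == "__main__":
    doc = {"_id": "alice", "recent_conversations": [f"c{i}" for i in range(25)]}
    doc = apply_feed_update(doc, make_feed_update("c25", limit=20))
    print(len(doc["recent_conversations"]), doc["recent_conversations"][-1])
```

The appeal here is that the trim happens in the same atomic operation as the insert, so the hot feed document never grows past a small, fixed size.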
However, we will also have a large volume of old email that is rarely accessed (since it will not appear in the last 20 items, people will only see it when they explicitly search for it, which will be relatively rare). Moreover, this data will grow faster than the contact store over time.
From what I have read, MongoDB more or less requires the working set to stay in RAM, and the only way around that is its memory-mapped (virtual memory) storage, which can incur significant paging overhead. In particular, if Mongo cannot distinguish between the hot data (profiles/feeds) and the cold data (old emails), this could get rather unpleasant (and since it apparently delegates memory management to the OS, I don't see how Mongo could make that distinction).
It would seem that the only choices are either (a) buy enough RAM to hold everything, which is fine for the hot data but hardly cost-effective for terabytes of email, or (b) rely on virtual memory and watch reads/writes on our hot data slow to a crawl.
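One mitigation I am wondering about (a sketch of the idea, not something I have tested): split the hot and cold data into separate collections, so the feed documents the OS needs to keep paged in stay small, and the feed stores only summaries plus references into the archive. All collection and field names here are hypothetical:

```python
# Sketch of a hot/cold split (all names are hypothetical).
# Hot collection ('contacts'): small per-contact feed docs, read on every page view.
# Cold collection ('email_archive'): full email bodies, fetched only on lookups.

def feed_entry(email, snippet_len=120):
    """Small summary kept in the hot feed; references the cold archive by id."""
    return {
        "email_id": email["_id"],              # reference into the cold archive
        "subject": email["subject"],
        "snippet": email["body"][:snippet_len],  # short preview only, not the body
        "date": email["date"],
    }

def archive_doc(email):
    """Full document stored untouched in the cold archive collection."""
    return dict(email)

if __name__ == "__main__":
    email = {"_id": 1, "subject": "Hi", "body": "x" * 1000, "date": "2011-01-01"}
    entry = feed_entry(email)
    print(sorted(entry))  # the hot entry carries no full body
```

With this layout the hot collection's total size scales with the number of contacts rather than with the total volume of email, which seems like it would keep the paging pressure bounded, but I would like to know whether this is how people actually deal with it.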
Is this right, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?
database mongodb database-design storage
Andrew J