Large-scale storage for append-only documents?

I need to store hundreds of thousands (now, potentially many millions) of documents that start out empty and are appended to frequently, but are never otherwise updated or deleted. These documents are not related to each other in any way; they just need to be accessible by a unique identifier.

Read access will be a subset of a document, almost always starting partway through at some indexed location (for example, "document #4324319, save #53 to the end").

These documents start out very small, at a few KB. Most reach a final size of around 500 KB, but many grow to 10 MB or more.

I am currently using MySQL (InnoDB) to store these documents. Each incremental save is simply dumped into one large table with the identifier of the document it belongs to, so reading part of a document looks like "SELECT * FROM save WHERE document_id = 14 AND save_id > 53 ORDER BY save_id", after which everything is concatenated together manually in code.
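For context, here is a minimal sketch of that read path in Python (the "payload" blob column is an assumption about the schema; everything else follows the query above):

    import mysql.connector

    def read_document_tail(conn, document_id, after_save_id):
        # Fetch every save after the given one, in order, and stitch
        # the fragments back together in code.
        cur = conn.cursor()
        cur.execute(
            "SELECT payload FROM save"
            " WHERE document_id = %s AND save_id > %s"
            " ORDER BY save_id",
            (document_id, after_save_id),
        )
        return b"".join(payload for (payload,) in cur)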

Ideally, I would like the storage solution to scale horizontally with ease, with redundancy across servers (for example, every document stored on at least 3 nodes) and easy recovery of failed servers.

I have looked at CouchDB and MongoDB as possible replacements for MySQL, but I'm not sure either of them makes much sense for this particular application, though I'm open to persuasion.

Any input on a good data storage solution?

database mongodb couchdb storage
5 answers

Sounds like the perfect kind of problem for HBase (on top of HDFS) to solve.

The main disadvantage is a somewhat steep learning curve.
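For illustration, here is a rough sketch of how this could map onto HBase with the Python happybase client. The table, column family, and key layout are all hypothetical: one row per save, keyed by document id plus a zero-padded save number, so "save #53 to the end" becomes a simple range scan.

    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = connection.table("documents")  # assumes a "d" column family exists

    def append_save(document_id, save_id, payload):
        # Composite row key: all saves for one document sort together,
        # in save order, because the save number is zero-padded.
        table.put(b"%d:%010d" % (document_id, save_id),
                  {b"d:payload": payload})

    def read_from(document_id, save_id):
        start = b"%d:%010d" % (document_id, save_id)
        stop = b"%d:~" % document_id  # '~' sorts after any digit
        return b"".join(data[b"d:payload"]
                        for _key, data in table.scan(row_start=start,
                                                     row_stop=stop))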


Is there a reason you need a database?

You are describing a "document store with unique names", which made me think "file system". Perhaps something like enterprise-class file server(s) (I estimate a maximum of about 200 TiB of data), where the unique identifier maps to a directory and file name on the network.


My immediate thought was: why store them in a database at all? Does a database really give better lookup performance than the file system when dealing with this many files?

I would think that storing them on a file system in a hashed directory structure would work better. You could use a database to store only the metadata (root directory, document identifier, save identifier, location relative to the root).

The root directories (nodes) would live in a separate table and be used for writes (enumerate and write to all locations), then round-robin (or some other load-balancing algorithm) across them for reads.

If a node is unavailable or a file does not exist, the load balancer can fail over to the next in line. Root directories could also be marked offline for scheduled outages, provided your read/write code honors the flag. The same scheme could be used for partitioning, where, as a simple example, x root directories serve odd identifiers and another x serve even identifiers.

Keeping the nodes synchronized could also be driven from that metadata.
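A minimal sketch of the hashed-directory idea in Python (the two-level fan-out and naming scheme are just one possible choice):

    import hashlib
    import os

    def save_path(root, document_id, save_id):
        # Hash the document id and use the first hex characters as two
        # levels of fan-out so no single directory grows too large.
        digest = hashlib.md5(str(document_id).encode()).hexdigest()
        return os.path.join(root, digest[:2], digest[2:4],
                            str(document_id), "%010d.save" % save_id)

    def append_save(root, document_id, save_id, payload):
        path = save_path(root, document_id, save_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(payload)

Reads would list the document's directory on whichever root the metadata points to and concatenate the save files in order.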

Just my 2 cents, since I have never dealt with this volume of files before.


OK, first a caveat: MongoDB has a limit on document size. However, the newest version (with its 16 MB limit) will cover your 10 MB size.

So, some points in favor of MongoDB.

Ideally, I would like the storage solution to scale horizontally with ease, with redundancy across servers (for example, every document stored on at least 3 nodes) and easy recovery of failed servers.

For replication, MongoDB supports replica sets. A replica set is a group of replicas with a single master. If the master goes down, the system automatically elects a new master (easy recovery). Adding a new node is as simple as starting a new server and pointing it at the existing set.
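With the Python driver, for example, connecting to a replica set is just a connection-string option; the host names and set name below are made up:

    from pymongo import MongoClient

    # The driver discovers the set members and automatically follows
    # the new master after a failover.
    client = MongoClient(
        "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0")
    db = client.docstore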

For horizontal scalability, MongoDB supports sharding. Sharding is a bit more complex, but it works the way you would expect, partitioning records across multiple machines (or multiple replica sets).
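Turning sharding on is a pair of admin commands run against the mongos router; the database, collection, and shard key here are illustrative:

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # hypothetical router
    client.admin.command("enableSharding", "docstore")
    # A hashed shard key spreads inserts evenly across shards.
    client.admin.command("shardCollection", "docstore.documents",
                         key={"_id": "hashed"})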

I need to store hundreds of thousands (now, potentially many millions) of documents that start out empty and are appended to frequently

Several companies run Mongo in production with billions of documents.

Mongo provides a set of atomic update modifiers that are very useful for the "append" case. In particular, check out the $push operator, which appends to the end of an array. It should be exactly what you need.
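A sketch of the append with the Python driver, assuming each document is stored as one record with a "saves" array (the schema is an assumption):

    from pymongo import MongoClient

    docs = MongoClient().docstore.documents

    def append_save(document_id, payload):
        # $push atomically appends to the "saves" array; upsert creates
        # the record (and the array) on the first save.
        docs.update_one({"_id": document_id},
                        {"$push": {"saves": payload}},
                        upsert=True)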

Read access will be a subset of a document, almost always starting partway through at some indexed location (for example, "document #4324319, save #53 to the end").

MongoDB allows you to return only selected fields (as expected). Depending on your layout, you can use dot notation to retrieve only specific sub-documents. If your saves are stored as an array, you can also use the $slice projection, which is well suited to the request above.
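Continuing the sketch above, the "save #53 to the end" read becomes a $slice projection. $slice takes a skip and a limit, so reading to the end means passing a limit at least as large as the array can grow (the cap below is illustrative):

    def read_from(document_id, from_save):
        # Skip the first `from_save` elements; there is no literal
        # "to the end" form for a positive skip, hence the large cap.
        doc = docs.find_one({"_id": document_id},
                            {"saves": {"$slice": [from_save, 1000000]}})
        return b"".join(doc["saves"]) if doc else b""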

So I think MongoDB covers all of your basic needs: easy to append, easy to query those appends, and replication built in. You get horizontal scaling via sharding (though start with just a replica set first).


Check out the SolFS virtual file system. It will work well in your environment.

