I am exploring options for organizing data storage for an Erlang application. The data it needs to store is essentially a huge collection of binary blobs indexed by short string identifiers. Each blob is under 10 KB, but there are many of them; I expect the total size to reach roughly 200 GB, so it obviously cannot fit in memory. A typical operation on this data is either reading a blob by its identifier, updating a blob by its identifier, or adding a new one. At any given time only a subset of the identifiers is in use, so access performance would benefit from an in-memory cache. Performance is very important: the goal is roughly 500 reads and 500 updates per second on commodity hardware (say, an EC2 VM).
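To make the access pattern concrete, here is a minimal sketch of the kind of API I have in mind: a read-through ETS cache in front of some disk-backed store. The `blob_disk` module is a placeholder for whatever backend ends up being used, not a real library.

```erlang
-module(blob_store).
-export([init/0, read/1, write/2]).

%% Create the in-memory cache table.
init() ->
    ets:new(blob_cache, [named_table, public, set]).

%% Read a blob by its identifier, falling back to disk on a cache miss.
read(Id) ->
    case ets:lookup(blob_cache, Id) of
        [{Id, Blob}] ->
            {ok, Blob};                        % cache hit
        [] ->
            case blob_disk:read(Id) of         % placeholder disk lookup
                {ok, Blob} ->
                    ets:insert(blob_cache, {Id, Blob}),
                    {ok, Blob};
                not_found ->
                    not_found
            end
    end.

%% Write (add or update) a blob and keep the cache in sync.
write(Id, Blob) ->
    ok = blob_disk:write(Id, Blob),            % placeholder disk write
    ets:insert(blob_cache, {Id, Blob}),
    ok.
```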
Any suggestions on what to use here? As far as I understand, dets is out of the question since it is limited to 2 GB (or is it 4 GB?) per file. Mnesia probably won't fit either; my impression is that it is mainly intended for cases where the data fits in memory. I am considering the Berkeley DB driver from EDTK for this task. Would it work in the scenario above? Does anyone have experience using it in production under similar conditions?
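For completeness, the only workaround I can think of for the dets size limit is sharding the data across many dets files by hashing the key, along these lines (module, table names, and shard count are made up for illustration):

```erlang
-module(sharded_dets).
-export([open/1, read/2, write/3]).

%% Open NumShards dets tables, one file per shard.
open(NumShards) ->
    [begin
         Name = list_to_atom("blobs_" ++ integer_to_list(N)),
         {ok, Name} = dets:open_file(Name, [{file, atom_to_list(Name) ++ ".dets"}]),
         Name
     end || N <- lists:seq(0, NumShards - 1)].

%% Pick the shard for a given identifier by hashing it.
shard(Tables, Id) ->
    lists:nth(erlang:phash2(Id, length(Tables)) + 1, Tables).

read(Tables, Id) ->
    case dets:lookup(shard(Tables, Id), Id) of
        [{Id, Blob}] -> {ok, Blob};
        []           -> not_found
    end.

write(Tables, Id, Blob) ->
    dets:insert(shard(Tables, Id), {Id, Blob}).
```

I am not convinced this would hold up at 200 GB and 1000 ops/second, which is why I am asking about alternatives.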