Ruby on Rails / Merb as an interface for an application with billions of entries

I am looking for a backend solution for an application written in Ruby on Rails or Merb that has to handle several billion records. I have a feeling that I need to go with a distributed model, and at the moment I am looking at:

HBase with Hadoop

CouchDB

The problems, as I see them: Ruby support for HBase is not very strong, and CouchDB has not yet reached version 1.0.

Do you have any suggestions for what to use with such a large amount of data?

The data needs to be imported fairly quickly, sometimes 30-40 MB at a time, but imports will come in batches. So about 95% of the time the data will be read-only.

+4
5 answers

Depending on your actual data usage, MySQL or Postgres should be able to handle several billion records on the right hardware. If you have a high volume of queries, both of these databases can be replicated across multiple servers, and read replication is pretty easy to set up compared to multi-master (write) replication.

The big advantage of using an RDBMS with Rails or Merb is that you get access to all the excellent tool support for these types of databases.

My advice is to actually profile your data on a couple of these systems and go from there.

+1

There are several different solutions out there. In my experience, the choice depends more on your usage patterns for this data than on the number of rows in the table.

For example: "How many inserts/updates occur per second?" Questions like this will help you decide which database solution to choose.

Take Google, for example: no existing storage/search solution really suited their needs, so they built their own based on the Map/Reduce model.

+1

A word of warning about HBase and other projects of this nature (I don't know anything about CouchDB - I think it is not really a database at all, just a key-value store):

  • HBase is not tuned for speed; it is tuned for scalability. If response speed matters at all, run some proofs of concept before committing to this path.
  • HBase does not support joins. If you use ActiveRecord and have more than one relationship ... well, you can see where this is going.

The Hive project, also built on top of Hadoop, supports joins; so does Pig (but it's not really SQL). Point 1 applies to both. They are meant for heavy data processing tasks, not for the type of processing you are likely to do with Rails.

If you want scalability for a web application, basically the only strategy that works is to partition your data and do as much as possible to keep the partitions isolated, so they don't need to talk to each other. This is a bit tricky with Rails, since it assumes by default that there is one central database. There may have been improvements on this front since I looked at the problem about a year and a half ago. If you can shard your data, you can scale horizontally quite far. A single MySQL machine can handle a few million rows (PostgreSQL can probably scale to more rows, but may run a little slower).
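A minimal sketch of the partitioning idea, in plain Ruby. The shard names, the `SHARDS` list, and the `shard_for` helper are all hypothetical; in a real setup each name would map to a separate database connection, and this is only one of several ways to pick a partition.

```ruby
require 'zlib'

# Hypothetical shard names; in practice each would be a separate database server.
SHARDS = %w[shard_0 shard_1 shard_2 shard_3].freeze

# Pick a shard from the partition key alone, so a request for one key
# never needs to consult another partition.
def shard_for(key)
  SHARDS[Zlib.crc32(key.to_s) % SHARDS.size]
end

# The same key always maps to the same shard.
puts shard_for(42)
puts shard_for('customer-1001')
```

Note that with simple modulo hashing like this, changing the number of shards remaps most keys, so the shard count is something to think about up front.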

Another strategy that works is a master-slave setup, where all writes go to the master and reads are shared among the slaves (and possibly the master as well). Obviously, this has to be done quite carefully! Assuming a high read/write ratio, it can scale very well.
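To make the read/write split concrete, here is a rough sketch in plain Ruby. `FakeConnection` and `ReadWriteRouter` are hypothetical names used only for illustration; they are not part of Rails or of any database driver, and the "anything that is not a SELECT is a write" rule is a simplifying assumption.

```ruby
# Hypothetical stand-in for a database connection; it only records statements.
class FakeConnection
  attr_reader :name, :log

  def initialize(name)
    @name = name
    @log = []
  end

  def execute(sql)
    @log << sql
    name
  end
end

# Sends writes to the master and spreads reads across the slaves round-robin.
class ReadWriteRouter
  def initialize(master, slaves)
    @master = master
    @slaves = slaves
    @cursor = 0
  end

  def execute(sql)
    connection_for(sql).execute(sql)
  end

  private

  def connection_for(sql)
    # Assumption: anything that is not a plain SELECT counts as a write.
    return @master unless sql.lstrip.match?(/\Aselect\b/i)
    return @master if @slaves.empty?

    slave = @slaves[@cursor % @slaves.size]
    @cursor += 1
    slave
  end
end

master = FakeConnection.new('master')
slaves = [FakeConnection.new('slave_1'), FakeConnection.new('slave_2')]
router = ReadWriteRouter.new(master, slaves)

puts router.execute("INSERT INTO entries (body) VALUES ('x')")
puts router.execute('SELECT * FROM entries')
puts router.execute('SELECT COUNT(*) FROM entries')
```

Part of the care needed is that slaves can lag behind the master, so a read issued right after a write may not see it yet; code that needs to read its own writes would have to go to the master.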

If your organization has deep pockets, check out what Vertica, AsterData and Greenplum have to offer.

+1

The backend will depend on the data and on how the data will be accessed.

But for the ORM, I would most likely use DataMapper and write a custom DataObjects adapter to reach whatever backend you choose.

0

I'm not sure that CouchDB not being at 1.0 has anything to do with it. I would recommend doing some testing with it (just generate a billion random documents) and see whether it holds up. I would say that it will, despite the lack of a particular version number.
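A rough sketch of how such test data could be generated in batches with plain Ruby. The field names and the `random_document` and `bulk_payload` helpers are hypothetical. The `{"docs": [...]}` body shape is the one CouchDB's `_bulk_docs` endpoint accepts; the sketch only builds the JSON and leaves the actual HTTP POST out.

```ruby
require 'json'
require 'securerandom'

# Build one random document. The field names are made up for this test.
def random_document
  {
    '_id' => SecureRandom.uuid,
    'name' => SecureRandom.hex(8),
    'value' => rand(1_000_000),
    'created_at' => Time.now.to_i
  }
end

# Build the JSON body for one bulk request containing `size` documents.
def bulk_payload(size)
  JSON.generate({ 'docs' => Array.new(size) { random_document } })
end

payload = bulk_payload(1_000)
puts "#{payload.bytesize} bytes for 1000 documents"
```

Each payload would be POSTed to the database's `_bulk_docs` URL with a JSON content type, and the loop repeated until the target document count is reached. Batch size is worth experimenting with, since it affects import speed.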

CouchDB will help you a lot when it comes to partitioning/sharding your data, and it looks like it might fit your project, especially if the data format may change in the future (adding or removing fields), since CouchDB databases have no schema.

CouchDB also has many optimizations for read-heavy applications and, based on my experience with it, that is where it really shines.

0
