Real-time query / aggregation of millions of records - Hadoop? HBase? Cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop / NoSQL, and I'm not sure which solution best fits my needs. In theory, if I had unlimited processors, my results should return instantly. Any help would be appreciated. Thanks!

Here is what I have:

  • 1000s of datasets
  • dataset keys:
    • all datasets have the same keys
    • 1 million keys (this may later grow to 10 or 20 million)
  • dataset columns:
    • each dataset has the same columns
    • 10 to 20 columns
    • most columns contain numeric values that we need to aggregate over (avg, stddev, and use R to calculate statistics)
    • a few columns are type_id columns, since in a particular query we may want to include only certain type_ids
  • web application
    • the user can choose which datasets they are interested in (anywhere from 15 to 1000)
    • the application needs to present: the key and the aggregated results (avg, stddev) for each column
  • data updates:
    • an entire dataset can be added, dropped, or replaced / updated
    • it would be great to be able to add columns, but if necessary I can just replace the entire dataset
    • rows / keys are never added to a dataset - so I don't need a system with lots of fast writes
  • Infrastructure:
    • there are currently two machines with 24 cores each
    • eventually I want to be able to also run this on Amazon

I cannot precompute my aggregated values, but since each key is independent, this should scale easily. I currently have this data in a Postgres database, where each dataset is its own partition.

  • partitions are nice, because I can easily add / drop / replace partitions
  • the database is nice for filtering based on type_id
  • databases aren't easy for writing parallel queries
  • databases are good for structured data, and my data is not structured
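
For illustration, a minimal sketch in Python of the per-key aggregation I am describing (the dataset names, values and in-memory layout here are made up, not my real data):

    from statistics import mean, stdev

    # Toy in-memory version of the computation: every dataset has the same keys
    # and (here) a single numeric column; names and numbers are made up.
    datasets = {
        "ds_001": {"key_1": 10.0, "key_2": 12.5},
        "ds_002": {"key_1": 11.0, "key_2": 13.0},
        "ds_003": {"key_1":  9.5, "key_2": 14.2},
    }

    selected = ["ds_001", "ds_002", "ds_003"]   # the user picks 15..1000 datasets

    for key in datasets[selected[0]]:           # all datasets share the same keys
        values = [datasets[ds][key] for ds in selected]
        print(key, mean(values), stdev(values))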

As a proof of concept, I tried Hadoop:

  • created a delimited file per dataset for a particular type_id
  • uploaded to hdfs
  • map: retrieved a value / column for each key
  • reduce: computed the mean and standard deviation (both steps are sketched below)
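
Not the actual POC code, but a minimal Hadoop Streaming style sketch of those two steps in Python (the delimiter, the column index and the key position are assumptions). The mapper emits "key<TAB>value" for one numeric column:

    # mapper.py - emit "key<TAB>value" for one numeric column of a delimited line
    import sys

    COLUMN = 3                                   # index of the numeric column; placeholder
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > COLUMN:
            print(fields[0] + "\t" + fields[COLUMN])

and the reducer, which relies on Hadoop sorting the mapper output by key:

    # reducer.py - lines arrive sorted by key; compute mean and stddev per key
    import sys, math

    def flush(key, vals):
        n = len(vals)
        m = sum(vals) / n
        var = sum((v - m) ** 2 for v in vals) / n
        print("%s\t%f\t%f" % (key, m, math.sqrt(var)))

    current, vals = None, []
    for line in sys.stdin:
        key, val = line.rstrip("\n").split("\t")
        if current is not None and key != current:
            flush(current, vals)
            vals = []
        current = key
        vals.append(float(val))
    if current is not None:
        flush(current, vals)

Both are wired together with the hadoop-streaming jar, roughly: -input / -output paths on HDFS plus -mapper mapper.py -reducer reducer.py.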

From my rough proof of concept, I can see that this will scale nicely, but I can also see that Hadoop / HDFS has latency, and I've read that it is not typically used for real-time querying (even though I'm OK with returning results back to users within 5 seconds).

Any suggestions on how I should approach this? I was thinking of trying HBase next, to get a feel for it. Should I instead take a look at Hive? Cassandra? Voldemort?

thanks!

+7
5 answers

Hive or Pig don't seem like they would help you. Essentially, each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.

HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up how to compute running averages so that you don't have to do heavyweight reduces.

check out http://en.wikipedia.org/wiki/Standard_deviation

stddev(X) = sqrt(E[X^2] - (E[X])^2)

This means that you can get the stddev of the combined data AB by doing

sqrt(E[AB^2] - (E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|)
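
A small Python sketch of combining per-dataset partial aggregates this way - only the count, the sum and the sum of squares per dataset need to be kept (the numbers below are made up):

    import math

    # Per-dataset partial aggregates for one column: (count, sum, sum of squares).
    # These are cheap to precompute and cheap to merge at query time.
    partials = [
        (1_000_000, 5_200_000.0, 31_000_000.0),   # dataset A (made-up numbers)
        (1_000_000, 4_900_000.0, 28_500_000.0),   # dataset B
    ]

    n  = sum(p[0] for p in partials)
    s  = sum(p[1] for p in partials)
    s2 = sum(p[2] for p in partials)

    mean = s / n                         # E[X]
    ex2  = s2 / n                        # E[X^2]
    stddev = math.sqrt(ex2 - mean ** 2)  # sqrt(E[X^2] - (E[X])^2)
    print(mean, stddev)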

+6

Since your data seems pretty homogeneous, I would definitely take a look at Google BigQuery - you can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you build a web application based on your queries. In fact, depending on how you design your application, you could build a fairly "real-time" application.
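
As a sketch of what that could look like (not your actual schema - the table and column names below are hypothetical, and this uses the google-cloud-bigquery Python client rather than raw REST calls), the whole aggregation becomes a single query:

    # Rough sketch, assuming the data was ingested into one BigQuery table
    # my_project.my_dataset.measurements with columns (key, dataset_id, type_id, value).
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT key, AVG(value) AS avg_value, STDDEV(value) AS stddev_value
        FROM `my_project.my_dataset.measurements`
        WHERE dataset_id IN UNNEST(@datasets)
          AND type_id = @type_id
        GROUP BY key
    """
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ArrayQueryParameter("datasets", "STRING", ["ds_001", "ds_003"]),
                bigquery.ScalarQueryParameter("type_id", "STRING", "type_a"),
            ]
        ),
    )
    for row in job.result():
        print(row.key, row.avg_value, row.stddev_value)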

+4

This is a problem that doesn't have an exceptionally good open source solution. In the commercial space, MPP databases such as Greenplum / Netezza would do. Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time... Regardless of the engine used, I think the solution should include holding the entire dataset in memory - that should give an idea of what cluster size you need.
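
A back-of-envelope sizing for the "entire dataset in memory" point, assuming 8-byte numeric values and ignoring keys and overhead (a rough sketch, not a measurement):

    # Rough sizing from the numbers in the question; all values are assumptions.
    datasets = 1000          # "1000s of datasets"
    keys     = 20_000_000    # 1 million today, maybe 10-20 million later
    columns  = 20            # 10 to 20 columns
    bytes_per_value = 8      # assuming 8-byte doubles, ignoring keys/overhead

    total = datasets * keys * columns * bytes_per_value
    print(total / 1024 ** 3, "GiB")   # ~3000 GiB at the high end; ~150 GiB with 1M keys

That gives a feel for whether two 24-core machines are enough, or whether this already calls for a small cluster.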

+2

If I understand you correctly, and you only need to aggregate on single columns at a time, you could store your data differently for better results in HBase. That would look something like: a table per data column in today's setup, plus one additional table for the filtering fields (the type_ids); a row for each key in today's setup; and a column for each dataset (table) in today's setup, i.e. a few thousand columns. You may want to think about how to incorporate your filter fields into the row key for efficient filtering; otherwise you'd have to do a two-phase read. HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist. When you read a row you get all the relevant values, over which you can compute avg. etc. quite easily.
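
A small pure-Python illustration of putting the filter field into the row key (the layout and separator are just one possible choice). HBase keeps rows sorted by row key, so a scan with the type_id as prefix only touches the relevant rows:

    # Illustration of a composite row key: <type_id>|<key>.
    # HBase stores rows sorted by row key, so scanning with the prefix b"type_a|"
    # reads only rows of that type - no second lookup table needed.
    def row_key(type_id: str, key: str) -> bytes:
        return f"{type_id}|{key}".encode()

    rows = sorted(row_key(t, k) for t in ("type_a", "type_b")
                                for k in ("key_000001", "key_000002"))

    prefix = b"type_a|"
    matching = [r for r in rows if r.startswith(prefix)]   # what a prefix scan would return
    print(matching)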

+2

A plain old database might do for this. It doesn't sound like you have a transactional system, so you can probably get away with just one or two large tables. SQL has problems when you need to join over large data, but since your datasets don't sound like they need joins, you should be fine. You can set up indexes to find a dataset, and do your math either in SQL or in the application.
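
A minimal sketch of that approach, assuming one wide table measurements(dataset_id, key, type_id, col1, ...) in Postgres with an index on (dataset_id, type_id) - the table and column names are placeholders - with the math pushed into SQL:

    # Sketch only: one wide table, the aggregation done by Postgres itself.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=myuser")
    cur = conn.cursor()

    cur.execute(
        """
        SELECT key, AVG(col1), STDDEV_SAMP(col1)
        FROM measurements
        WHERE dataset_id = ANY(%s)
          AND type_id = %s
        GROUP BY key
        """,
        (["ds_001", "ds_003"], "type_a"),
    )
    for key, avg_value, stddev_value in cur.fetchall():
        print(key, avg_value, stddev_value)

    cur.close()
    conn.close()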

0
