I have a solution that can be parallelized, but I don't have any experience with Hadoop/NoSQL yet, and I'm not sure which solution best fits my needs. In theory, if I had unlimited processors, my results should return almost instantly. Any help would be appreciated. Thanks!
Here is what I have:
- 1000s of datasets
- dataset keys:
  - all datasets have the same keys
  - 1 million keys (this may later grow to 10 or 20 million)
- dataset columns:
  - each dataset has the same columns
  - 10 to 20 columns
  - most columns are numeric values that we need to aggregate (avg, stddev, and use R to calculate statistics)
  - a few columns are type_id columns, since in a given query we may only want to include certain type_ids
- web application:
  - the user can choose which datasets they are interested in (anywhere from 15 to 1000)
  - the application needs to present, for each key, the aggregated results (avg, stddev) of each column (roughly the computation sketched after this list)
- data updates:
  - an entire dataset can be added, deleted, or replaced/updated
  - it would be nice to be able to add columns, but if necessary we can just replace the whole dataset instead
  - rows/keys are never added to an existing dataset, so we don't need a system with lots of fast writes
- infrastructure:
  - currently two machines with 24 cores each
  - eventually we also want to run this on Amazon
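For concreteness, the computation I need per query looks roughly like the sketch below. This is a minimal Python/numpy illustration only; the in-memory layout (a dict per dataset mapping key to a type_id and a row of numeric values), the function name, and the single-type_id simplification are all my own assumptions, not how the data is actually stored.

```python
import numpy as np

def aggregate(selected_datasets, wanted_type_ids):
    """For every key, compute avg and stddev of each numeric column
    across the selected datasets, keeping only rows whose (assumed
    single) type_id is in wanted_type_ids."""
    per_key = {}  # key -> list of row vectors, one per contributing dataset
    for ds in selected_datasets:
        for key, (type_id, values) in ds.items():
            if type_id in wanted_type_ids:
                per_key.setdefault(key, []).append(values)

    results = {}
    for key, rows in per_key.items():
        m = np.array(rows, dtype=float)  # shape: (n_datasets, n_columns)
        results[key] = (m.mean(axis=0), m.std(axis=0))  # per-column avg, stddev
    return results
```

Since every key is independent, the second loop can be split across any number of workers, which is why I expect this to parallelize well.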
I cannot precompute my aggregated values, but since each key is independent, this should scale easily. I currently have the data in a postgres database, with each dataset in its own partition:
- partitions are nice, because it is easy to add/remove/replace whole partitions
- the database is good at filtering based on type_id
- databases aren't easy to query in parallel, i.e. it's hard to get one query to use many cores at once (though fanning out per-partition queries from application code, sketched below, can work around this)
- databases are good for structured data, and my data is not really structured
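One way around the concurrency point, at least on a single box, might be to issue one aggregate query per partition from application code and merge the partial results. A rough sketch, assuming psycopg2, a partition-per-dataset naming scheme like dataset_&lt;id&gt;, and columns named key, type_id, and val1 (all of these names are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import psycopg2

DSN = "dbname=mydb user=me"  # placeholder connection string

def partial_aggregate(dataset_id, type_ids):
    """Fetch per-key partial sums (count, sum, sum of squares) for one
    partition; these merge exactly into avg and stddev later."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            # Table and column names here are hypothetical.
            cur.execute(
                f"SELECT key, COUNT(*), SUM(val1), SUM(val1 * val1) "
                f"FROM dataset_{dataset_id} "
                f"WHERE type_id = ANY(%s) GROUP BY key",
                (list(type_ids),),
            )
            return cur.fetchall()
    finally:
        conn.close()

# One query per selected partition, run concurrently across cores; merging
# the per-key (n, s, sq) triples gives avg = s/n and
# stddev = sqrt(sq/n - (s/n)**2).
with ThreadPoolExecutor(max_workers=24) as pool:
    partials = list(pool.map(lambda i: partial_aggregate(i, [1, 2]), range(15)))
```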
As a proof of concept, I tried Hadoop:
- created a delimited file per dataset for a particular type_id
- uploaded the files to HDFS
- map: retrieved a value/column for each key
- reduce: computed the mean and standard deviation (both steps sketched below)
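Concretely, the map and reduce steps amount to something like the following pair of Hadoop Streaming scripts. This is a Python sketch of what I did, with assumed field positions (key in the first column, the metric in the fourth), not my exact code.

```python
#!/usr/bin/env python3
# mapper.py - emit "key<TAB>value" for one numeric column of the input
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    key, value = fields[0], fields[3]  # assumed field positions
    print(f"{key}\t{value}")
```

```python
#!/usr/bin/env python3
# reducer.py - streaming sorts by key, so all values for a key arrive together
import sys
import math

def flush(key, values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    print(f"{key}\t{mean}\t{math.sqrt(variance)}")

current, values = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if current is not None and key != current:
        flush(current, values)
        values = []
    current = key
    values.append(float(value))

if current is not None:
    flush(current, values)
```

Both scripts run under the standard hadoop-streaming jar, with -mapper and -reducer pointing at them.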
From my rough proof of concept I can see that this will scale nicely, but I also see that Hadoop/HDFS has high latency and, from what I've read, is not usually used for real-time querying (even though I am fine with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for it. Should I look at Hive instead? Or Cassandra? Or Voldemort?
thanks!
anish