Hadoop, Hive, Pig, HBase, Cassandra - when to use what?

First of all, I'm relatively new to Big Data and Hadoop, and I have just started experimenting with the Hortonworks sandbox (Pig and Hive). In which cases should I use each of the tools above: Hadoop, Hive, Pig, HBase and Cassandra?

In my sandbox environment, with a file of only 9 MB, Hive and Pig had response times of seconds to minutes. That is obviously unsuitable for some situations, such as web applications (unless the cause is something else, for example the setup of my virtual machine).

My assumptions about the correct use:

  • Hadoop: just the technological base for the rest, only very few use cases where it will be used directly.
  • Hive or Pig: for analytical processes that run once per hour or day.
  • HBase or Cassandra: for real-time applications (e.g. web applications) where response times of 100 ms or less are required

Also, when to use HBase as opposed to when to use Cassandra?

Thanks!

cassandra hadoop hive apache-pig
1 answer

Your guesses are somewhat accurate.

By Hadoop, I think you mean MapReduce? Hadoop as such is an ecosystem that consists of many components (including MapReduce, HDFS, Pig, and Hive).

MapReduce is good when you need to write data processing logic at the level of the map() and reduce() methods. In my work, I find MapReduce very useful when dealing with unstructured data that needs to be cleaned.
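To make the map()/reduce() level concrete, here is a minimal sketch in plain Python. It only imitates the three phases (map, shuffle, reduce) inside a single process and does not use Hadoop's actual API; the function names `map_record`, `shuffle`, `reduce_group` and `run_job` are made up for illustration. The map step does the cleaning of messy text, and the reduce step aggregates per key.

```python
import string
from collections import defaultdict
from itertools import chain


def map_record(line):
    """Map step: clean one raw line and emit (word, 1) pairs."""
    pairs = []
    for token in line.lower().split():
        # Strip surrounding punctuation; tokens that were only punctuation vanish.
        word = token.strip(string.punctuation)
        if word:
            pairs.append((word, 1))
    return pairs


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce over an iterable of raw lines."""
    mapped = chain.from_iterable(map_record(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    raw = ["  Hello, WORLD!! ", "", "hello -- again"]
    print(run_job(raw))
```

In a real cluster the framework distributes the map and reduce calls across machines and performs the shuffle itself; you only supply the two functions, which is why this level suits custom cleaning logic.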

Hive, Pig: they are good for batch processes that run periodically (for example, every few hours or days).

HBase and Cassandra: both support low-latency calls, so they can be used for real-time applications where response time is key. Take a look at this discussion to get a better understanding of HBase vs Cassandra.
