Why Hadoop is not a real-time platform

I just started learning Hadoop, and on several sites, including here on SO, I keep reading that "Hadoop is not a real-time platform".

I am struggling with this statement and really cannot understand it. Can someone explain it to me?

Thanks everyone

hadoop real-time
1 answer

Hadoop was originally designed for batch processing. This means taking a large data set as input all at once, processing it, and writing a large output. The MapReduce concept itself is batch-oriented rather than real-time. But to be fair, this was only true in the early days of Hadoop, and now you have many ways to use Hadoop in a more real-time fashion.
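To illustrate what "batch-oriented" means here, the following is a minimal sketch of the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_batch_job`) and the word-count task are assumptions chosen for illustration. The point is that the grouping step needs the complete map output, so no result is available until the whole input has been processed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key; this consumes the complete map output."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_batch_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run the phases in order; the result exists only after all input is read."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical two-line data set standing in for a large input file.
    dataset = ["hadoop is batch", "batch is not real time"]
    print(run_batch_job(dataset))
```

On a real cluster each phase is distributed and involves job scheduling and disk I/O, which is where much of the latency of a MapReduce job comes from.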

First, I think it is important to define what you mean by real time. You may be interested in stream processing, or you may want to run queries on your data that return results in real time.

For stream processing, Hadoop does not provide such capabilities out of the box, but you can easily integrate other projects with it:

  • Storm-YARN allows you to use Storm on your Hadoop cluster through YARN.
  • Spark integrates with HDFS so you can process streaming data in real time.
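As a conceptual contrast with a batch job, which produces output only after the entire input has been read, here is a minimal sketch of stream-style processing in plain Python. It is not the Storm or Spark API; the function name `streaming_word_count` and the word-count task are assumptions for illustration. Each incoming record immediately updates the running result, so an up-to-date answer is available at any moment.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def streaming_word_count(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated snapshot of the word counts after each incoming line."""
    counts: Counter = Counter()
    for line in lines:
        # Update the running state as soon as a record arrives.
        counts.update(word.lower() for word in line.split())
        # Expose the current result without waiting for the stream to end.
        yield dict(counts)


if __name__ == "__main__":
    # Hypothetical stream; in practice this could be an unbounded source.
    incoming = iter(["storm on yarn", "spark on hdfs"])
    for snapshot in streaming_word_count(incoming):
        print(snapshot)
```

Real stream processors add what this sketch leaves out, such as distribution across nodes, fault tolerance, and windowing.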

For real-time queries, there are also several projects that use Hadoop:

  • Impala from Cloudera uses HDFS but bypasses MapReduce altogether, because MapReduce would add too much overhead.
  • Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
  • The Stinger project aims to make Hive itself more real-time.

There are probably other projects that would fit into this list of ways to make Hadoop real-time, but these are the best known.

So, as you can see, Hadoop is moving more and more in the real-time direction, and even though it was not designed for this, you have many options for extending it toward real-time use.
