POC for Hadoop in real time

I have a problem. I want to learn about Hadoop and how I can use it to process real-time data streams. Therefore, I want to create a significant POC around him, so that I can demonstrate it when I have to prove my knowledge about this to a potential employer or represent it in my current company.

I would also like to mention that I am limited by hardware resources. Just my laptop and me :) I know the basics of Hadoop and wrote 2-3 basic tasks of MR. I want to do something more meaningful or real.

Please offer.

Thanks in advance.

+6
source share
7 answers

I would like to point out a few things.

If you want to make POC with just one laptop, it makes little sense to use Hadoop.

In addition, other people claim that Hadoop is not intended for real-time applications, because there is some overhead when starting Map / Reduce jobs.

Saying this, Cloudera released Impala , which works with the Hadoop ecosystem (in particular, the hive metastat) to achieve real-time performance. Keep in mind that to achieve this, it does not create Map / Reduce jobs and is currently in beta testing mode, so use it carefully.

So I would really recommend going to Impala so you can still use the Hadoop ecosystem, but if you are also considering alternatives, here are a few other frameworks that might be useful:

  • Druid : MetaMarkets was opened. It looks interesting, although I myself have not used it.
  • Storm : there is no integration with HDFS, it just processes the data as it appears.
  • HStreaming : integrates with Hadoop.
  • Yahoo S4 : seems pretty close to Storm.

In the end, I think you should really analyze your needs and see what you need using Hadoop, because it only runs in real time. There are several more projects that could help you achieve real-time performance.


If you need project ideas for the show, I suggest looking at this link . Here are some examples:

  • Finance / Insurance
    • Classify investment opportunities as good or not. based on industry / company performance, portfolio diversity, and currency risk.
    • Classify credit card transactions as valid or invalid, for example. location of the transaction holder and credit card, date, amount, purchased product or service, transaction history and similar transactions.
  • Biology / Medicine
    • Classification of proteins into structural or functional classes
    • Diagnostic classification, for example. image-based cancers
  • the Internet
    • Classification and rating of documents
    • Malware classification, email / tweet / web spam.
  • Production systems (e.g. in the energy or petrochemical industries)
    • Classify and detect situations (e.g. weaknesses or risk situations) based on real-time data and historical sensor data
+9
source

If you want your hands to be dirty in a highly promising streaming infrastructure, try streaming BDAS SPARK. Attention, this has not yet been released, but you can play on your laptop with the github version ( https://github.com/mesos/spark/tree/streaming ) There are many examples to get you started.

In addition, it has many advantages over existing infrastructures, 1. It gives you the opportunity to combine real-time calculations and batches in one stack 2. It will give you a REPL where you can try your special queries interactively. 3. You can run this on your laptop in local mode. There are many other benefits, but these three, I think, are enough for your need to get started.

You may need to learn scala to try out REPL: - (

For more information, visit http://spark-project.org/

+3
source

Hadoop is a high-performance oriented platform suitable for batch processes. If you're interested in processing and analyzing huge datasets in real time, take a look at Twitter Storm.

+1
source

One of the cool problems that I'm sure is the most real than anything else. Options trading. The key here is to follow the news, trends on Twitter, facebook, youtube, and then identify candidates for possible PUT or CALL. You will need good craftsmanship and a well-thought-out implementation of Mahout with Nutch / Lucene, and then use the trend data to understand the current situation, and the system should recommend rates (options).

+1
source

I am clearly biased, but I would also recommend looking at GridGain for something in real time. GridGain is a memory data platform that provides NoID ACID data storage and MapReduce high-speed memory.

0
source

I think you can run POC, for example, an online recursive regression algorithm in mapreduce. But remember, this will simply prove that your β€œlearning rule" works. Perhaps (I have never tried this) you can use the results in real time, telling your reducers to write them to a temporary file that can be read by another stream.

Mahout also allows you to install your database in several different SequenceFile s. You can use this to simulate an online stream and classify / cluster the online dataset. You can even copy part of the data to a folder with other data before running the algorithm. Machu in action for more on how to do this.

See if one of the following datasets is right for you: http://archive.ics.uci.edu/ml/datasets.html

0
source

I was looking for something like this -

https://www.kaggle.com/competitions

These are well-defined problems, many of which are big data problems. And some of them require real-time processing.

But thanks to everyone who answered.

-1
source

All Articles