I would like to point out a few things.
If you want to build a proof of concept (POC) on a single laptop, it makes little sense to use Hadoop.
Also, as other people have pointed out, Hadoop is not designed for real-time applications, because starting Map/Reduce jobs carries some overhead.
That said, Cloudera has released Impala, which works with the Hadoop ecosystem (in particular, the Hive metastore) to achieve real-time performance. Keep in mind that it achieves this by not generating Map/Reduce jobs, and that it is currently in beta, so use it with care.
So I would really recommend looking at Impala, so that you can still use the Hadoop ecosystem. If you are also considering alternatives, here are a few other frameworks that might be useful:
- Druid: open-sourced by MetaMarkets. It looks interesting, although I have not used it myself.
- Storm: no integration with HDFS; it just processes data as it arrives.
- HStreaming: integrates with Hadoop.
- Yahoo S4: seems pretty close to Storm.
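The "process data as it arrives" model that Storm and S4 follow can be illustrated with a minimal pure-Python sketch. This is not Storm's API: the names `event_stream` and `running_counts` are hypothetical, and the generator only stands in for a live source. The point is that the result is updated per event instead of by launching a batch job over stored data.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def event_stream() -> Iterator[str]:
    """Hypothetical stand-in for a live source such as a message queue."""
    yield from ["click", "view", "click", "purchase", "click"]


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated snapshot of the counts after every single event."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        # A fresh result is available immediately, without a batch job.
        yield dict(counts)


if __name__ == "__main__":
    for snapshot in running_counts(event_stream()):
        print(snapshot)
```

A batch system would instead wait for all events to be stored and then compute the counts in one pass, which is where the job start-up overhead comes from.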
In the end, I think you should really analyze your needs and check whether Hadoop is what you need, because its real-time support is only just getting started. There are several other projects that could help you achieve real-time performance.
If you need ideas for projects to showcase, I suggest looking at this link. Here are some examples:
- Finance / Insurance
  - Classify investment opportunities as good or not, e.g. based on industry/company performance, portfolio diversity, and currency risk.
  - Classify credit card transactions as valid or invalid, e.g. based on the locations of the transaction and the credit card holder, date, amount, purchased product or service, transaction history, and similar transactions.
- Biology / Medicine
  - Classify proteins into structural or functional classes.
  - Diagnostic classification, e.g. cancer detection based on images.
- Internet
  - Classify and rank documents.
  - Classify malware and email/tweet/web spam.
- Production systems (e.g. in the energy or petrochemical industries)
  - Classify and detect situations (e.g. weak spots or risk situations) based on real-time and historical sensor data.
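As a rough illustration of the last idea, the sketch below learns a baseline from historical sensor readings and flags live readings that deviate strongly from it. The readings, the threshold of 3 standard deviations, and the function names are all assumptions made for this example; a real project would likely use a proper classifier.

```python
from statistics import mean, stdev
from typing import Iterable, Iterator, Tuple


def fit_baseline(history: Iterable[float]) -> Tuple[float, float]:
    """Summarise historical readings as (mean, sample standard deviation)."""
    values = list(history)
    return mean(values), stdev(values)  # stdev needs at least two values


def classify_stream(
    live: Iterable[float],
    baseline: Tuple[float, float],
    threshold: float = 3.0,  # assumed cut-off, in standard deviations
) -> Iterator[Tuple[float, str]]:
    """Label each live reading as it arrives, using the historical baseline."""
    mu, sigma = baseline  # assumes sigma > 0, i.e. the history is not constant
    for reading in live:
        z_score = abs(reading - mu) / sigma
        yield reading, ("risk" if z_score > threshold else "normal")


if __name__ == "__main__":
    historical = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]  # made-up temperatures
    baseline = fit_baseline(historical)
    for reading, label in classify_stream([20.1, 25.0, 19.7], baseline):
        print(reading, label)
```

With these made-up numbers, only the reading of 25.0 is labelled as a risk, because it is far outside the spread seen in the historical data.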