This is a very broad question, so it is difficult to answer. Here are a few things to consider:
Where are you going to get the data from? You mention Twitter, but you still have to collect the data somehow. There are probably libraries for listening to Twitter streams, or you could buy the data if someone sells it.
Where are you going to store the data? Depending on how much you have and what you plan to do with it, a traditional relational database may or may not be optimal. You might be better off with something that supports running mapreduce jobs out of the box.
Based on the answers to these questions, the choice of programming languages and libraries will be easier to make.
If you're really set on Java, I think a Hadoop cluster is probably where you want to start. It supports writing MapReduce jobs in Java and works as an efficient platform for other systems such as HBase, a column-oriented data store.
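To give a feel for the shape of such a job, here is a minimal sketch of the map and reduce steps in plain Java. It deliberately uses no Hadoop classes, so it only illustrates the idea and is not the Hadoop API; the class name TweetWordCount, its method names, and the sample tweets are all made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map and reduce phases; no Hadoop classes are used.
class TweetWordCount {

    // Map phase: emit a (word, 1) pair for every word in one tweet.
    static List<Map.Entry<String, Integer>> map(String tweet) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : tweet.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: run map over every tweet, then reduce the combined output.
    static Map<String, Integer> countWords(List<String> tweets) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String tweet : tweets) {
            emitted.addAll(map(tweet));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for collected tweets.
        List<String> tweets = List.of("Hadoop runs MapReduce jobs", "Hive runs on Hadoop");
        System.out.println(countWords(tweets));
    }
}
```

On a real cluster the framework would run the map step in parallel across many machines and group the emitted pairs by key before the reduce step; here a single loop stands in for both.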
If your data is fairly regular (i.e. the structure will not change much from one record to the next), Hive may be a better fit. With Hive, you can write SQL-like queries that take nothing but plain data files as input. I have never used Mahout, but I understand that its machine learning capabilities are suitable for data mining tasks.
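As a rough illustration of what "regular" data buys you, the sketch below computes in plain Java what a SQL-like query such as SELECT lang, COUNT(*) FROM tweets GROUP BY lang would return. The Tweet class, its lang and text fields, and the tweets table name are assumptions made up for this example; none of this is Hive code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TweetsByLanguage {

    // Hypothetical record layout: every tweet has the same two fields.
    static final class Tweet {
        final String lang;
        final String text;

        Tweet(String lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // Group tweets by language and count each group, like GROUP BY with COUNT(*).
    static Map<String, Long> countByLanguage(List<Tweet> tweets) {
        return tweets.stream()
                .collect(Collectors.groupingBy(t -> t.lang, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for a data file.
        List<Tweet> tweets = List.of(
                new Tweet("en", "hello"),
                new Tweet("de", "hallo"),
                new Tweet("en", "big data"));
        System.out.println(countByLanguage(tweets));
    }
}
```

Because every record has the same fields, the whole computation fits in a single declarative expression, which is the kind of work a SQL-like layer can take off your hands.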
These are just some of the ideas that come to mind. There are many options, and the choice between them has as much to do with the specific problem you are trying to solve and your own personal tastes as with anything else.