Starting with Data Mining

I have started studying data mining and want to build a small project in C++ or Java that uses a database, say from Twitter, and then publishes a specific set of results (for example, for all news items in a feed). How should I go about this? Where do I begin?

+4
6 answers

This is a very broad question, so it is difficult to answer. Here are a few things to consider:

  • Where are you going to get the data? You mention Twitter, but you still have to collect the data somehow. There are probably libraries for listening to Twitter streams, or you could buy the data if someone sells it.

  • Where are you going to store the data? Depending on how much you have and what you plan to do with it, a traditional relational database may or may not be optimal. You might be better off with something that supports running MapReduce jobs out of the box.

Based on the answers to these questions, the choice of programming languages and libraries will be easier to make.

If you're really set on Java, I think a Hadoop cluster is probably where you want to start. It supports writing MapReduce jobs in Java and works as an efficient platform for other systems such as HBase, a column-oriented data store.
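To get a feel for what such a job does before setting up a cluster, here is a minimal sketch of the map, shuffle, and reduce phases in plain Java. It does not use the Hadoop API at all; the class `WordCountSketch` and its method names are made up for illustration, and the two sample "tweets" are invented input.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the MapReduce pattern (hypothetical names, no Hadoop).
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence on a list of text records.
    static Map<String, Integer> countWords(List<String> records) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String record : records) {
            emitted.addAll(map(record));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of("Data mining is fun", "Mining data with Java");
        // prints {data=2, fun=1, is=1, java=1, mining=2, with=1}
        System.out.println(countWords(tweets));
    }
}
```

On a real cluster the framework would handle the shuffle and distribute the map and reduce calls across machines; the sketch only shows the shape of the computation.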

If your data is fairly regular (i.e. its structure will not change much from one record to the next), Hive may be better suited. With Hive, you can write SQL-like queries that take plain data files as input. I have never used Mahout, but I understand that its machine learning capabilities are suitable for data mining tasks.

These are just some of the ideas that come to mind. There are many options, and the choice among them depends as much on the specific problem you are trying to solve and on your own personal tastes as on anything else.

+4

If you just want to start exploring Data Mining, there are two books that I especially like:

Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer.

And this one, which is free:

http://infolab.stanford.edu/~ullman/mmds.html

+1

Some good links for you:

1) AI course taught by people who really know the subject

2) Weka Website

3) Machine Learning Datasets

4) More Datasets

5) Framework to support the development of larger datasets

The first link is a good introduction to AI, taught by Peter Norvig and Sebastian Thrun, Google's Director of Research and the creator of Stanley (the autonomous car), respectively.

The second link takes you to the Weka website. Download the software (it's pretty intuitive) and get the book. Make sure that you understand all the concepts: what data mining is, what machine learning is, what the most common tasks are, and what the rationale behind them is. Play a lot with the examples (the software package bundles some data sets) until you understand what led to the results.

Then move on to real data sets and play with them. When you are working with massive data sets, you may run into performance issues with Weka; it is more of a learning tool as far as I can tell. Therefore, I recommend that you take a look at the fifth link, which takes you to the Apache Mahout website.

This is far from a simple topic, but it is quite interesting.

+1

I can tell you how I did it.

1) I got the data using twitter4j.

2) I analyzed the data using JUNG. You must define a class representing edges and a class representing vertices. These classes will contain the attributes of the edges and vertices.

3) Then it is just a matter of calling a simple function to add edges, g.addEdge(edgeFromV1ToV2, V1, V2), or to add vertices, g.addVertex(V).

A class that defines edges or vertices is easy to create. As an example:

`public class MyEdge {
    // Attributes of the edge go here, for example a unique identifier.
    int id;
}`

The same thing is done for the vertices. Today I would do it with R, but if you do not want to learn a new programming language, just use JUNG, which is a Java library.
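The vertex class, edge class, and add-calls described above can be tried out without any dependency using the rough sketch below. Note that `SimpleGraph` is a hypothetical stand-in written for this example and is not the JUNG API; `UserVertex`, `MentionEdge`, and the sample users are likewise invented. With JUNG on the classpath, the library's own graph class would take the place of `SimpleGraph`.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GraphSketch {
    public static void main(String[] args) {
        SimpleGraph<UserVertex, MentionEdge> g = new SimpleGraph<>();
        UserVertex alice = new UserVertex("alice");
        UserVertex bob = new UserVertex("bob");
        UserVertex carol = new UserVertex("carol");
        g.addVertex(alice);
        g.addEdge(new MentionEdge(1), alice, bob);   // alice mentions bob
        g.addEdge(new MentionEdge(2), alice, carol); // alice mentions carol
        // prints vertices=3, edges=2, degree(alice)=2
        System.out.println("vertices=" + g.getVertexCount()
                + ", edges=" + g.getEdgeCount()
                + ", degree(alice)=" + g.degree(alice));
    }
}

// Hypothetical vertex type: one Twitter user.
class UserVertex {
    final String screenName;
    UserVertex(String screenName) { this.screenName = screenName; }
}

// Hypothetical edge type: one "mentions" relation between two users.
class MentionEdge {
    final int id;
    MentionEdge(int id) { this.id = id; }
}

// Minimal stand-in for a graph container (assumption: NOT the JUNG API).
class SimpleGraph<V, E> {
    private final Set<V> vertices = new LinkedHashSet<>();
    private final Map<E, List<V>> endpoints = new HashMap<>();

    // Returns false if the vertex was already present.
    boolean addVertex(V v) {
        return vertices.add(v);
    }

    // Adds the edge and, if needed, its two endpoints.
    boolean addEdge(E edge, V from, V to) {
        if (endpoints.containsKey(edge)) {
            return false;
        }
        vertices.add(from);
        vertices.add(to);
        endpoints.put(edge, List.of(from, to));
        return true;
    }

    int getVertexCount() { return vertices.size(); }

    int getEdgeCount() { return endpoints.size(); }

    // Number of edges that touch the given vertex.
    int degree(V v) {
        int d = 0;
        for (List<V> ends : endpoints.values()) {
            if (ends.contains(v)) {
                d++;
            }
        }
        return d;
    }
}
```

Keeping the attributes in dedicated vertex and edge classes means the graph container stays generic, which is the design the answer describes.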

+1

Data mining is a wide field with many different techniques: classification, clustering, association and pattern mining, outlier detection, etc.
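To make one of these techniques concrete, here is a small sketch of outlier detection in Java. It uses a simple z-score rule (flag values that lie more than a chosen number of standard deviations from the mean), which is only one of many possible approaches; the class name `OutlierSketch`, the threshold of 2.0, and the sample numbers are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Z-score outlier detection on a small numeric sample (illustrative names).
class OutlierSketch {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / xs.length);
    }

    // Returns the values whose absolute z-score exceeds the threshold.
    static List<Double> outliers(double[] xs, double threshold) {
        List<Double> result = new ArrayList<>();
        double m = mean(xs);
        double sd = stdDev(xs);
        if (sd == 0.0) {
            return result; // all values identical, nothing stands out
        }
        for (double x : xs) {
            if (Math.abs((x - m) / sd) > threshold) {
                result.add(x);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample, e.g. retweet counts for seven tweets.
        double[] retweets = {10, 11, 9, 10, 12, 10, 50};
        // prints [50.0]
        System.out.println(outliers(retweets, 2.0));
    }
}
```

The z-score rule assumes roughly bell-shaped data and is itself distorted by extreme values, so the data mining literature covers more robust methods as well.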

You must first decide what you want to do, and then decide which algorithm you need.

If you are new to data mining, I would recommend reading some books, such as Tan, Steinbach, and Kumar's Introduction to Data Mining.

0

I would suggest using Python or R for data mining. Doing the work in Java or C is more complicated in the sense that you need to write a lot more code.

0
