Classifying Twitter / Facebook comments into various categories

I have a comment dataset that I want to categorize into five categories:

jewelry, clothes, shoes, electronics, food & beverages

So, if someone talks about pork, steak, wine, or soda (things they eat or drink), the comment should be classified as food & beverages.

If someone mentions a word like gold, pendant, medallion, etc., it should be classified as jewelry.

I want to know which tags / tokens should be searched for in a comment / tweet in order to classify it into any of these categories, and finally which classifier to use. I just need some recommendations and suggestions; I can take it from there.
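To make this concrete, here is the kind of naive keyword lookup I could start from myself (the keyword lists and function name below are just made-up examples, not a real lexicon):

    CATEGORY_KEYWORDS = {
        "jewelry": {"gold", "pendant", "medallion", "ring", "necklace"},
        "clothes": {"shirt", "dress", "jeans", "jacket"},
        "shoes": {"shoes", "sneakers", "boots", "heels"},
        "electronics": {"phone", "laptop", "camera", "headphones"},
        "food & beverages": {"pork", "steak", "wine", "soda", "pizza"},
    }

    def classify_naive(comment):
        # Very naive: split on whitespace and intersect with each keyword set.
        tokens = set(comment.lower().split())
        return [cat for cat, words in CATEGORY_KEYWORDS.items() if tokens & words]

    print(classify_naive("loved the gold pendant and some wine"))
    # -> ['jewelry', 'food & beverages']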

Please help. Thanks.

+7
python facebook machine-learning twitter nltk
3 answers

Well, this is a pretty broad topic.

You mentioned Python, so you should take a look at the NLTK library, which lets you process natural language text such as your comments.

After this step, you need a classifier that maps the words you extracted to a particular class. NLTK also has classification tools that build on knowledge databases. If you're lucky, the categories you are looking for are already available; otherwise, you may have to build them yourself. Have a look at this example, which uses NLTK and the WordNet database: you get access to synsets, whose coverage seems pretty wide, and you can also walk up to hypernyms (see, for example, list(dog.closure(hyper))).
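For instance, that hypernym walk can be reproduced in a few lines (assuming the WordNet corpus has been fetched with nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    # Look up a synset and walk its hypernym hierarchy upwards.
    dog = wn.synset('dog.n.01')
    hyper = lambda s: s.hypernyms()

    # Every more general concept above 'dog': canine, carnivore, ..., entity.
    print(list(dog.closure(hyper)))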

Basically, you should consider using a multi-label classifier over all the tokenized text (Facebook comments and tweets are usually short; consider capping FB comments at 200 characters, your choice). The choice of a multi-label classifier is motivated by the fact that your classification set is not orthogonal (clothes, shoes and jewelry can be one and the same object, you can have electronic jewelry such as smartwatches, etc.). This is a fairly simple setup, but it is an interesting first step whose strengths and weaknesses let you iterate easily (if necessary).
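One simple way to get multi-label behavior is "binary relevance": train one independent yes/no classifier per category. A minimal sketch with NLTK's Naive Bayes, where the tiny training set is a made-up placeholder:

    import nltk

    CATEGORIES = ["jewelry", "clothes", "shoes", "electronics", "food & beverages"]

    # Placeholder labeled data: (feature dict, set of true categories).
    train = [
        ({"gold": True, "pendant": True}, {"jewelry"}),
        ({"steak": True, "wine": True}, {"food & beverages"}),
        ({"sneakers": True, "gold": True}, {"shoes", "jewelry"}),
    ]

    # One binary (True/False) classifier per category.
    classifiers = {
        cat: nltk.NaiveBayesClassifier.train(
            [(feats, cat in labels) for feats, labels in train])
        for cat in CATEGORIES
    }

    def classify(feats):
        # A document gets every category whose classifier says True.
        return [cat for cat, clf in classifiers.items() if clf.classify(feats)]

    print(classify({"wine": True, "medallion": True}))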

Good luck

+3

This answer may be a little long, and maybe I will skip a few things, but it should give you an idea and some advice.

Supervised vs. Unsupervised

As already mentioned, in the land of machine learning there are two main roads: supervised and unsupervised learning. As you probably already know, if your data (documents) are labeled, you are talking about supervised learning. The labels are the categories, and in this case they are boolean values. For example, if a text refers to clothing and shoes, the labels for those two categories should be true.

Since a text can be associated with several categories (several labels), we are looking at multi-label classifiers.
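Concretely, a single multi-label example could look like this (pure illustration, using the five categories from the question):

    CATEGORIES = ["jewelry", "clothes", "shoes", "electronics", "food & beverages"]

    # One document that mentions both clothing and shoes:
    # the labels for those two categories are True, the rest are False.
    document = "matching dress and heels for the party"
    labels = {"jewelry": False, "clothes": True, "shoes": True,
              "electronics": False, "food & beverages": False}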

What to use?

I assume the dataset is not labeled yet, since Twitter does not do this categorization for you. So here comes a big decision on your part.

  • You label the data manually, which means you go through as many tweets / FB posts in your dataset as you can, and for each of them you consider the 5 categories and answer True / False for each.
  • You decide to use an unsupervised learning algorithm and hope it finds these 5 categories. Approaches such as clustering will simply try to find categories on their own, and there is no guarantee they will match your 5 predefined categories.

In the past I have mostly used supervised learning and had good experiences with it, so I will continue explaining that route.

Feature Engineering

You need to come up with the features you want to use. For text classification, a good approach is to use every possible word in the document collection as a feature: true means the word is present in the document, false means it is not.
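A minimal sketch of such a binary word-presence feature function (the names are illustrative):

    def document_features(tokens, vocabulary):
        # One boolean feature per word in the overall vocabulary:
        # True if the word occurs in this document, False otherwise.
        token_set = set(tokens)
        return {word: (word in token_set) for word in vocabulary}

    vocabulary = ["gold", "wine", "shoes"]
    print(document_features(["gold", "pendant"], vocabulary))
    # {'gold': True, 'wine': False, 'shoes': False}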

Before doing this, you need to pre-process the text. This can be done using various functions provided by the NLTK library; a combined sketch follows the list below.

  • Tokenization: this breaks your text into a list of words. You can use this module.
  • Stop word removal: this removes common words, like 'a', from the tokens. You can take a look at this.
  • Stemming: this transforms words into their root form. For example, "working", "works" and "worked" would all be converted to "work". Take a look at this one.
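Putting the three steps together with NLTK might look like this (assuming the required corpora have been fetched with nltk.download('punkt') and nltk.download('stopwords')):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()

    def preprocess(text):
        tokens = nltk.word_tokenize(text.lower())            # 1. tokenization
        tokens = [t for t in tokens
                  if t.isalpha() and t not in stop_words]    # 2. stop word removal
        return [stemmer.stem(t) for t in tokens]             # 3. stemming

    print(preprocess("I am working on some new golden pendants!"))
    # e.g. ['work', 'new', 'golden', 'pendant']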

Once you have pre-processed the data, generate a feature set for every word that occurs in the documents. There are automatic methods and filters for this, but I'm not sure how to do that in Python.

Classification

There are several classifiers you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried it. In the past I have used Logistic Regression and SVMs.
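For illustration, here is a minimal multi-label setup in scikit-learn (mentioned again under Tools below), wrapping Logistic Regression in a one-vs-rest scheme; the tiny document and label lists are placeholders for your manually labeled data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy stand-ins for your manually labeled documents.
    docs = ["gold pendant on sale", "new leather shoes",
            "wine and steak tonight", "silver medallion necklace"]
    labels = [["jewelry"], ["shoes"], ["food & beverages"], ["jewelry"]]

    # Turn the label sets into a binary indicator matrix.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    # Binary word-presence features + one Logistic Regression per category.
    model = make_pipeline(CountVectorizer(binary=True),
                          OneVsRestClassifier(LogisticRegression()))
    model.fit(docs, Y)

    pred = model.predict(["gold medallion and a glass of wine"])
    print(mlb.inverse_transform(pred))  # predicted label sets, e.g. [('jewelry',)]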

Training and testing

You will use part of your data for training and part to verify that the trained model works correctly. I suggest you use cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The advantage of cross-validation is that you do not have to split your dataset into a fixed training set and test set up front. Instead, it works in several rounds, each round using part of the data for training and part for testing. As a result, every data point is used for both training and testing across the rounds.
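With scikit-learn, cross-validation over the model from the previous sketch is essentially one call (assuming you have enough labeled documents for 5 folds):

    from sklearn.model_selection import cross_val_score

    # `model`, `docs` and `Y` as in the previous sketch, but with enough
    # labeled examples that a 5-fold split makes sense.
    scores = cross_val_score(model, docs, Y, cv=5)
    print("accuracy per fold:", scores)
    print("mean accuracy:", scores.mean())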

Prediction

Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook posts / tweets.

Tools

The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I have heard a lot of good things about the scikit-learn Python library. But honestly, I prefer to use Weka, a Java data mining tool that offers a great interface and speeds up the task!


From a Different Angle: Topic Modeling

In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modeling. It may not be useful in your scenario if you really only care about those categories (which is why I leave this part for the end of my answer). However, if your goal is to classify tweets / FB posts into non-predefined categories, topic modeling is the way to go.

Topic modeling is an unsupervised learning method in which you decide in advance how many topics (categories) you want to "discover". This number may be high (e.g. 40). The great thing is that the algorithm will find 40 topics, each containing words that are related to one another. It also outputs, for each document, a distribution indicating which topics the document is about. This way you can discover many more categories than your 5 predefined ones.
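A minimal sketch with the gensim library, whose LDA implementation is a common choice (the tokenized texts below are placeholders, and num_topics would be your chosen number of topics):

    from gensim import corpora, models

    # Placeholder pre-processed documents (lists of tokens).
    texts = [["gold", "pendant", "medallion"],
             ["wine", "steak", "soda"],
             ["sneakers", "leather", "shoes"],
             ["gold", "necklace", "ring"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Discover 2 topics; each topic is a distribution over words.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    print(lda.print_topics())

    # Topic distribution of the first document.
    print(lda[corpus[0]])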

I am not going to go deeper into this here; just search Google if you need more information. Also, you might consider using MALLET, which is a great tool for topic modeling.

+6

What you are looking for relates to two topics:

  • Natural Language Processing (NLP): text processing and
  • Machine learning (where classification models are built)

First I suggest going through tutorials on NLP, and then tutorials on text classification; the most suitable is https://class.coursera.org/nlp/lecture

If you are looking for libraries available in Python or Java, check out "Java or Python for natural language processing"

If you are new to text processing, check out the NLTK library, which is a nice introduction to NLP; see http://www.nltk.org/book/ch01.html


Now to the hardcore details:

  • First, ask yourself whether your Twitter / Facebook comments (let's call them documents from now on) are manually labeled with the categories you want.

    1a. If YES, look at supervised machine learning; see http://scikit-learn.org/stable/tutorial/basic/tutorial.html

    1b. If NO, look at unsupervised learning; I suggest clustering and topic modeling, see http://radimrehurek.com/gensim/

  • Once you know what kind of machine learning you need, split the documents at least into a training set (70-90%) and a test set (10-30%); see the sketch after this list.

    Note: this is only a suggestion, because there are other ways to split your documents, e.g. keeping a development set or using cross-validation. (If you do not understand this, that's all right, just follow step 2.)

  • Finally, go and build and test your model.

    3a. If supervised, use the training set to train your supervised model. Then apply the model to the test set and see how well it performs.

    3b. If unsupervised, use the training set to create clusters of documents (i.e. group similar documents); a sketch follows this list. The clusters still will not have labels, so you need to think of some smart way to label the groups of documents properly. (To date there is no really good solution for this; even super-effective neural networks cannot tell you what their neurons fire on, only that each neuron fires on something specific.)
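To make steps 2 and 3b concrete, here is a minimal scikit-learn sketch (the documents are made-up placeholders): a simple train/test split followed by k-means clustering of the training documents.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # Placeholder corpus; replace with your scraped tweets / FB comments.
    documents = ["gold pendant for sale", "fresh steak and red wine",
                 "new running shoes", "silver medallion necklace",
                 "cheap sneakers online", "soda and snacks for the party"]

    # Step 2: hold out a test set (here 30%).
    train_docs, test_docs = train_test_split(documents, test_size=0.3,
                                             random_state=42)

    # Step 3b (unsupervised): cluster the training documents.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)
    kmeans = KMeans(n_clusters=3, random_state=42).fit(X_train)

    # One cluster id per training document; naming the clusters is up to you.
    print(list(zip(train_docs, kmeans.labels_)))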

+2
