This answer may be a little long and I may leave out a few details, but it should give you an idea and some advice.
Supervised vs. Unsupervised
As already mentioned, in machine learning there are 2 main roads: supervised and unsupervised learning. As you probably already know, if your instances (documents) are labeled, you are talking about supervised learning. The labels are your categories and in this case they are boolean values. For example, if a text refers to clothing and shoes, the labels for those two categories should be true.
Since a text can be associated with several categories (several labels), you are looking at multi-label classification.
What to use?
I assume the dataset is not labeled yet, since Twitter does not do this categorization for you. So here comes a big decision on your part:
- You label the data manually, which means going through as many tweets / FB messages in your dataset as you can and, for each of them, deciding for each of the 5 categories whether it applies (True / False).
- You decide to use an unsupervised learning algorithm and hope it discovers these 5 categories. Approaches such as clustering will simply try to find categories on their own, and these will not necessarily match your 5 predefined ones.
In the past I have mostly used supervised learning and had good experience with it, so that is the path I will continue to explain.
Feature Engineering
You need to come up with the features you want to use. For classifying text, a good approach is to use every word that occurs in the documents as a feature: true means the word is present in the document, false means it is not.
Before doing this, you need to pre-process the text. This can be done with various modules provided by the NLTK library (a short sketch follows this list):
- Tokenization: this breaks your text into a list of words. You can use NLTK's tokenize module for this.
- Stop word removal: this removes common words from the tokens, words like 'a', 'the', ... Take a look at NLTK's stopwords corpus.
- Stemming: this transforms words into their base form. For example, the words "worked", "working" and "works" are all converted to "work". Take a look at NLTK's stemmers, e.g. the Porter stemmer.
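A minimal sketch of these three steps with NLTK; it assumes the tokenizer and stopword resources have already been fetched once via nltk.download():

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenization: break the raw text into a list of words
    tokens = nltk.word_tokenize(text.lower())
    # Stop word removal: drop punctuation and very common words ('a', 'the', ...)
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stemming: reduce 'worked', 'working', 'works' to 'work'
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Just bought new running shoes and a jacket!"))
```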
Now that you have pre-processed the data, generate a feature for every word that occurs anywhere in the documents. There are automatic methods and filters for this, but I'm not sure how to do it in Python.
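One possible way to build such boolean feature sets by hand, building on the `preprocess` helper sketched above (the example texts and labels are invented for illustration):

```python
# Build a boolean "word is present" feature dict per document, in the
# format that NLTK classifiers expect. A sketch on invented example data.
documents = [
    ("New sneakers and a warm jacket arrived today", {"clothing": True, "shoes": True}),
    ("Great pasta recipe for the weekend", {"clothing": False, "shoes": False}),
]

processed = [(preprocess(text), labels) for text, labels in documents]

# The vocabulary is every (stemmed) word that occurs anywhere in the corpus.
vocabulary = {word for tokens, _ in processed for word in tokens}

def document_features(tokens):
    token_set = set(tokens)
    # True if the word is present in this document, False otherwise
    return {word: (word in token_set) for word in vocabulary}

feature_sets = [(document_features(tokens), labels) for tokens, labels in processed]
```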
Classification
There are several classifiers you can use for this purpose. I suggest taking a closer look at the available ones and their strengths. You could use an NLTK classifier, which supports multi-label classification, but to be honest I have never tried that before. I have used Logistic Regression and SVMs in the past.
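A hedged sketch of how this could look with scikit-learn, fitting one Logistic Regression per category in a one-vs-rest setup; the five category names and the tiny labelled dataset below are invented, and with an SVM the structure would be the same:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

categories = ["clothing", "shoes", "food", "travel", "sports"]

texts = [
    "just bought new sneakers and a jacket",
    "best pasta place in town",
    "hiking boots for my trip to the alps",
    "watching the football game tonight",
]
# One row per text, one 0/1 column per category (multi-label targets).
labels = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
])

model = make_pipeline(
    CountVectorizer(binary=True),                # word-presence features
    OneVsRestClassifier(LogisticRegression()),   # one binary classifier per category
)
model.fit(texts, labels)
```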
Training and testing
You will use part of your data for training and part for verifying that the trained model works correctly. I suggest you use cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The advantage of cross-validation is that you do not have to split your dataset into one fixed training set and one fixed test set. Instead, it works in several rounds, each round using part of the data for training and the rest for testing. As a result, every data point is used for testing exactly once and for training in all the other rounds.
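A small sketch, reusing the `model`, `texts` and `labels` from the classification example above. The toy set only allows 2 folds and some categories have no positive examples in a fold (scikit-learn will warn about this); on a real, manually labelled dataset you would typically use 5 or 10 folds:

```python
from sklearn.model_selection import cross_val_score

# Each round trains on one part of the data and tests on the rest,
# so every example ends up in a test fold exactly once.
scores = cross_val_score(model, texts, labels, cv=2)
print("accuracy per fold:", scores)
```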
Prediction
Once your model is built and the predictions on the "test data" look plausible, you can use the model in the wild to predict the categories of new Facebook posts / tweets.
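For example, continuing the sketch above (the new posts are again invented):

```python
# Predict category labels for new, unseen posts with the trained pipeline.
new_posts = [
    "loving my new running shoes",
    "trying a new ramen recipe tonight",
]
predictions = model.predict(new_posts)   # one 0/1 row of length 5 per post

for post, row in zip(new_posts, predictions):
    tags = [cat for cat, flag in zip(categories, row) if flag]
    print(post, "->", tags)
```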
Tools
The NLTK library is great for pre-processing and natural language processing, but I have never used it for classification. I have heard a lot of good things about the Python library scikit-learn. But honestly, I prefer to use Weka, which is a Java data mining tool that offers a great graphical interface and speeds up your task!
From a Different Angle: Topic Modeling
In your question you state that you want to classify the dataset into five categories. I would like to point you to the idea of topic modeling. It may not be useful in your scenario if you really only care about these five categories (which is why I leave this part for the end of my answer). However, if your goal is to classify tweets / FB messages into categories that are not predefined, topic modeling is the way to go.
Topic modeling is an unsupervised learning method in which you decide up front how many topics (categories) you want to "discover". This number can be fairly high (e.g. 40). The nice thing is that the algorithm will then find 40 topics, each containing words that are related to each other. It also outputs, for each document, a distribution indicating which topics the document is about. This way you can discover many more categories than your 5 predefined ones.
I'm not going to go deeper into this here, just Google it if you want more information. Also, you might consider using MALLET, which is a great tool for topic modeling.
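MALLET is a Java tool; just to illustrate the idea in Python, here is a rough LDA sketch with scikit-learn (the documents and the number of topics are invented for the example):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "new sneakers and jackets in the shop",
    "pasta and pizza recipes for dinner",
    "cheap flights and hotels for the summer",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# n_components is the number of topics you fix in advance (e.g. 40 on real data).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)      # per-document topic distribution
print(doc_topics)

# The highest-weighted words per topic hint at what each discovered topic is about.
words = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([words[i] for i in topic.argsort()[-5:]])
```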