Text categorization classifiers

Does anyone know some good open source text categorization models? I know about the Stanford classifier, Weka, Mallet, etc., but they all require training.

I need to categorize news articles into Sports / Politics / Health / Gaming / etc. Are there any pre-trained models?

Alchemy, OpenCalais, etc. are not options. I need open source tools (preferably in Java).

+6
4 answers

Using a pre-trained model assumes that the corpus used for training belongs to the same domain as the documents you are trying to classify. As a rule, this will not give you the results you want, because you do not have the original corpus. Machine learning is not static: once you train a classifier, you need to update the model as new features or information appear.

Take your example of classifying news articles into Sports / Politics / Health / Gaming / etc.

What language? Are we only talking about English? How was the original corpus labeled? And the biggest unknown is the set of categories itself.

Training your own classifier is really very simple. If you are classifying text, MALLET is the best choice. You can be up and running in under 10 minutes, and you can add MALLET to your application in less than 1 hour.

If you want to categorize news articles, there are many open source corpora that you can use as a starting point for training. I would start with Reuters-21578 or RCV-1.
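To give a feel for how little is involved in training, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Java. It is not MALLET's API; the class name `TinyNaiveBayes`, the tokenizer, and the toy training sentences are all made up for illustration, and a real setup would train on a labeled corpus rather than three sentences.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes with add-one smoothing (illustrative, not MALLET). */
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    /** Adds one labeled document to the model. */
    public void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability, or null if nothing was trained. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = tokensPerLabel.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("sports", "the team won the match with a late goal");
        nb.train("politics", "parliament passed the election reform bill");
        nb.train("health", "doctors recommend a vaccine against the flu virus");
        System.out.println(nb.classify("the striker scored a goal in the match"));
    }
}
```

The same train-then-classify shape carries over to a real toolkit; what changes is the quality of the tokenizer, the features, and above all the size of the training corpus.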

+5

Depending on your needs, there are many classifiers. First, I think you should narrow down what you want to do with the classifier.

Training is one of the steps of classification, so I don't think you will find many pre-trained classifiers. Besides, training is almost always the easy part of classification.

That said, there are many resources you can look at. I can't claim this as my own, but here is one example:

Weka is a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO) [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting key phrases from text documents.

Apache Lucene Mahout is an incubator project for creating scalable distributed implementations of common machine learning algorithms on top of the Hadoop MapReduce framework.

Source: http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

+2

What you mean by classification is very important.

Classification is a supervised task that requires a corpus labeled in advance. Starting from that labeled corpus, you build a model using one of several methods and approaches, and then you use the model to classify an unlabeled test corpus. If this is what you mean, you can use a multiclass classifier, which is usually built as a binary tree of binary classifier applications. The state-of-the-art approach for this kind of task is a machine learning method, SVM. Two of the best SVM classifiers are LibSVM and SVMlight. They are open source, easy to use, and include multiclass classification tools. Finally, you should survey the literature to understand what else to do to get good results, because using these classifiers alone is not enough. You need to manipulate / pre-process your corpus to extract the parts that carry information (e.g. unigrams) and exclude the noisy parts. In general, you most likely have a long way to go, but NLP is a very interesting topic and worth the work.
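As a rough illustration of the pre-processing step described above (keeping unigrams, dropping noise), here is a small Java sketch that turns raw text into unigram counts. The class name `UnigramFeatures`, the tiny stop-word list, and the "drop one-letter tokens" rule are assumptions made for the example; real pipelines usually use a much longer stop list and often stemming as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Extracts unigram counts from text, skipping a few stop words (illustrative only). */
final class UnigramFeatures {
    // Hypothetical, deliberately tiny stop-word list.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "in", "and", "to", "is"));

    /** Returns token -> count, in order of first appearance. */
    public static Map<String, Integer> extract(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Lowercase, then split on anything that is not a letter.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            // Skip empty strings, single letters, and stop words as "noise".
            if (token.length() < 2 || STOP_WORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `UnigramFeatures.extract("The goal of the match: a late GOAL in the final minute!")` yields counts for `goal`, `match`, `late`, `final`, and `minute`, which is the kind of sparse feature vector you would then hand to an SVM package.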

However, if by classification you actually mean clustering, the problem is harder. Clustering is an unsupervised task, which means you give the algorithm no information about which example belongs to which group / topic / class. There are also academic papers on hybrid semi-supervised approaches, but they diverge slightly from the real purpose of the clustering problem. The pre-processing you should apply to your corpus is similar in nature to what you would do for classification, so I will not repeat it. Clustering can be approached in several ways. First, you can use LDA (Latent Dirichlet Allocation) to reduce the dimensionality (the number of dimensions of the feature space) of your corpus, which makes the features more efficient and more informative. Alongside or after LDA, you can use hierarchical clustering or similar methods, such as K-Means, to cluster your unlabeled corpus. You can use Gensim or Scikit-Learn as open source tools for clustering. Both are powerful, well-documented, and easy to use.

In all cases, do a lot of academic reading and try to understand the theory behind these tasks and problems. That way you can come up with innovative and effective solutions for the specific problem you are dealing with, because problems in NLP usually depend on your corpus, and you are generally on your own when dealing with your specific problem. It is very difficult to find generic, ready-to-use solutions, and I do not recommend relying on that option.

I hope this answers your question; sorry for the irrelevant details.
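For the clustering case, a bare-bones k-means loop looks roughly like the Java sketch below. It is not how Gensim or Scikit-Learn implement it; the class name `TinyKMeans` is made up, each document is assumed to already be a dense numeric vector (for example term counts or LDA topic weights), and the initial centroids are simply the first k points, which is a simplification over random or k-means++ initialization.

```java
/** Plain k-means on dense vectors with squared Euclidean distance (illustrative only). */
final class TinyKMeans {
    /**
     * Returns the cluster index of each point.
     * Assumes points.length >= k and that all points have the same dimension.
     */
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;
        // Simplified initialization: take the first k points as centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        java.util.Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int d = 0; d < dims; d++) {
                        double diff = points[i][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no point moved to another cluster

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] sizes = new int[k];
            for (int i = 0; i < points.length; i++) {
                sizes[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (sizes[c] == 0) continue; // keep an empty cluster's centroid where it is
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / sizes[c];
                }
            }
        }
        return assignment;
    }
}
```

With two-dimensional toy vectors such as counts of two hypothetical terms per document, `TinyKMeans.cluster(points, 2, 100)` separates the documents dominated by the first term from those dominated by the second. On real text the vectors are high-dimensional and sparse, which is why reducing dimensionality first tends to help.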

Good luck =)

+2

There is a long list of pre-trained models for OpenNLP:

http://opennlp.sourceforge.net/models-1.5/

0
