Text classification using Java

I need to classify a text or word into a specific category. For example, the text “Pink Floyd” should be classified as “music”, “Wikimedia” as “technology”, and “Einstein” as “science”.

How can I do that? Is there a way I can use DBpedia for this? If not, do I need to train a database from time to time?

+4
5 answers

This is a text classification problem. The information retrieval textbook by Manning, Raghavan and Schütze is a great introduction. I think you do not need DBpedia or NER for this, just a small training dataset with enough labeled examples for all your classes.
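
To make the "small training dataset" idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Java. Naive Bayes is only one possible choice of classifier, not something the answer prescribes; the class name `NaiveBayesSketch` and the three toy training sentences are assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes with add-one smoothing (hypothetical class). */
final class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases the text and splits on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Adds one labeled example to the model. */
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior: share of training examples that carry this label
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // add-one smoothing so unseen words do not zero out the score
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Toy training data, invented for this example.
        nb.train("music", "Pink Floyd rock band album guitar");
        nb.train("technology", "Wikimedia wiki software website server");
        nb.train("science", "Einstein physics relativity theory physicist");
        System.out.println(nb.classify("Pink Floyd album"));
    }
}
```

A classifier like this only knows words it saw during training, so single-word inputs work only if the training set covers them; that is the trade-off against a knowledge base such as DBpedia.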

+3

Yes, DBpedia may be a good choice for this kind of problem. You would need to:

  • align the DBpedia category structure to get the right granularity (for example, Pink Floyd is listed under Capitol Records artists and many other categories, but not under Music). Perhaps select a few broad categories and try to determine whether your concepts fall under them indirectly;
  • normalize the text: Einstein is listed as Albert Einstein, not einstein;
  • deal with ambiguity, since a term can describe several concepts and a concept can belong to several top-level categories.
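
The three steps above can be sketched roughly as follows. Everything here is toy data held in memory: the alias table, the parent links, and the intermediate category names are hypothetical stand-ins for whatever you extract from DBpedia (only "Capitol Records artists" comes from the example above). The lookup returns a set rather than a single value because, as noted, a concept may reach several top-level categories.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch: normalize a term, then walk up a category graph (toy data). */
final class CategoryAlignmentSketch {
    // Hypothetical alias table mapping lower-cased surface forms to canonical labels.
    static final Map<String, String> ALIASES = Map.of(
            "einstein", "Albert Einstein",
            "albert einstein", "Albert Einstein",
            "pink floyd", "Pink Floyd");

    // Hypothetical "is listed under" links; a real graph would come from DBpedia.
    static final Map<String, List<String>> PARENTS = Map.of(
            "Pink Floyd", List.of("Capitol Records artists"),
            "Capitol Records artists", List.of("Musicians by record label"),
            "Musicians by record label", List.of("Music"),
            "Albert Einstein", List.of("Theoretical physicists"),
            "Theoretical physicists", List.of("Physics"),
            "Physics", List.of("Science"));

    // The few broad categories chosen as classification targets.
    static final Set<String> TOP_LEVEL = Set.of("Music", "Science", "Technology");

    /** Returns the canonical label for a surface form, or null if it is unknown. */
    static String normalize(String surfaceForm) {
        String key = surfaceForm.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return ALIASES.get(key);
    }

    /** Breadth-first walk up the parent links, collecting every top-level category reached. */
    static Set<String> topLevelCategories(String surfaceForm) {
        Set<String> found = new LinkedHashSet<>();
        String start = normalize(surfaceForm);
        if (start == null) return found;
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>(); // guards against cycles in the category graph
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (TOP_LEVEL.contains(node)) {
                found.add(node);
                continue;
            }
            for (String parent : PARENTS.getOrDefault(node, List.of())) {
                if (seen.add(parent)) queue.add(parent);
            }
        }
        return found;
    }
}
```

With this toy data, `topLevelCategories("einstein")` walks Albert Einstein, Theoretical physicists, Physics and stops at Science. On the real category graph you would probably also want a depth limit, since long paths tend to end up in unrelated top-level categories.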

These problems can be addressed with machine learning, but I can only see how that would work if you extract the terms together with their features from running text. In that case, though, you could simply classify the whole text into one of the categories you selected in the first step.

+3

This is a well-studied problem known as named entity recognition. Unless you have a special need to roll your own technology (hint: this is a difficult problem in general), using GATE, or possibly one of the online services based on it (for example, TSO Data Enrichment Service), would be a good option. An alternative online service is OpenCalais.

+1

  • Map your categories to DBpedia categories.
  • Index the selected DBpedia categories with Lucene, tagging them with your category names.
  • Search the index with your data; tokenization and normalization will be handled by Lucene.

This approach is somewhat related to kNN classification.
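
To show the idea without depending on a particular Lucene version, here is a plain-Java stand-in: a list of tagged documents scored by token overlap, with the top k hits voting on the category. This is not the Lucene API; in a real setup an `IndexWriter` and `IndexSearcher` would replace the list and the scoring loop. The class name `TaggedIndexSketch` and the indexed snippets are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative stand-in for a Lucene index of DBpedia text tagged with your own categories. */
final class TaggedIndexSketch {
    private static final class Doc {
        final Set<String> tokens;
        final String tag;

        Doc(Set<String> tokens, String tag) {
            this.tokens = tokens;
            this.tag = tag;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Crude tokenization and normalization; Lucene's analyzers would do this for you. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) result.add(token);
        }
        return result;
    }

    /** Stores a piece of DBpedia-derived text under one of your own category names. */
    void index(String dbpediaText, String myCategory) {
        docs.add(new Doc(tokens(dbpediaText), myCategory));
    }

    /** Scores documents by token overlap and lets the k best hits vote; null if nothing matches. */
    String classify(String query, int k) {
        Set<String> queryTokens = tokens(query);
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        for (Doc doc : docs) {
            int overlap = 0;
            for (String token : queryTokens) {
                if (doc.tokens.contains(token)) overlap++;
            }
            if (overlap > 0) hits.add(Map.entry(doc.tag, overlap));
        }
        hits.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, Integer> hit : hits.subList(0, Math.min(k, hits.size()))) {
            votes.merge(hit.getKey(), hit.getValue(), Integer::sum);
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> vote : votes.entrySet()) {
            if (vote.getValue() > bestVotes) {
                bestVotes = vote.getValue();
                best = vote.getKey();
            }
        }
        return best;
    }
}
```

The voting over the k nearest hits is what makes this resemble kNN: the indexed documents play the role of labeled neighbours, and the search score plays the role of the distance.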

+1

Yes, DBpedia is a good choice for classifying text, because you can use its predicates / relationships to query and extract meaningful information for a specific category.

You can send queries to the DBpedia SPARQL endpoint: http://dbpedia.org/sparql

To write them, learn the basic SPARQL syntax from the specification: http://www.w3.org/TR/rdf-sparql-query/
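
As a rough sketch of what such a query could look like from Java, the code below asks the endpoint for the categories of one resource. It assumes that DBpedia links resources to categories through `dct:subject`, that the resource name is already in DBpedia form (for example `Pink_Floyd`, with underscores and no characters that need escaping), and that Java 11 or later is available for `java.net.http`. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch: build and send a category lookup to the DBpedia SPARQL endpoint. */
final class DbpediaCategoryQuerySketch {
    static final String ENDPOINT = "http://dbpedia.org/sparql";

    /** Builds a SELECT query for the categories of one resource (name assumed pre-normalized). */
    static String categoryQuery(String resourceName) {
        return "PREFIX dct: <http://purl.org/dc/terms/>\n"
                + "SELECT ?category WHERE {\n"
                + "  <http://dbpedia.org/resource/" + resourceName + "> dct:subject ?category .\n"
                + "}";
    }

    /** Puts the query into the "query" parameter defined by the SPARQL protocol. */
    static URI requestUri(String sparql) {
        return URI.create(ENDPOINT + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    /** Sends the query and returns the raw response body (JSON results are requested). */
    static String fetch(String sparql) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(requestUri(sparql))
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Requires network access to the public endpoint.
        System.out.println(fetch(categoryQuery("Pink_Floyd")));
    }
}
```

The categories that come back are fine-grained, so you would still need to map them onto your own labels such as "music" before using them for classification.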

+1
