How do I submit text for classification in Weka?

Can you tell me how to represent an attribute or class in order to classify text in Weka? Which attributes can I classify with: word frequencies, or just the words themselves? What does the ARFF structure look like? Could you give a few example lines of this structure?

Thank you in advance.

2 answers

One of the easiest approaches is to start with an ARFF file for a two-class problem, for example:

  @relation corpus

  @attribute text string
  @attribute class {pos, neg}

  @data
  'long text with words ...', pos

The text is represented as a string attribute, and the class is a nominal attribute with two values.

Then you can apply two filters:

  • StringToWordVector, which converts the texts into a word-vector representation. The filter creates one attribute per word. You can configure options to choose between binary and frequency representations, and to apply stemming or a stop-word list. Which settings perform best depends on the problem. If the texts are not long, a binary representation is usually sufficient.
  • Reorder, which moves the class attribute to the last position, where Weka assumes it to be.
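
To make the difference between the binary and frequency representations concrete, here is a minimal plain-Java sketch. It is not Weka's implementation and does not call the Weka API. The class `WordVectorSketch`, its method names, and the two sample texts are hypothetical. The tokenizer (lower-casing and splitting on non-alphanumeric characters, with no stemming or stop-word handling) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
final class WordVectorSketch {

    // Assumed tokenizer: lower-case, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Give each distinct word one column index, in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> texts) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String text : texts) {
            for (String tok : tokenize(text)) {
                vocab.putIfAbsent(tok, vocab.size());
            }
        }
        return vocab;
    }

    // One value per word: 0/1 presence when binary is true, otherwise the occurrence count.
    static int[] vectorize(String text, Map<String, Integer> vocab, boolean binary) {
        int[] vec = new int[vocab.size()];
        for (String tok : tokenize(text)) {
            Integer col = vocab.get(tok);
            if (col == null) {
                continue; // word was not seen when the vocabulary was built
            }
            vec[col] = binary ? 1 : vec[col] + 1;
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("good good movie", "bad movie");
        Map<String, Integer> vocab = buildVocabulary(texts);
        System.out.println(vocab.keySet());
        for (String text : texts) {
            System.out.println(Arrays.toString(vectorize(text, vocab, false))
                    + " / " + Arrays.toString(vectorize(text, vocab, true)));
        }
    }
}
```

With these two sample texts the vocabulary is good, movie, bad, so the first text becomes 2,1,0 as counts and 1,1,0 in binary form, and the second becomes 0,1,1 in both.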

You can find more information and other approaches to converting your data on this Wiki page: http://weka.wikispaces.com/Text+categorization+with+WEKA


In Weka you can choose your own attributes. In this example there are only 2 classes, and every unique word is used as an attribute. If you choose word frequency as the attribute value, you assign “2” if the word appears twice in the text, “1” if it occurs only once, and “0” if it does not occur.

Here is an example of the .arff format.

  @RELATION anyrelation

  @ATTRIBUTE word1 numeric
  @ATTRIBUTE word2 numeric
  ...
  @ATTRIBUTE wordn numeric
  @ATTRIBUTE class {class1, class2}

  @DATA
  1,2,....,0,class1
  0,3,....,1,class2
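
If you already have word counts per document, a file in this layout can be generated with a few lines of plain Java. The sketch below is only an illustration under stated assumptions: `ArffWriterSketch` and `toArff` are hypothetical names, the sample words and counts are made up, each word attribute is declared `numeric`, and the words are assumed to be simple tokens that need no quoting and do not collide with the attribute name `class`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Weka.
final class ArffWriterSketch {

    // Build ARFF text with one numeric attribute per word and a nominal class attribute last.
    static String toArff(String relation, List<String> words, List<String> classes,
                         List<int[]> counts, List<String> labels) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION ").append(relation).append("\n\n");
        for (String word : words) {
            // Assumes the word is a plain token that needs no quoting.
            sb.append("@ATTRIBUTE ").append(word).append(" numeric\n");
        }
        sb.append("@ATTRIBUTE class {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@DATA\n");
        for (int i = 0; i < counts.size(); i++) {
            for (int count : counts.get(i)) {
                sb.append(count).append(',');
            }
            sb.append(labels.get(i)).append('\n'); // class label is the last column
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: two documents over a three-word vocabulary.
        String arff = toArff("anyrelation",
                Arrays.asList("good", "movie", "bad"),
                Arrays.asList("class1", "class2"),
                Arrays.asList(new int[]{2, 1, 0}, new int[]{0, 1, 1}),
                Arrays.asList("class1", "class2"));
        System.out.print(arff);
    }
}
```

Putting the class attribute last matches what Weka assumes by default, so no reordering is needed when the file is written this way.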