Interpreting StringToWordVector() output - Weka

I am trying to do document classification using the Weka Java API.

Here is my data file directory structure.

+- text_example
|
+- class1
|  |
|  3 html files
|
+- class2
|   |
|   1 html file
|
+- class3
    |
    3 html files

I have an 'arff' file created using 'TextDirectoryLoader'. I then apply the StringToWordVector filter to the generated arff file, with filter.setOutputWordCounts(true).

Below is a sample of the output after applying the filter. I need a few things clarified.

@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric

This huge list should be the tokenization of the contents of the source html files. Is that right?

Then I have:

@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........

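The first 3 items have no class attribute (it should be class1). What does the leading 0 mean, as in {0 class2,..}, {0 class3..}? It says, for instance, that in the 3rd html file in the class3 folder, the word identified by the integer 32 appears 5 times. So, just to check, how do I get the word (token) referred to by 32?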

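How do I reduce the dimensionality of the feature vector? Don't all feature vectors need to be the same size? (For example, consider only the 100 most frequent terms from the training set and, later, when testing, count only those 100 terms in the test documents. Otherwise, if a completely new word shows up during testing, will the classifier just ignore it?)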

Am I missing something here? I am new to Weka.

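Also, I would really appreciate it if someone could explain how the classifier uses the vector created by StringToWordVector. (For example, building the vocabulary from the training data and reducing dimensionality: does that happen inside the Weka code?)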

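  • The huge @attribute list contains all the tokens derived from your input.
  • The @data section is in sparse format, that is, for each attribute the value is only stated if it differs from zero. For the first three lines the class attribute is class1, you just cannot see it (if it were unknown, you would see 0 ? at the start of those lines). Why? Weka internally represents nominal attributes (which includes classes) as doubles and starts counting at zero. So your three classes are, internally: class1 = 0.0, class2 = 1.0, class3 = 2.0. Since zero values are not stated in the sparse format, the class is not shown in the first three lines. (See also the section "Sparse ARFF files" at http://www.cs.waikato.ac.nz/ml/weka/arff.html)
  • To get the word/token represented by index n, you can either count or, if you have the Instances object, call attribute(n).name() on it. Here n starts at 0.
  • To reduce the dimensionality of the feature vector there are many options. If you only want the 100 most frequent terms, use stringToWordVector.setWordsToKeep(100). Note that this tries to keep 100 words for each class. If you do not want 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You may get slightly more than 100 if several words have the same frequency, so the 100 is more of a target value.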
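  • As for new words appearing in the test phase, I think that cannot happen, because you have to hand all instances to stringToWordVector before classifying. I am not 100% sure about that, though, since I use a two-class setup and let stringToWordVector transform all my instances before the classifier sees any of them.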
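To make the sparse-format point concrete, here is a minimal plain-Java sketch (no Weka dependency) that expands a sparse row into a dense one. The class name SparseRowDemo, the expand helper, and the tiny four-attribute header are made up for illustration; the only thing taken from the question is that index 0 is the nominal class attribute with values class1, class2, class3. It is not how Weka parses ARFF internally, just the rule "an omitted index means zero, and zero means the first nominal label".

```java
import java.util.Arrays;

public class SparseRowDemo {

    /**
     * Expands one sparse ARFF row such as "{1 2,3 2}" into a dense row.
     * nominalValues[i] is null for a numeric attribute, otherwise the
     * declared labels of nominal attribute i (hypothetical header model).
     */
    public static String[] expand(String row, String[][] nominalValues) {
        int n = nominalValues.length;
        String[] dense = new String[n];
        // Default for every omitted index: numeric 0, or the FIRST nominal label,
        // because that label is stored internally as 0.0.
        for (int i = 0; i < n; i++) {
            dense[i] = nominalValues[i] == null ? "0" : nominalValues[i][0];
        }
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip the braces
        if (body.isEmpty()) {
            return dense;
        }
        // Each entry is "<attribute index> <value>".
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+", 2);
            dense[Integer.parseInt(parts[0])] = parts[1];
        }
        return dense;
    }

    public static void main(String[] args) {
        String[][] attrs = {
            {"class1", "class2", "class3"}, // index 0: nominal class attribute
            null, null, null                // indices 1..3: numeric word counts
        };
        System.out.println(Arrays.toString(expand("{1 2,3 2}", attrs)));      // [class1, 2, 0, 2]
        System.out.println(Arrays.toString(expand("{0 class3,1 2}", attrs))); // [class3, 2, 0, 0]
    }
}
```

The first row never mentions index 0, yet it decodes to class1, which is exactly what happens with the first three rows in the question.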

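In general, I recommend experimenting with the Weka KnowledgeFlow tool to learn how the different classes are used. Once you know how to do things there, you can transfer that knowledge to your Java code quite easily. I hope this helps, even though the answer is a bit late.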
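Coming back to the worry about words that first appear at test time: the general idea of a fixed vocabulary can be sketched in plain Java as below. This is only an illustration under my own assumptions (whitespace tokenization, lower-casing, one global top-N list, ties broken alphabetically); FixedVocabularyDemo, buildVocabulary and vectorize are hypothetical names, and Weka's filter behaves differently in the details (for example, it keeps words per class by default and may keep more than N on ties).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FixedVocabularyDemo {

    /** Builds the vocabulary: the `keep` most frequent tokens in the training documents. */
    public static List<String> buildVocabulary(List<String> trainingDocs, int keep) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : trainingDocs) {
            for (String tok : doc.toLowerCase().split("\\s+")) {
                if (!tok.isEmpty()) {
                    counts.merge(tok, 1, Integer::sum);
                }
            }
        }
        List<String> words = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically so the result is deterministic.
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(keep, words.size())));
    }

    /** Counts each vocabulary word in `doc`; tokens outside the vocabulary are dropped. */
    public static int[] vectorize(String doc, List<String> vocabulary) {
        int[] vec = new int[vocabulary.size()];
        for (String tok : doc.toLowerCase().split("\\s+")) {
            int idx = vocabulary.indexOf(tok);
            if (idx >= 0) {
                vec[idx]++;
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("earth is easy", "earth earth moon");
        List<String> vocab = buildVocabulary(train, 2);
        System.out.println(vocab); // [earth, easy]
        // "mars" was never seen in training, so it simply does not show up in the vector.
        System.out.println(Arrays.toString(vectorize("earth mars easy earth", vocab))); // [2, 1]
    }
}
```

Every document, training or test, ends up with a vector of the same length, because the vector is indexed by the vocabulary and not by the words the document happens to contain.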
