I am trying to do document classification using the Weka Java API.
Here is my data directory structure:
+- text_example
   |
   +- class1
   |     3 html files
   |
   +- class2
   |     1 html file
   |
   +- class3
         3 html files
I created an ARFF file using TextDirectoryLoader. Then I applied the StringToWordVector filter to the generated ARFF data, with filter.setOutputWordCounts(true).
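For reference, here is roughly what I am doing (a minimal sketch of the steps above; the class name `LoadAndVectorize` is just what I called it, and `text_example` is the directory from the tree above):

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class LoadAndVectorize {

    // Load the directory tree (each sub-directory name becomes a class label)
    // and turn the document strings into word-count attributes.
    static Instances vectorize(File dir) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(dir);
        Instances raw = loader.getDataSet();

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);   // word counts instead of 0/1 presence
        filter.setInputFormat(raw);
        return Filter.useFilter(raw, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances vectorized = vectorize(new File("text_example"));
        System.out.println(vectorized.numAttributes() + " attributes, "
                + vectorized.numInstances() + " instances");
    }
}
```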
Below is a sample output after applying the filter. I need to clarify a few things.
@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric
This huge list should be the tokenization of the contents of the source html files, shouldn't it?

Then I have:
@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........
Why do the first 3 instances (the class1 documents) have no value at index 0, while the later instances start with {0 class2,..} or {0 class3,..}?
Also, in the 3rd html file of class3, the attribute with index 32 has the value 5, which I take to mean the word with index 32 occurs 5 times in that file. Is that right, and how do I find out which word (attribute) the index 32 stands for?
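My guess is that the index can be mapped back to the word with Instances.attribute(index).name(); a sketch of what I mean ("vectorized.arff" is just the name I saved the filtered data under):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LookupWord {
    public static void main(String[] args) throws Exception {
        // Load the filtered data back in
        Instances data = DataSource.read("vectorized.arff");
        // Print the word behind attribute index 32 from my example above
        System.out.println("index 32 = " + data.attribute(32).name());
    }
}
```

Is that the intended way, or is there a better one?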
How can I reduce the dimensionality of these vectors? (Say I only want about 100 attributes: can I keep just the 100 most frequent words, and if so, how are those 100 chosen?)
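From the Javadoc, StringToWordVector seems to have a words-to-keep option; is this the right knob for what I am asking, and does it pick the most frequent words?

```java
import weka.filters.unsupervised.attribute.StringToWordVector;

public class KeepFewerWords {
    public static void main(String[] args) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);
        // Try to keep only ~100 words (per class, if a class attribute is set?)
        filter.setWordsToKeep(100);
        System.out.println("words to keep: " + filter.getWordsToKeep());
    }
}
```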
Is this the right approach overall? I am new to Weka.
Finally, how do I train a classifier on the vectors produced by StringToWordVector? (Or is there a better, standard way to do document classification with Weka?)
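What I think the training step after StringToWordVector would look like (a sketch only; I am assuming FilteredClassifier with NaiveBayesMultinomial here, and I have not verified this is the recommended way):

```java
import java.io.File;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TrainSketch {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("text_example"));
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1); // class is the last attribute here

        // FilteredClassifier applies StringToWordVector before training and
        // prediction, so train and test data are vectorized consistently
        FilteredClassifier fc = new FilteredClassifier();
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);
        fc.setFilter(filter);
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(raw);

        // 3-fold cross-validation (only 7 documents in my toy example)
        Evaluation eval = new Evaluation(raw);
        eval.crossValidateModel(fc, raw, 3, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```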