How to use libsvm to classify text?

I would like to write a spam filtering program using SVM, and I chose libsvm as the tool.
I have 1000 good emails and 1000 spam emails, which I split into:
700 emails good_train, 700 emails spam_train
300 emails good_test, 300 emails spam_test
Then I wrote a program to count the occurrences of each word in each file, which produced results like:

good_train_1.txt: today 3 hello 7 help 5 ... 

I found out that libsvm needs a format like:

1 1:3 2:1 3:0
2 1:3 2:3 3:1
1 1:7 3:9

I know that the 1, 2, 1 at the start of each line are labels, but what does 1:3 mean?
How can I convert what I have into this format?

2 answers

The format is probably

 classLabel attribute1:count1 ... attributeN:countN 

where N is the total number of distinct words in your text corpus. You will need to check the documentation for the tool you are using (or its source) to see whether it accepts a more permissive, sparse format that omits attributes whose count is 0.
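
A tiny illustration of that mapping (the vocabulary indices here are made up for the example): if "hello" is attribute 1, "help" is attribute 2 and "today" is attribute 3, then the counts from good_train_1.txt become one line. A sketch in Python:

vocab = {"hello": 1, "help": 2, "today": 3}    # word -> attribute index (example only)
counts = {"today": 3, "hello": 7, "help": 5}   # parsed from good_train_1.txt
pairs = sorted((vocab[w], c) for w, c in counts.items())
print("1 " + " ".join("%d:%d" % (i, c) for i, c in pairs))
# prints: 1 1:7 2:5 3:3   (label first, then index:count pairs; zero counts omitted)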

 How can I convert what I have into this format? 

This is how I would do it. I would use a script to compute the word counts for each email in the training set, then use another script to convert that data into the LIBSVM format you quoted above. (This can be done in many ways, but it is probably easiest to write in a language with simple I/O, such as Python.) I would put all the "good mail" data in one file and label that class "1", then do the same with the "spam mail" data and label that class "-1". As nologin said, LIBSVM requires the class label to come first, but the features themselves can use any indices as long as they are in ascending order, e.g. 2:5 3:6 5:9 is allowed, but not 3:23 1:3 7:343.
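
A minimal sketch of what such a conversion script might look like, assuming the per-email count files described in the question (e.g. good_train_1.txt with "today 3 hello 7 help 5 ...") and the output file names used below; this is illustrative, not the only way to do it:

import glob

def read_counts(path):
    """Parse a 'word count word count ...' file into a dict {word: count}."""
    tokens = open(path).read().split()
    return dict((tokens[i], int(tokens[i + 1])) for i in range(0, len(tokens), 2))

good = [read_counts(p) for p in sorted(glob.glob("good_train_*.txt"))]
spam = [read_counts(p) for p in sorted(glob.glob("spam_train_*.txt"))]

# one shared vocabulary so the same word always maps to the same attribute index
all_words = sorted(set(w for d in good + spam for w in d))
vocab = dict((w, i + 1) for i, w in enumerate(all_words))

def write_libsvm(path, label, emails):
    with open(path, "w") as out:
        for counts in emails:
            pairs = sorted((vocab[w], c) for w, c in counts.items())  # ascending indices
            out.write("%d %s\n" % (label, " ".join("%d:%d" % p for p in pairs)))

write_libsvm("file_good", 1, good)    # good mail labelled 1
write_libsvm("file_spam", -1, spam)   # spam mail labelled -1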

If you are concerned that your data is not in the correct format, use their script

 checkdata.py 

before training, and it should report any possible errors.
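
The invocation is just something like (assuming your combined training file ends up being named file_training, as in the example below; the script lives in libsvm's tools directory):

 python checkdata.py file_training 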

Once you have two separate data files in the correct format, you can call

 cat file_good file_spam > file_training 

and create a training file containing data for both good and spam mail. Then follow the same process with the test set. One advantage of generating the data this way is that you know the first 700 (or 300) emails in the training (or test) file are good mail and the rest are spam. This makes it easy to write other scripts that act on the data, such as precision/recall code; see the sketch below.
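
As a rough sketch of that last point, assuming the test file was built the same way (good mail first) and that svm-predict wrote its predictions, one label per line, to a file named file_test.predict (the file name is just an example), precision and recall could be computed with spam (-1) treated as the positive class:

# true labels: first 300 test emails are good (1), the remaining 300 are spam (-1)
true_labels = [1] * 300 + [-1] * 300
predicted = [int(float(line)) for line in open("file_test.predict")]

tp = sum(1 for t, p in zip(true_labels, predicted) if t == -1 and p == -1)  # spam caught
fp = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == -1)   # good mail flagged as spam
fn = sum(1 for t, p in zip(true_labels, predicted) if t == -1 and p == 1)   # spam missed

precision = tp / float(tp + fp) if tp + fp else 0.0
recall = tp / float(tp + fn) if tp + fn else 0.0
print("precision: %.3f  recall: %.3f" % (precision, recall))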

If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should be able to answer a few, as should the various README files that come with the installation. (I personally found the READMEs in the tools and python directories to be a big help.) Unfortunately, the FAQ does not mention what nologin pointed out: that the data should be in a sparse format.

As a final note, I doubt you need to keep track of every possible word that could appear in an email. I would recommend counting only the most common words that you think are likely to appear in spam. Other potential features include total word count, average word length, average sentence length, and any other data you think might be useful.
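
For what it's worth, a quick sketch of how those extra features might be computed (the word list and the regular expressions are arbitrary examples, not a recommendation):

import re

COMMON_SPAM_WORDS = ["free", "offer", "winner", "cash", "click"]  # illustrative only

def extra_features(text):
    """Return [count of each common spam word, total words, avg word length, avg sentence length]."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    avg_word_len = sum(len(w) for w in words) / float(total) if total else 0.0
    avg_sent_len = total / float(len(sentences)) if sentences else 0.0
    spam_counts = [sum(1 for w in words if w.lower() == s) for s in COMMON_SPAM_WORDS]
    return spam_counts + [total, avg_word_len, avg_sent_len]

These values would then be appended to the word-count features, each under its own attribute index, when writing the LIBSVM lines.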
