How can I convert what I have into this format?
This is how I would do it. I would use a script to compute the word counts for each mail in the training set, then use a second script to convert that data into the LIBSVM format you showed earlier. (This can be done in many ways, but it is easiest in a language with simple I/O, such as Python.) I would put all the “good mail” data in one file and label this class “1”, then repeat the process for the “spam mail” data and label that class “-1”. As nologin said, LIBSVM requires the class label to precede the features, but the feature indices can be any numbers as long as they are in ascending order. For example, 2:5 3:6 5:9 is allowed, but 3:23 1:3 7:343 is not.
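A minimal sketch of the conversion step, assuming you already have a vocabulary that maps each word to a 1-based feature index. The names `word_counts`, `to_libsvm_line`, and `vocab` are made up for illustration and are not part of LIBSVM:

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in one mail."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_libsvm_line(label, text, vocab):
    """Format one mail as '<label> <index>:<count> ...'.

    vocab is a hypothetical dict mapping word -> feature index (starting at 1).
    Words missing from vocab are skipped, which keeps the line sparse.
    """
    counts = word_counts(text)
    # Sort by feature index, because the indices must be ascending.
    pairs = sorted((vocab[w], c) for w, c in counts.items() if w in vocab)
    features = " ".join(f"{index}:{count}" for index, count in pairs)
    return f"{label} {features}".rstrip()


if __name__ == "__main__":
    vocab = {"free": 1, "money": 2, "meeting": 3}
    print(to_libsvm_line(-1, "Free money, free!", vocab))
    print(to_libsvm_line(1, "Meeting at noon", vocab))
```

Writing one such line per mail to `file_good` (label 1) and `file_spam` (label -1) gives you the two files used in the steps that follow.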
If you are worried that your data is not in the correct format, run the script that ships with LIBSVM,
checkdata.py
before training; it should report any errors it finds.
Once you have two separate data files in the correct format, you can call
cat file_good file_spam > file_training
to create a training file containing the data for both good and spam mail. Then follow the same process for the test set. One practical advantage of generating the data this way is that you know the first 700 (or 300) mails in the training (or test) set are good mail and the rest are spam. This makes it easy to write other scripts that work on the data, such as precision/recall code.
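As a rough sketch of such precision/recall code, assuming the predicted labels are already in a Python list in file order and that spam (-1) is the class you care about. The function name and parameters are illustrative only:

```python
def precision_recall(predictions, n_good, positive=-1):
    """Precision and recall for the `positive` class.

    Relies on the file layout described above: the first n_good entries
    are good mail (label 1) and all remaining entries are spam (label -1).
    """
    truth = [1] * n_good + [-1] * (len(predictions) - n_good)
    pairs = list(zip(predictions, truth))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy example: 3 good mails followed by 3 spam mails.
    predicted = [1, -1, 1, -1, -1, 1]
    print(precision_recall(predicted, n_good=3))
```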
If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should answer many of them, as should the various README files that come with the installation. (I personally found the READMEs in the Tools and Python directories to be a great help.) Unfortunately, the FAQ says little about the point nologin made, that the data is in a sparse format.
As a final note, I doubt that you need to keep a count of every possible word that may appear in a mail. I would recommend counting only the most common words that you expect to appear in spam mail. Other potential features include total word count, average word length, average sentence length, and any other measurements you think might be useful.
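For what it's worth, the extra features mentioned above could be computed along these lines. This is only a sketch: `extra_features` is a made-up helper, and splitting sentences on `.`, `!` and `?` is a crude assumption that will miscount abbreviations and the like:

```python
import re


def extra_features(text):
    """Return [total word count, average word length, average sentence length]."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on ., ! and ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    avg_word_len = sum(len(w) for w in words) / total if total else 0.0
    avg_sentence_len = total / len(sentences) if sentences else 0.0
    return [total, avg_word_len, avg_sentence_len]


if __name__ == "__main__":
    print(extra_features("Buy now. Win big cash!"))
```

These values would simply take the next free feature indices after your word counts in each LIBSVM line.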