Spam Filtering Efficiency in Naive Bayesian Filters

How effective is naive Bayesian filtering at catching spam?

I have heard that spammers easily get around these filters by padding their messages with extra words unrelated to spam. What programming techniques can be used with Bayesian filters to prevent this?

+7
spam-prevention bayesian
4 answers

Paul Graham was the person who really brought the idea of Bayesian spam filtering to the Internet at large, with his original article A Plan for Spam, back in August 2002. His follow-up a year later covered many of the problems that quickly arose. It is still an active area of work.

In the second article, Graham mentions CRM114, which works on a much wider range of patterns than just delimited words. CRM114 is cool, but on its own it is not a complete spam filtering system; it needs a helper implementation around it.

There are open source tools for Bayesian spam filtering, such as Death2Spam and SpamProbe.

Personally, I haven't found anything quite like filtering mail through a Gmail account. Happy hunting.

+7

I think that to defeat the kind of spam attack you mention, what matters is not the training method but which features you train on. I use Fidelis Assis's OSBF-Lua, which is a very successful filter: it keeps winning spam-filter competitions. It uses Bayesian training, but I believe three principles are the real reason for its success:

  • It trains not on single words but on sparse bigrams: pairs of words separated by 0 to 4 "don't care" words. Spammers have to put their message in there somewhere, and sparse bigrams are very good at sniffing it out. It even catches image spam!

  • It does extra training on message headers, because those are hard for spammers to disguise. Example: a message that originates on your network and never passes through a host outside the network is probably not spam.

  • If the filter has low confidence in its classification, it asks a human for input. (In practice, it adds a header field saying "please train me on this message"; the person can ignore the request.) This means that as spammers adopt new techniques, your filter evolves to match.
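To make the first principle concrete, here is a minimal Python sketch of sparse-bigram feature extraction. It is not OSBF-Lua's actual tokenizer: the function name `sparse_bigrams`, the whitespace tokenization, and the choice to store the gap size inside each feature are all my assumptions for illustration.

```python
def sparse_bigrams(text, max_gap=4):
    """Return (first_word, gap, second_word) features.

    `gap` is the number of skipped "don't care" words between the pair,
    from 0 (adjacent words) up to max_gap. Tokenization is a plain
    whitespace split, which is an assumption made for this sketch.
    """
    words = text.lower().split()
    features = []
    for i, first in enumerate(words):
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(words):
                break  # ran off the end of the message
            features.append((first, gap, words[j]))
    return features


payload = "buy cheap pills now"
padded = payload + " weather garden holiday"

# Every feature of the spam payload is still present after padding.
payload_features = set(sparse_bigrams(payload))
padded_features = set(sparse_bigrams(padded))
survives_padding = payload_features <= padded_features
```

The point of the sketch is that unrelated padding words add new features but cannot remove the pairs formed inside the actual sales pitch, so the filter still has strong evidence to train on.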

This combination of methods is extremely effective.
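The third principle, asking a human when confidence is low, could be sketched roughly like this in Python. The thresholds and the header names `X-Filter-Label` and `X-Filter-Train-Request` are hypothetical choices of mine, not what OSBF-Lua actually emits.

```python
def classify(spam_probability, low=0.2, high=0.8):
    """Return (label, wants_training) for a spam probability in [0, 1].

    Scores strictly between `low` and `high` are treated as uncertain.
    Both thresholds are illustrative assumptions.
    """
    if spam_probability >= high:
        return "spam", False
    if spam_probability <= low:
        return "ham", False
    label = "spam" if spam_probability >= 0.5 else "ham"
    return label, True


def tag_message(headers, spam_probability):
    """Return a copy of `headers` with the filter's verdict added."""
    label, wants_training = classify(spam_probability)
    tagged = dict(headers)
    tagged["X-Filter-Label"] = label  # hypothetical header name
    if wants_training:
        # Hypothetical header; the reader is free to ignore it.
        tagged["X-Filter-Train-Request"] = "please train me on this message"
    return tagged


tagged = tag_message({"Subject": "Re: invoice"}, 0.55)
```

Messages the filter is sure about pass through untouched, while borderline ones carry a polite request, so human effort goes only to the cases where a correction teaches the filter something new.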

Disclaimer: I worked with Fidelis to refactor some of the software so that it could be used for other purposes, such as sorting regular mail into groups or, perhaps one fine day, detecting spam in blog comments and other places.

+5

You are right, naive Bayesian filters are susceptible to Bayesian poisoning.
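As a rough illustration of how poisoning works, here is a toy single-word naive Bayes scorer in Python. The four-message corpus, the function names, and the add-one smoothing are assumptions made for this sketch; a real filter trains on far more mail and usually weights tokens more carefully.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_log_odds(text, word_counts, class_counts):
    """Log of P(spam | text) / P(ham | text); positive means spam."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    score = math.log(class_counts[True] / class_counts[False])
    for word in text.lower().split():
        # Add-one (Laplace) smoothing so unseen words do not zero out a class.
        p_spam = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_ham = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


corpus = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the project team", False),
]
model = train(corpus)

plain = "buy cheap pills"
padded = plain + " meeting agenda monday project team lunch"

plain_score = spam_log_odds(plain, *model)    # positive: looks like spam
padded_score = spam_log_odds(padded, *model)  # negative: padding tipped it to ham
```

Because each word contributes independently, a handful of innocent-looking words is enough to cancel out the evidence from the pitch itself in this toy setup, which is exactly the weakness the question describes.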

+1

I use Popfile not only to filter out spam but also to sort my email into categories, and I find it extremely effective. It uses a naive Bayesian filter.

+1