TDD and Bayesian spam filter problem

Question

It is well known that Bayesian classifiers are an effective way to filter spam. They can be fairly concise (ours is only a few hundred LoC), but all the core code has to be written before you get any results at all.

However, the TDD approach mandates that you write only the minimum amount of code needed to pass a test. So, given the following method signature:

bool IsSpam(string text)

And the following string of text, which is clearly spam:

 "Cheap generic viagra"

The minimum amount of code I could write is:

 bool IsSpam(string text) { return text == "Cheap generic viagra"; }

Now maybe I add another test message, for example:

 "Online viagra pharmacy"

I could change the code to:

 bool IsSpam(string text) { return text.Contains("viagra"); }

...and so on. Eventually the code becomes a mess of string checks, regular expressions, and so on, because it was evolved test by test instead of being thought through and written differently from the start.

So how is TDD supposed to work in this type of situation, where evolving the code from the simplest possible implementation that passes the test is the wrong approach? (Particularly when it is known in advance that the best implementations cannot be trivially evolved.)

+4

tdd machine-learning nlp classification

Greg Beech Apr 20 '09 at 8:45

9 answers

Esko Luontola · Answer 1 · 2009-04-20T09:23:41+0000

Start by writing tests for the lower level components of the spam filter algorithm.

First you need to have in mind a rough design of what the algorithm should look like. Then you isolate the core of the algorithm and write tests for it. In the case of a spam filter, that could be calculating some simple probability using Bayes' theorem (I don't know Bayesian classifiers, so I could be wrong). You build it bottom-up, step by step, until finally you have all the parts of the algorithm and can connect them together.
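As a rough illustration of such a low-level component, here is a minimal sketch of a per-word spam probability with a few fast checks around it. The class name WordStats, the whitespace tokenizer, the equal priors, and the add-one smoothing are all assumptions made for this example, not something taken from the question's actual filter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical component: counts how many spam and non-spam messages contain each word.
public class WordStats
{
    private readonly Dictionary<string, int> spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> hamCounts = new Dictionary<string, int>();
    private int spamMessages;
    private int hamMessages;

    public void Train(string text, bool isSpam)
    {
        var counts = isSpam ? spamCounts : hamCounts;
        // Naive tokenizer (an assumption): lower-case, split on spaces, count each word once per message.
        var words = new HashSet<string>(
            text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        foreach (var word in words)
        {
            counts.TryGetValue(word, out var n);
            counts[word] = n + 1;
        }
        if (isSpam) spamMessages++; else hamMessages++;
    }

    // P(spam | word) via Bayes' theorem, assuming equal priors and add-one smoothing.
    public double SpamProbability(string word)
    {
        word = word.ToLowerInvariant();
        spamCounts.TryGetValue(word, out var s);
        hamCounts.TryGetValue(word, out var h);
        double pWordGivenSpam = (s + 1.0) / (spamMessages + 2.0);
        double pWordGivenHam = (h + 1.0) / (hamMessages + 2.0);
        return pWordGivenSpam / (pWordGivenSpam + pWordGivenHam);
    }
}

public static class Program
{
    private static void Check(bool condition, string name)
    {
        Console.WriteLine((condition ? "PASS " : "FAIL ") + name);
    }

    public static void Main()
    {
        var stats = new WordStats();
        stats.Train("Cheap generic viagra", isSpam: true);
        stats.Train("Meeting agenda attached", isSpam: false);

        Check(stats.SpamProbability("viagra") > 0.5, "word seen only in spam leans spam");
        Check(stats.SpamProbability("agenda") < 0.5, "word seen only in ham leans ham");
        Check(Math.Abs(stats.SpamProbability("unseen") - 0.5) < 1e-9, "unknown word is neutral");
    }
}
```

Each check exercises a few lines of arithmetic and needs no large training set, so tests at this level stay fast and can be added one small step at a time.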

It takes a lot of practice to learn which tests to write in which order so that you can do TDD in small enough steps. If you need to write much more than 10 lines of code to pass one new test, you are probably doing something wrong. Start with something smaller or mock some dependencies. It is safer to err on the smaller side, so that the steps are too small and your progress is slow, than to try to take steps that are too big and fail.

The "Cheap generic viagra" example that you have might be better suited for an acceptance test. It will probably also run very slowly, because you first need to initialize the spam filter with sample data, so it will not be useful as a TDD test. TDD tests must be FIRST (F = Fast, as in many hundreds or thousands of tests per second).

Ankur · Answer 2 · 2009-04-20T08:51:24+0000

Here's my take: Test Driven Development means writing tests before coding. It does not mean that every unit of code for which you write a test has to be trivial.

In addition, you still need to plan your software so that it does its job in a smart and efficient way. Just adding more and more lines does not look like the best design for this problem.

In short, you write code for the smallest possible piece of functionality (and test it), but you do not design your algorithm (in pseudocode or however you like to do it) this way.

It would be interesting to see whether you and others agree.

mouviciel · Answer 3 · 2009-04-20T09:32:09+0000

For me, what you call the minimum amount of code to pass the test is the entire IsSpam() function. That is consistent with its size (you say only a few hundred LoC).

Alternatively, the incremental approach does not mean that you code first and think later. You can design a solution, code it, and then refine the design with special cases or a better algorithm.

In any case, refactoring is not only about adding new material on top of the old. For me, it is a more destructive approach, in which you throw away old code for a simple function and replace it with new code for a refined and more complex function.

Lennaert · Answer 4 · 2009-04-20T09:34:47+0000

You have your unit tests, right?

That means you can now refactor the code, or even rewrite it, and use the unit tests to see whether you broke something.

Make it work first, then make it clean - time for the second step :)

Daniel Daranas · Answer 5 · 2009-04-20T09:39:42+0000

(1) You cannot say that a string "is spam" or "is not spam" in the same way that you can say whether a number is prime. It is not black or white.

(2) It is wrong, and certainly not the purpose of TDD, to write string processing functions using only the very examples used for the tests. The examples should represent a kind of value. TDD does not protect against silly implementations, so you should not pretend that you have no clue at all, and therefore you should not write return text == "Cheap generic viagra".

Silverfish · Answer 6 · 2009-04-20T11:20:48+0000

It seems to me that with a Bayesian spam filter you should be using existing methods. In particular, you would use Bayes' theorem and probably some other probability theory.

In that case, the best approach is to choose your algorithm based on these methods, which should be either tried and tested or, possibly, experimental. Your unit tests should then be designed to check whether IsSpam correctly implements the algorithm you chose, along with a basic test that the result is between 0 and 1.
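A minimal sketch of what such tests might look like, assuming the chosen algorithm combines per-word spam probabilities in the naive Bayes way with equal priors. The name SpamScore.Combine and the clamping bounds 0.01 and 0.99 are hypothetical choices for this example; the log-odds form is used only to avoid underflow when many words are combined.

```csharp
using System;
using System.Collections.Generic;

public static class SpamScore
{
    // Combines per-word P(spam | word) values into one score, assuming the words
    // are independent and spam/ham are equally likely a priori:
    //   score = prod(p) / (prod(p) + prod(1 - p)), computed in log-odds form.
    public static double Combine(IEnumerable<double> wordProbabilities)
    {
        double logOdds = 0.0;
        foreach (var p in wordProbabilities)
        {
            // Clamp so that a single 0 or 1 cannot produce log(0) (bounds are an assumption).
            double clamped = Math.Min(0.99, Math.Max(0.01, p));
            logOdds += Math.Log(clamped / (1.0 - clamped));
        }
        return 1.0 / (1.0 + Math.Exp(-logOdds));
    }
}

public static class Program
{
    private static void Check(bool condition, string name)
    {
        Console.WriteLine((condition ? "PASS " : "FAIL ") + name);
    }

    public static void Main()
    {
        double spammy = SpamScore.Combine(new[] { 0.9, 0.9 });
        double neutral = SpamScore.Combine(new[] { 0.5, 0.5, 0.5 });
        double extreme = SpamScore.Combine(new[] { 1.0, 0.0, 1.0, 1.0 });

        Check(extreme >= 0.0 && extreme <= 1.0, "score stays between 0 and 1");
        Check(Math.Abs(neutral - 0.5) < 1e-9, "neutral words give a neutral score");
        Check(spammy > 0.9, "two spammy words reinforce each other");
    }
}
```

These checks verify that the formula is implemented as intended. They say nothing about whether the formula is a good spam detector, which is the separate question discussed next.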

The point is that your unit tests are not designed to check whether your algorithm is sensible. You should either know that already, or perhaps your program is designed as an experiment to find out whether it is sensible.

That is not to say that how well the IsSpam function performs is unimportant. But this does not have to be part of unit testing. That data can come from alpha-testing feedback, new theoretical results, or your own experiments. In that case, a new algorithm may be required, and new unit tests will be needed.

See also this question about testing random number generators.

Stompchicken · Answer 7 · 2009-04-23T10:08:02+0000

The problem here is not with test-driven development, but with your tests. If you start out developing code against a single test, then all your test does is specify a string-checking function.

The main idea of TDD is to think about your tests before writing code. You cannot exhaustively test a spam filter, but you can come up with a reasonable approximation using tens or hundreds of thousands of test documents. In the presence of that many tests, the naive Bayes algorithm is a simpler solution than a hundred-thousand-line switch statement.

In reality, you may not be able to pass 100% of your unit tests, so you just have to try to pass as many as possible. You also have to make sure your tests are sufficiently realistic. If you think about it this way, test-driven development and machine learning have a lot in common.

Tim Ottinger · Answer 8 · 2010-03-02T18:56:24+0000

The problem you are describing is theoretical: that by adding cruft in response to tests you will end up with a big, messy ball of mud. The thing you are missing is very important.

Cycle: Red → Green → Refactor

You do not just bounce between red and green. As soon as your tests pass (green), you refactor your production code and your tests. Then you write the next failing test (red).

If you are refactoring, you are eliminating duplication, mess, and cruft as it grows. You will quickly get to the point of extracting methods, building scoring and rating, and probably bringing in external tools. You will do this as soon as it is the simplest thing that will work.

Don't just bounce back and forth between red and green, or all your code will be muck. The refactoring step is not optional or discretionary. It is essential.

Frank Schwieterman · Answer 9 · 2010-03-02T19:14:50+0000

I don't think that checking whether a particular string is spam is really a unit test; it is more of a customer test. There is an important difference, since it is not really a red/green type of thing. In practice you should have a couple of hundred test documents. Initially, some will be classified as spam, and as you improve the product, the classifications will more closely match what you want. So you should write a custom application that loads a bunch of test documents, classifies them, and then scores the overall result. When you have finished that customer test, the score will be very bad, because you have not implemented the algorithm yet. But you now have a means of measuring progress going forward, and this is very valuable, given the amount of learning / changes / experimentation you can expect in the future.
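A minimal sketch of such a scoring application follows. All names here (LabeledDocument, Score, Evaluator) are hypothetical, and the four-document in-memory corpus only stands in for the few hundred real documents that would normally be loaded from disk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: a test document with the classification we want.
public class LabeledDocument
{
    public string Text;
    public bool IsSpam;
    public LabeledDocument(string text, bool isSpam) { Text = text; IsSpam = isSpam; }
}

// Hypothetical type: the overall result of one evaluation run.
public class Score
{
    public int Total;
    public int Correct;
    public int FalsePositives; // legitimate mail flagged as spam
    public int FalseNegatives; // spam that got through
    public double Accuracy { get { return Total == 0 ? 0.0 : (double)Correct / Total; } }
}

public static class Evaluator
{
    // Runs any classifier over the corpus and tallies how it did.
    public static Score Evaluate(IEnumerable<LabeledDocument> documents, Func<string, bool> isSpam)
    {
        var score = new Score();
        foreach (var doc in documents)
        {
            bool predicted = isSpam(doc.Text);
            score.Total++;
            if (predicted == doc.IsSpam) score.Correct++;
            else if (predicted) score.FalsePositives++;
            else score.FalseNegatives++;
        }
        return score;
    }
}

public static class Program
{
    public static void Main()
    {
        var corpus = new List<LabeledDocument>
        {
            new LabeledDocument("Cheap generic viagra", true),
            new LabeledDocument("Online viagra pharmacy", true),
            new LabeledDocument("Lunch at noon tomorrow?", false),
            new LabeledDocument("Minutes from Monday's meeting", false),
        };

        // Placeholder classifier used before any algorithm exists.
        Func<string, bool> notImplementedYet = text => false;

        var score = Evaluator.Evaluate(corpus, notImplementedYet);
        Console.WriteLine("accuracy={0:P0} falsePositives={1} falseNegatives={2}",
            score.Accuracy, score.FalsePositives, score.FalseNegatives);
    }
}
```

Because the classifier is passed in as a delegate, the same harness can score the placeholder, a keyword check, and the eventual Bayesian filter, which gives a single number to track as the implementation changes.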

As you implement your algorithm (and even the customer test in the first place), you can still do TDD with real unit tests. The first test for a Bayesian filter component will not measure whether a particular string is evaluated as spam, but whether a string is passed through the Bayesian filter component appropriately. Your subsequent tests will then focus on how the Bayesian filter is implemented (structuring nodes properly, applying training data, etc.).

You need a vision of where the product is going, and your tests and implementation should be directed towards that vision. You also cannot just add customer tests blindly; you need to add tests with the overall product vision in mind. Any software development goal will have good tests and bad tests that you can write.
