You can build a model of character-to-character transitions from a bunch of English text. For example, you will find out how common it is for an "h" to follow a "t" (pretty common). In English, you expect a "q" to be followed by a "u". A "q" followed by anything other than "u" has very low probability, so it should be pretty alarming. Normalize the counts in the table so that each row is a probability distribution. Then, for a query, walk through the matrix and compute the product of the probabilities of the transitions you take, and normalize by the length of the query. When that number is low, you likely have a gibberish query (or something in a different language).
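A minimal sketch of that idea in Python, not the code from the repository linked in this answer. The names `train` and `score`, the 27-symbol alphabet (lowercase letters plus space), and the add-one smoothing are all assumptions made for illustration. It sums log probabilities instead of multiplying raw probabilities to avoid underflow, then takes the average per transition, which is the length normalization.

```python
import math

# Assumed alphabet: lowercase letters plus space; everything else is dropped.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lowercase the text and keep only characters in ALPHABET."""
    return [ch for ch in text.lower() if ch in INDEX]


def train(lines, smoothing=1.0):
    """Return a matrix of log transition probabilities, one row per 'from' character.

    `smoothing` is an assumed add-one style prior so that unseen
    transitions get a small nonzero probability instead of zero.
    """
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    for line in lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[INDEX[a]][INDEX[b]] += 1
    log_probs = []
    for row in counts:
        total = sum(row)
        # Normalize each row so it is a probability distribution.
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


def score(text, log_probs):
    """Average transition probability (geometric mean) over the text."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[INDEX[a]][INDEX[b]] for a, b in pairs)
    # Dividing by the number of transitions normalizes for query length.
    return math.exp(total / len(pairs))
```

In practice you would pick a cutoff on `score` by comparing a set of known good phrases against a set of known gibberish strings; the right threshold depends entirely on the training text.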
If you have a bunch of query logs, you can first build a model of general English text and then heavily weight your own queries in the model training phase.
For background, read about Markov chains.
Edit: I implemented this in Python here:
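One way the weighting could look, as a rough sketch: count transitions from each source and multiply the query-log counts by a larger factor before normalizing. The function name `weighted_bigram_counts`, the sample strings, and the weight of 5.0 are hypothetical; a real weight would have to be tuned on your own data.

```python
from collections import Counter


def weighted_bigram_counts(sources):
    """sources: iterable of (lines, weight) pairs. Returns weighted bigram counts."""
    counts = Counter()
    for lines, weight in sources:
        for line in lines:
            # Keep letters and spaces only, lowercased.
            chars = [ch for ch in line.lower() if ch.isalpha() or ch == " "]
            for pair in zip(chars, chars[1:]):
                counts[pair] += weight
    return counts


# Stand-ins for a large English corpus and for real query logs.
general_text = ["the quick brown fox jumps over the lazy dog"]
query_log = ["python markov chain", "php gibberish detector"]

# Each transition seen in the query log counts 5 times as much (assumed weight).
counts = weighted_bigram_counts([(general_text, 1.0), (query_log, 5.0)])
```

The resulting counts can then be normalized row by row into transition probabilities, exactly as with unweighted counts.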
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
Rob Neuhaus, Jun 09 '11 at 19:30