Is there a way to detect strings like putjbtghguhjjjanika?

People search on my website, and some of the searches look like this:

tapoktrpasawe qweasd qwa as aıe qwo ıak kqw qwe qwe qwe a 

My question is: is there a way to detect strings similar to the ones above?

I believe it is impossible to detect 100% of them, but any solution would be welcome :)

Edit: I mean gibberish searches. For example, some people search for strings such as "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect these gibberish searches.

It doesn't matter whether the search returns 0 results or not; I can't rely on that logic.

If I only accept “ordinary words”, some new brands or products would be ignored.

thanks for the help

+53
string algorithm php
Jun 09 '11 at 19:12
7 answers

You could build a model of character-to-character transitions from a bunch of English text. For example, you learn how common it is for an "h" to follow a "t" (quite common). In English, you expect a "u" after a "q". A "q" followed by anything other than "u" has a very low probability, so it should be quite alarming. Normalize the counts in your tables so that you have probabilities. Then, for a query, walk through the matrix and compute the product of the transitions you take, and normalize by the length of the query. When the number is low, you probably have a gibberish query (or something in another language).
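A minimal sketch of the idea above, not the code from the linked repositories. The tiny inline corpus, the add-one smoothing, and the function names are all assumptions for illustration; a real model would be trained on far more text, and the cutoff between "okay" and "gibberish" would have to be tuned on known good and bad queries. Working with log-probabilities avoids underflow when multiplying many small numbers, and dividing by the number of transitions is the length normalization.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Hypothetical toy corpus; use a large body of English text in practice.
TINY_CORPUS = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple sentence written in plain english",
    "people search for products and brands on the website",
    "there is nothing wrong with this thing",
    "i hope that the search engine is working",
]

def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(c for c in text.lower() if c in ALPHABET)

def train(lines, smoothing=1.0):
    """Return log P(next char | current char), with additive smoothing."""
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for line in lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs

def avg_transition_prob(query, log_probs):
    """Length-normalized product of transition probabilities (geometric mean)."""
    chars = normalize(query)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[a][b] for a, b in pairs)
    return math.exp(total / len(pairs))

model = train(TINY_CORPUS)
for q in ["is this thing working", "zxqvkjw pqzx"]:
    print(q, avg_transition_prob(q, model))
```

The English-looking query should score noticeably higher than the keyboard mash; where exactly to put the threshold depends on the training text.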

If you have a bunch of query logs, you might first build a general English text model, and then heavily weight your own queries in that model-training phase.

For background, read up on Markov chains.

Edit: I implemented this in Python here:

https://github.com/rrenaud/Gibberish-Detector

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

    my name is rob and i like to hack    True
    is this thing working?               True
    i hope so                            True
    t2 chhsdfitoixcv                     False
    ytjkacvzw                            False
    yutthasxcvqer                        False
    seems okay                           True
    yay!                                 True
+126
Jun 09 '11 at 19:30

Assuming you mean gibberish searches... This will be more trouble than it's worth. You are providing a search function; let people use it however they please. I'm sure there are algorithms that detect odd groupings of characters, but running them would probably be more resource- and labor-intensive than simply returning no results.

+9
Jun 09 '11 at 19:17

You can do what Stack Overflow does and compute the entropy of the string.

Of course, this is just one of many heuristics SO uses to detect low-quality answers, and it cannot be relied upon to be 100% accurate.
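The answer does not say how the entropy is computed, so the following is only a sketch that assumes character-level Shannon entropy (bits per character); `shannon_entropy` is a made-up name. Low values flag repetitive input such as "qwe qwe qwe", but random keyboard mashing uses many different characters and scores high, so this heuristic alone will not catch it.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

for s in ["aaaaaaaa", "qwe qwe qwe qwe", "putjbtghguhjjjanika"]:
    print(s, round(shannon_entropy(s), 3))
```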

+7
Jun 09 '11

I would think that you can detect these strings the same way you would detect "ordinary words". It's just pattern matching, no?

As for why users are searching for these strings, that is the bigger question. You may be able to stem the gibberish in some other way. For example, if these are comment-spam phrases that people (or a script) are looking for, then install a CAPTCHA.

Edit: Another way to sidestep interpreting the input is to throttle it slightly. Allow one search every 10 seconds or so. (I recall seeing this in forum software, as well as in various places on SO.) This will discourage people from searching for sdfpjheroptuhdfj over and over again, without getting in the way of users who search for and find what they want.
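The throttling idea could look roughly like this. It is only a sketch: `SearchThrottle` and `client_id` are made-up names, the state lives in memory in a single process, and a real site would more likely key on session or IP and keep the timestamps in a shared store.

```python
import time

class SearchThrottle:
    """Allow at most one search per client every `min_interval` seconds."""

    def __init__(self, min_interval=10.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behavior can be tested
        self._last_seen = {}  # client id -> time of last accepted search

    def allow(self, client_id):
        """Return True and record the search if the client may search now."""
        now = self.clock()
        last = self._last_seen.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; rejected searches do not reset the timer
        self._last_seen[client_id] = now
        return True

throttle = SearchThrottle(min_interval=10.0)
print(throttle.allow("visitor-1"))  # first search is accepted
print(throttle.allow("visitor-1"))  # an immediate repeat is rejected
```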

+4
Jun 09 '11 at 19:17

As some people have commented, there are no hits on Google for tapoktrpasawe or putjbtghguhjjjanika (well, now there are, of course), so if you have a way to run a quick Google search through an API, you could throw away any search that got no Google results and did not match the name of one of your products. Why you would want to do this is another question: are you trying to save effort for your search library? Make your manual review of “popular search terms” more meaningful? Or are you just upset by the inexplicable behavior of some people out on the big wide Internet? If it's the last, my advice is to just let it go, even if there is a way to prevent it. Some other oddity will come along.

+3
Jun 09 '11 at 19:36

If the search is over products, you could cache their names or codes and check each query against that list before querying the database. In addition, if your site is aimed at English-speaking users, you could build a dictionary of strings that do not occur in English, for example qwkfagsd. That said, and in agreement with another answer, this will be more resource-intensive than doing nothing.
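A rough sketch of the cached-names check, under the assumption that product names fit in memory as a set of lowercase words. The product names and function names below are made up for illustration.

```python
def make_known_terms(product_names):
    """Build a set of lowercase words taken from the cached product names."""
    terms = set()
    for name in product_names:
        terms.update(name.lower().split())
    return terms

def has_known_term(query, known_terms):
    """True if at least one word of the query matches a cached product word."""
    return any(word in known_terms for word in query.lower().split())

# Hypothetical catalogue; in practice this would be loaded from your own data.
known = make_known_terms(["Acme Rocket Skates", "Zyxel"])
print(has_known_term("zyxel router", known))
print(has_known_term("qwkfagsd", known))
```

Queries with a known term can skip any gibberish check entirely, which protects new or unusual brand names from being rejected.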

0
Jun 09 '11 at 19:17

I think that one consonant followed by a vowel, or two consonants followed by a vowel, usually indicates a pronounceable word. Anything else is probably garbage (with the exception of a very small number of words). I think this common-sense rule would take care of about 98% of the garbage.

Think about it: three consonants in a row can immediately flag the text as garbage.

-1
Aug 28 '17 at 10:01


