How can I check whether a string is random, or human-generated and pronounceable?

The goal is to identify (possibly) bot-generated usernames.

Suppose you have a username, for example "bilbomoothof". It may be nonsense, but it still contains pronounceable sounds and therefore appears to be human-generated.

I accept that it could have been assembled by accident from a dictionary of syllables or word parts, but assume for the moment that the bot is a bit rubbish.

  • Suppose you have the username "sdfgbhm342r3f". To a person, this is clearly a random string. But can that be determined programmatically?
  • Are there algorithms available (similar to Soundex, etc.) that can identify pronounceable sounds within a string like this?

Solutions applicable to PHP / MySQL are most appreciated.

+52
algorithm mysql nlp spam phonetics
Jul 22 '09 at 9:48
10 answers

I think you could come up with something like this if you limit yourself to sounds that are pronounceable in English. To me (I'm French), words like szczepan or wawrzyniec are unpronounceable and would certainly look random.

But they are actually Polish names (meaning Steven and Lawrence) ...

+16
Jul 22 '09 at 9:59

I agree with Mac. Beyond that, people sometimes have usernames that are not pronounceable, such as qwerty or rtfmorleave.

Why bother with this?

<outdated and false, but I'm not deleting due to comments>

But more than that, no bots use "zetztzgsd" as a username; they have a dictionary of real names, possible nicknames, etc., so I think this would be a waste of your time.

</outdated and false, but I'm not deleting due to comments>

+8
Jul 22 '09 at 10:03

Look into n-gram analysis. It has been used successfully for automatic language detection and works surprisingly well even on very short texts.

An online demo (no longer online) recognized bilbomoothof as English and sdfgbhm342r3f as Nepali. It probably always returns the best match, even when that match is very poor. I think you could train it to distinguish between "pronounceable" and "random".
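To make the training idea concrete, here is a minimal sketch of a character-bigram model in PHP. It is not the demo mentioned above. The function names (train_bigram_model, bigram_score), the add-one smoothing, and the alphabet size of 40 are my own assumptions, and the tiny word list only stands in for a real corpus of known-good usernames.

```php
<?php
// Hypothetical helper: count character bigrams in a list of sample words.
function train_bigram_model(array $words): array {
    $counts = [];
    $totals = [];
    foreach ($words as $word) {
        $w = '^' . strtolower($word) . '$';   // ^ and $ mark the word boundaries
        for ($i = 0; $i < strlen($w) - 1; $i++) {
            $a = $w[$i];
            $b = $w[$i + 1];
            $counts[$a][$b] = ($counts[$a][$b] ?? 0) + 1;
            $totals[$a] = ($totals[$a] ?? 0) + 1;
        }
    }
    return ['counts' => $counts, 'totals' => $totals];
}

// Hypothetical helper: average log-probability per character transition.
// Higher (closer to zero) means "more like the training data".
// $alphabetSize is an assumed constant used for add-one smoothing.
function bigram_score(array $model, string $name, int $alphabetSize = 40): float {
    $w = '^' . strtolower($name) . '$';
    $n = strlen($w) - 1;                      // number of transitions, at least 1
    $sum = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $count = $model['counts'][$w[$i]][$w[$i + 1]] ?? 0;
        $total = $model['totals'][$w[$i]] ?? 0;
        $sum += log(($count + 1) / ($total + $alphabetSize));
    }
    return $sum / $n;
}

// Illustrative training data only; a real model needs far more samples.
$model = train_bigram_model([
    'bilbo', 'frodo', 'samwise', 'gandalf', 'thomas', 'moonlight', 'booth',
    'hoffman', 'robert', 'timothy', 'johnson', 'maria', 'oliver', 'sophie', 'michael',
]);

$pronounceable = bigram_score($model, 'bilbomoothof');
$random        = bigram_score($model, 'sdfgbhm342r3f');
```

With this toy data the first score comes out higher than the second, but where to put a cut-off is something you would have to tune on your own usernames.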

+8
Jul 22 '09 at 11:20

Just use CAPTCHA as part of the registration process.

You will never be able to distinguish real usernames from bot-generated ones without seriously annoying your users.

You will block users with bizarre or non-English names, which will annoy them, and the bots will simply keep trying until they hit on a good username (from a dictionary or other sources - this one is very nice, by the way!).

EDIT: Are you looking for prevention rather than after-the-fact analysis?

The solution is to let someone else manage user identities. For example, you can use a short list of OpenID providers (as SO does) or Facebook Connect, or both. You will know for sure that the users are real and that they have solved at least one CAPTCHA.

EDIT: another idea

Search for the string on Google and check the number of hits. It should not be your only tool, but it is a good indicator. Random strings should, of course, have few or no matches.

+3
Jul 22 '09 at 10:51

Answer to question #1:

Unfortunately, this cannot be done in general, since the Kolmogorov complexity function is not computable. You therefore cannot produce such an algorithm unless you impose some rules on the domain of possible usernames. With such rules you can perform a heuristic analysis and decide, but even then it is really hard to do.

PS: After posting this answer, I came across a service that suggested one way of restricting the username domain: have users use a mailbox at a well-known public domain as their username.

+2
Jul 22 '09 at 9:55

Off the top of my head, you could look for syllables using soundex. This is the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

EDIT: here is a syllable-counting function:

    function count_syllables($word) {
        // Patterns the vowel-group count tends to over-count (subtract one each)
        $subsyl = array('cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$');
        // Patterns the vowel-group count tends to under-count (add one each)
        $addsyl = array('ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}',
                        '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
                        '[^gq]ua[^auieo]', 'dnt$');

        // Based on Greg Fast's Perl module Lingua::EN::Syllables
        // Keep letters only (the class must be a-z, not the two characters a and z)
        $word = preg_replace('/[^a-z]/is', '', strtolower($word));
        // Splitting on consonant runs leaves the vowel groups
        $word_parts = preg_split('/[^aeiouy]+/', $word);
        // Initialise the array so count() is safe when there is no vowel group
        $valid_word_parts = array();
        foreach ($word_parts as $key => $value) {
            if ($value <> '') {
                $valid_word_parts[] = $value;
            }
        }

        $syllables = 0;
        // Thanks to Joe Kovar for correcting a bug in the following lines
        foreach ($subsyl as $syl) {
            $syllables -= preg_match('~' . $syl . '~', $word);
        }
        foreach ($addsyl as $syl) {
            $syllables += preg_match('~' . $syl . '~', $word);
        }
        if (strlen($word) == 1) {
            $syllables++;
        }
        $syllables += count($valid_word_parts);
        $syllables = ($syllables == 0) ? 1 : $syllables;
        return $syllables;
    }

From this very interesting link:

http://www.addedbytes.com/php/flesch-kincaid-function/

+2
Jul 22 '09 at 9:56

You can use a neural network to evaluate whether a nickname looks like one formed from natural language.

Collect two data sets: one of valid nicknames and one of bogus ones. Train a simple backpropagation network with a single hidden layer, using the character values as input. The network will learn to distinguish between strings such as "zrgssgbt" and "zargbyt", since the latter mixes consonants and vowels.

It is important to use real-world examples to get a good discriminator.

+2
Jul 22 '09 at 11:02

I do not know of any existing algorithms for this problem, but I think it can be attacked in one of the following ways:

  • The bot may be rubbish, but you could keep a list of syllables or, more specifically, phonemes, and try to find them in the string. That sounds a bit complicated, though, because you would need to segment the string at different places, etc.
  • There are 5 vowels in the English alphabet and 21 other letters. If the string were randomly generated, you would expect 5/26 * W of its letters (where W is the word length) to be vowels, and significant deviations from this may be suspicious. (If extra characters are allowed, the denominator grows, e.g. 5/31, etc.) You can build on this idea by looking at letter pairs and checking whether each pair occurs with the same probability, etc.
  • Further, you can try to segment the input string around the vowels, for example three letters before a vowel and three letters after it, and check whether each segment makes a recognizable sound by comparing it against phonemes.
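The vowel-ratio idea from the second bullet can be sketched in a few lines of PHP. The function names (vowel_ratio, vowel_deviation) are made up for illustration, "y" is not counted as a vowel, and digits are simply ignored, all of which are my assumptions rather than part of the suggestion above.

```php
<?php
// Hypothetical helper: share of vowels among the letters of a name.
// Returns null when the name contains no letters at all.
function vowel_ratio(string $name): ?float {
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    if ($letters === '') {
        return null;
    }
    $vowels = preg_match_all('/[aeiou]/', $letters); // number of vowel characters
    return $vowels / strlen($letters);
}

// Hypothetical helper: signed distance from the 5/26 share expected of
// uniformly random letters.
function vowel_deviation(string $name): ?float {
    $ratio = vowel_ratio($name);
    return $ratio === null ? null : $ratio - 5 / 26;
}
```

For "bilbomoothof" the ratio is 5/12, while "sdfgbhm342r3f" has no vowels among its nine letters, so its ratio is 0. How large a deviation, and in which direction, should count as suspicious is left open here and would need checking against real data.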
0
Jul 22 '09 at 10:00

In Russian, we have forbidden syllables, such as , and or after a vowel, etc.

However, spam bots just use a database of names, which is why my spam folder is full of strange names that you can only find in history books.

I expect that for English you could also build histograms of syllable distribution (like ETAOIN SHRDLU, but for two-letter or even three-letter syllables), and that a critical density of low-frequency syllables in a single name is a warning sign.
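As a rough sketch of the "density of low-frequency syllables" idea, the PHP below measures the share of two-letter sequences in a name that are not on a short list of common English letter pairs. The list is hand-typed and purely illustrative, and the constant and function names (COMMON_BIGRAMS, rare_bigram_density) are hypothetical. A real table should be derived from a corpus.

```php
<?php
// Illustrative, hand-typed list of frequent English letter pairs (an assumption,
// not a measured frequency table).
const COMMON_BIGRAMS = [
    'th', 'he', 'in', 'er', 'an', 're', 'on', 'at', 'en', 'nd',
    'ti', 'es', 'or', 'te', 'of', 'ed', 'is', 'it', 'al', 'ar',
    'st', 'to', 'nt', 'ng', 'se', 'ha', 'as', 'ou', 'io', 'le',
    've', 'co', 'me', 'de', 'hi', 'ri', 'ro', 'ic', 'ne', 'ea',
    'ra', 'ce', 'li', 'ch', 'll', 'be', 'ma', 'si', 'om', 'ur',
];

// Hypothetical helper: fraction of adjacent character pairs that are NOT in
// the common list. Returns null for names shorter than two characters.
function rare_bigram_density(string $name): ?float {
    $s = strtolower($name);
    $pairs = strlen($s) - 1;
    if ($pairs < 1) {
        return null;
    }
    $rare = 0;
    for ($i = 0; $i < $pairs; $i++) {
        if (!in_array(substr($s, $i, 2), COMMON_BIGRAMS, true)) {
            $rare++;
        }
    }
    return $rare / $pairs;
}
```

With such a short list even ordinary names score fairly high, so the threshold that counts as "critical" has to be calibrated on known-good usernames.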

0
Jul 22 '09 at 10:01

Please note that many large sites suggest usernames of the form [first initial][middle initial][last name][number]. Users then carry these usernames over to other sites, and the first three letters will quite possibly be unpronounceable.

0
Jul 28 '09 at 1:52


