Discovering the user behind multiple accounts from the words they use

I would like to create an algorithm that identifies people posting on a forum under different pseudonyms.

The goal is to detect users who register a new account in order to post anonymously rather than under their main account.

My basic idea is to take the words each user writes and compare users by how similar those words are.

[Figure: users and the words they use]

As shown in the figure, user3 and user4 use the same words. This suggests that there is probably only one person behind both accounts.

Clearly, many common words are used by all users, so I need to focus on "user-specific" words.

Input (corresponding to the figure above):

<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.

Output:

user1
user2
user3 = user4
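
One way to get from the pairs above to that grouping is sketched below. It is only an illustration, not the asker's method: it assumes Jaccard similarity between each user's word set as the similarity measure, and the function name `group_users` and the threshold of 0.5 are arbitrary choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

def group_users(pairs, threshold=0.5):
    """Group users whose word sets have Jaccard similarity >= threshold."""
    words_by_user = defaultdict(set)
    for word, user in pairs:
        words_by_user[user].add(word)

    # Union-find, so that users linked by similarity end up in one group.
    parent = {u: u for u in words_by_user}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    for a, b in combinations(sorted(words_by_user), 2):
        wa, wb = words_by_user[a], words_by_user[b]
        jaccard = len(wa & wb) / len(wa | wb)
        if jaccard >= threshold:
            parent[find(b)] = find(a)

    groups = defaultdict(list)
    for u in sorted(words_by_user):
        groups[find(u)].append(u)
    return sorted(groups.values())

# The (word, user) pairs from the question.
pairs = [
    ("word1", "user1"), ("word2", "user1"),
    ("word2", "user2"), ("word3", "user2"), ("word4", "user2"),
    ("word5", "user3"), ("word5", "user4"),
]
print(group_users(pairs))
# [['user1'], ['user2'], ['user3', 'user4']]
```

Comparing every pair of users is quadratic in the number of users, so this form only suits small forums. It also does nothing yet about words that everybody uses.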

Java, , .

, ?

1) /? ?

2) , ? - . , , . , " "

3) ? - - ?

.

+5
3

. (unigram, bigram, ,...) . , 0 1 (), , , , . , - . LM.

, . , KL- : , .. / , .

+1

, , , this, . .

1. /

- , . , Apple , , "", "", "iPhone" .., , , , (POS) . , , , , . , , , - . , "a" "the", POS- (, , - ), ( " " " " " " ) . , - - , . - , . , , .

, , . , 2 - .

2.

, , . , , , , - :

entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))

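Here x is a word, U is the set of users, P(Ui|x) is the conditional probability of the i-th user given word x, and sum runs over all users.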
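
As a rough sketch, the entropy formula can be evaluated directly from (word, user) pairs. Two assumptions are mine rather than the answer's: P(Ui|x) is estimated as the share of the occurrences of x that belong to user i, and the natural logarithm is used. The word "the" is a made-up stand-in for a word that everyone uses.

```python
import math
from collections import Counter, defaultdict

def word_entropy(pairs):
    """Return {word: entropy of P(user | word)} from (word, user) pairs."""
    counts = defaultdict(Counter)
    for word, user in pairs:
        counts[word][user] += 1

    entropies = {}
    for word, per_user in counts.items():
        total = sum(per_user.values())
        # P(Ui|x) is estimated as a relative frequency (assumption).
        entropies[word] = sum(
            -(n / total) * math.log(n / total) for n in per_user.values()
        )
    return entropies

pairs = [
    ("the", "user1"), ("the", "user2"), ("the", "user3"), ("the", "user4"),
    ("word5", "user3"), ("word5", "user4"),
    ("word1", "user1"),
]
ent = word_entropy(pairs)
# Lowest entropy first: these words are the most user-specific.
for word in sorted(ent, key=ent.get):
    print(word, round(ent[word], 3))
```

A word used by everyone equally gets the highest entropy, and a word used by a single user gets zero, so sorting by entropy gives a ranking from user-specific to common words.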

, , , .

3.

, . , - . . cell [3][12] # 3, # 12 ( , - !).

, , . . 1000 , 90% 0, , - .

+2

/? ?

, , - , . , , , , . - :

<word: <user#1, user#4, user#5, ...> >
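
The structure above reads as a mapping from each word to the users who used it. A minimal sketch of that mapping follows, assuming the input arrives as (word, user) pairs as in the question. The helper names `build_index` and `shared_word_counts` are invented for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(pairs):
    """Map each word to the set of users who used it."""
    index = defaultdict(set)
    for word, user in pairs:
        index[word].add(user)
    return index

def shared_word_counts(index):
    """Count, for each pair of users, how many distinct words they share."""
    shared = Counter()
    for users in index.values():
        for a, b in combinations(sorted(users), 2):
            shared[(a, b)] += 1
    return shared

pairs = [
    ("word1", "user1"), ("word2", "user1"),
    ("word2", "user2"), ("word3", "user2"), ("word4", "user2"),
    ("word5", "user3"), ("word5", "user4"),
]
index = build_index(pairs)
print(sorted(index["word5"]))        # ['user3', 'user4']
print(dict(shared_word_counts(index)))
```

Because only users who appear under the same word are ever paired, this avoids comparing users who have nothing in common.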

, ?

, -. , ? , stackoverflow, .

?

In addition to similarity or word-based methods, you can also try looking at user interactions. For example, user3 likes, upvotes, or comments on every post by user8, or a new user behaves this way toward some other (older) user.
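
A hedged sketch of that interaction signal: for each (actor, author) pair, compute the fraction of the author's posts that the actor liked, upvoted, or commented on. The data layout (`posts` as a post id to author mapping, `interactions` as (actor, post id) pairs) and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def interaction_coverage(posts, interactions):
    """posts: {post_id: author}; interactions: iterable of (actor, post_id).

    Returns {(actor, author): fraction of the author's posts the actor
    interacted with}. Self-interactions are ignored.
    """
    posts_by_author = defaultdict(set)
    for post_id, author in posts.items():
        posts_by_author[author].add(post_id)

    touched = defaultdict(set)
    for actor, post_id in interactions:
        author = posts[post_id]
        if actor != author:
            touched[(actor, author)].add(post_id)

    return {
        pair: len(ids) / len(posts_by_author[pair[1]])
        for pair, ids in touched.items()
    }

# Hypothetical data: user3 interacts with every post by user8.
posts = {1: "user8", 2: "user8", 3: "user8", 4: "user5", 5: "user5"}
interactions = [
    ("user3", 1), ("user3", 2), ("user3", 3), ("user3", 4), ("user5", 1),
]
coverage = interaction_coverage(posts, interactions)
print(coverage[("user3", "user8")])  # 1.0
```

A coverage near 1.0 over many posts is only a hint, since genuine fans behave the same way, so it is probably best combined with the word-based signals.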

0