Similarity between users based on voting

Let's say I have a set of users, a set of songs, and a set of votes for each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user2       song3       [score]
user3       song1       [score]
user3       song2       [score]
user3       song3       [score]
user-n      song-n      [score]
=========== =========== =======

What is the most efficient way to calculate the similarity between users based on their song votes? Is there a better way than iterating over every user and every vote for every song?

python database mysql information-retrieval similarity
7 answers

There are two common metrics that can be used to measure the similarity between users:

  • Euclidean distance is exactly what you think it is: imagine an n-dimensional space in which each axis is a song rated by both of the users involved (u1 and u2), and the value along each axis is the score. You can easily calculate the similarity like this:

    for each song rated by both u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add the results together into sum_of_powers . The similarity coefficient is then given by 1 / (1 + sqrt(sum_of_powers)) .

  • Pearson correlation (or correlation coefficient): this is a better approach, which measures how strongly two data sets are related to each other. It uses more complex formulas and needs a bit of statistics background, check it out here: wiki . You have a graph for each pair of users, and you plot a point for each song according to the two scores. For example, if aSong was rated 2 by u1 and 4 by u2, you plot the point (2,4) (assuming u1 is the x axis and u2 is the y axis).

To clarify, you use linear regression to find two coefficients A and B that describe the line minimizing the distance to all the points in the graph. This line has the formula y = Ax + B. If the two users rate songs similarly, the points should lie close to the main diagonal, so A should tend to 1 and B to 0. Do not treat this explanation as complete or as a reference: it lacks rigor and the usual mathematical formalism, and it is only meant to give you an idea.
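As a minimal sketch of both measures described above, assuming the votes are already loaded into a nested dictionary (the `votes` data and all function names here are hypothetical, not from the question):

```python
from math import sqrt

# Hypothetical sample data: {user: {song: score}}
votes = {
    "user1": {"song1": 2, "song2": 3, "song3": 5, "song4": 4},
    "user2": {"song1": 4, "song2": 5, "song3": 7, "song4": 6},
    "user3": {"song1": 2, "song2": 3, "song3": 5},
}

def shared_scores(votes, u1, u2):
    """Return two aligned score lists for the songs both users voted on."""
    shared = sorted(votes[u1].keys() & votes[u2].keys())
    return [votes[u1][s] for s in shared], [votes[u2][s] for s in shared]

def euclidean_similarity(xs, ys):
    """1 / (1 + Euclidean distance); 1.0 means identical scores."""
    if not xs:
        return 0.0  # no songs in common
    sum_of_powers = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 1 / (1 + sqrt(sum_of_powers))

def _moments(xs, ys):
    """Means plus the sums of squared and cross deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, cov, var_x, var_y

def pearson(xs, ys):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    if not xs:
        return 0.0
    _, _, cov, var_x, var_y = _moments(xs, ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who gives every song the same score
    return cov / sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Least-squares coefficients (A, B) of the line y = A*x + B."""
    mean_x, mean_y, cov, var_x, _ = _moments(xs, ys)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    a = cov / var_x
    return a, mean_y - a * mean_x

xs, ys = shared_scores(votes, "user1", "user2")
print(euclidean_similarity(xs, ys), pearson(xs, ys), fit_line(xs, ys))
```

In this sample, user2 scores every song exactly two points higher than user1, so the fitted line has A = 1 and B = 2 and the Pearson coefficient is 1, while the Euclidean similarity is only 0.2. Pearson ignores that constant offset; the distance-based measure does not.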

EDIT: as others have written, there are more complex algorithms for clustering data, such as k-means, but I suggest you start with the simple ones (in fact, you only need something more complex once you realize the results are not good enough).

I recommend the book Programming Collective Intelligence by Toby Segaran. Chapter 3 describes various clustering methods, such as hierarchical clustering and k-means clustering.

Source code for examples is available here.

If you need the most accurate results, then no, you have to go through everything.

If your database is large enough, you can simply take a statistical sample, for example 1,000 to 10,000 users, and compare against that.
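A minimal sketch of drawing such a sample, assuming the user ids fit in memory as a list (`sample_users` and the generated ids are hypothetical):

```python
import random

def sample_users(user_ids, k, seed=None):
    """Pick up to k distinct users uniformly at random."""
    pool = list(user_ids)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(pool, min(k, len(pool)))

# Hypothetical ids standing in for the real user table
all_users = [f"user{i}" for i in range(1, 50001)]
sample = sample_users(all_users, 1000, seed=42)
```

Each user is then compared only against the users in `sample` instead of against everyone.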

You would also be better off adding a few more tables to the database to store the results, and updating them only every so often, rather than calculating everything on the fly.
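A rough sketch of that idea follows. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the `user_similarity` table, its columns, and the helper functions are all assumptions made for illustration. The upsert statement is SQLite syntax; MySQL would need its own form, such as `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3
from itertools import combinations
from math import sqrt

# Hypothetical sample data: {user: {song: score}}
votes = {
    "user1": {"song1": 5, "song2": 3, "song3": 4},
    "user2": {"song1": 4, "song2": 3, "song3": 1},
    "user3": {"song1": 5, "song2": 3, "song3": 4},
}

def similarity(a, b):
    """Euclidean-based similarity over the songs both users voted on."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return 1 / (1 + sqrt(sum((a[s] - b[s]) ** 2 for s in shared)))

def refresh_similarities(conn, votes):
    """Recompute every user pair and store it; run this periodically."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_similarity ("
        "user_a TEXT, user_b TEXT, score REAL, "
        "PRIMARY KEY (user_a, user_b))"
    )
    rows = [
        (a, b, similarity(votes[a], votes[b]))
        for a, b in combinations(sorted(votes), 2)  # each pair once, a < b
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO user_similarity VALUES (?, ?, ?)", rows
    )
    conn.commit()

def most_similar(conn, user, limit=5):
    """Read the stored scores instead of recalculating them."""
    return conn.execute(
        "SELECT CASE WHEN user_a = ? THEN user_b ELSE user_a END, score "
        "FROM user_similarity WHERE user_a = ? OR user_b = ? "
        "ORDER BY score DESC LIMIT ?",
        (user, user, user, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
refresh_similarities(conn, votes)
print(most_similar(conn, "user1"))
```

The expensive pairwise loop then runs only when `refresh_similarities` is called, and lookups become a single indexed query.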

Ilya Grigorik wrote a series on recommendation algorithms, although it focused on Ruby. It appears to be in the machine learning section of his archives, but there is no direct link to that section.

I think many people here are missing the simplicity of the question. He did not say anything about building a rating prediction system. He just wants to calculate the similarity between each user's song rating behavior and every other user's song rating behavior. Pearson's correlation coefficient does exactly that. Yes, you have to iterate over all user/user pairs.

EDIT:

Thinking about it a little more:

Pearson is great if you want the similarity between two users' tastes, but not between how strongly they express them. A user who rates a series of songs 4, 5 and 6 correlates perfectly with another user who rates the same songs 3, 6 and 9. In other words, they have the same "taste" (they rank the songs in the same order), but the second user is much more opinionated. Put differently, the correlation coefficient treats any two rating vectors with a positive linear relationship as equal.

However, if you want the similarity between the actual ratings the users gave each song, you should use the mean squared error between the two rating vectors. This is a purely distance-based metric (linear relationships play no part in the similarity score), so the users who rated 4, 5, 6 and 3, 6, 9 would not get a perfect similarity score.
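The difference can be checked on the two rating vectors from this example. This is a small sketch with hypothetical function names, and it assumes mean squared error as the distance-style error measure:

```python
from math import sqrt

first_user = [4, 5, 6]
second_user = [3, 6, 9]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_squared_error(xs, ys):
    """Average squared difference; 0 means identical ratings."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(pearson(first_user, second_user))             # 1.0
print(mean_squared_error(first_user, second_user))  # 11/3, about 3.67
```

The correlation is perfect because 3, 6, 9 is a linear function of 4, 5, 6 (y = 3x - 9), while the error is nonzero because the individual scores differ by 1, 1 and 3.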

The decision comes down to what you mean by "similar" ...

That's all.

If you want to do this in a rough way without going through all the records, you can use the Jaccard coefficient. Some adaptation is probably needed if you want to take the scores into account. But I think it is the best solution if your system is too large and you don't have time to check all the records.
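A minimal sketch of the Jaccard coefficient, computed over the sets of songs each user voted on. The `liked` helper and its threshold of 4 are one hypothetical way of adapting it to the scores, not something the answer specifies:

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(a & b) / len(union)

def liked(user_votes, threshold=4):
    """Songs the user scored at or above the (assumed) threshold."""
    return {song for song, score in user_votes.items() if score >= threshold}

# Hypothetical sample data: {user: {song: score}}
votes = {
    "user1": {"song1": 5, "song2": 1, "song3": 4},
    "user2": {"song2": 5, "song3": 4, "song4": 2},
}

# Overlap of the songs voted on, ignoring the scores
overlap = jaccard(votes["user1"], votes["user2"])
# Overlap of the songs each user rated highly
liked_overlap = jaccard(liked(votes["user1"]), liked(votes["user2"]))
print(overlap, liked_overlap)
```

Here the two users share two of the four songs they voted on, giving 0.5, but only one of the three songs that either of them rated highly, giving 1/3.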

You should be able to find a good algorithm in this book: The Algorithm Design Manual by Steven Skiena.

The book has a whole bunch of algorithms for various purposes. I think you want a graph clustering algorithm. I don't have my copy of the book at hand, so I can't look it up for you.

A quick Google search turned up a Wikipedia page: http://en.wikipedia.org/wiki/Cluster_analysis . Maybe this will help, but I think the book explains the algorithms more clearly.
