MySQL architecture for n * (n - 1) / 2 algorithm

I am currently developing a site where users can search for other users based on attributes (age, height, city, education, etc.). Now I want to implement some kind of rating between user profiles. The rating is calculated by a proprietary algorithm based on the similarity of the two profiles: User A has a “compatibility rating” of 85 with User B and 79 with User C, for example, while B and C have a rating of 94 with each other, and so on.

The user should be able to search for specific attributes and filter the results by rating.

Since the rating differs from profile to profile, and also depends on the user performing the search, I can’t just add a field to my users table and use ORDER BY. So far I have come up with 2 solutions:

  • My first idea was an overnight batch job that calculates the rating for every possible pair of users and stores it in a separate table (user1, user2, rating). I can then join this table with the users table and order the result by rating (a rough sketch of this schema appears after this list). After doing the math, I decided this solution does not scale well.

    Based on the formula n * (n - 1) / 2, there are 45 possible pairs for 10 users. For 1,000 users, I would suddenly have to insert 499,500 rating rows into my rating table.

  • The second solution was to leave MySQL out of it and calculate the rating on the fly in my application. That does not scale well either. Say a search should return only 100 results in the user interface (highest rating on top). If I have 10,000 users and I search for every user living in New York, sorted by rating, I have to load EVERY user who lives in New York into my application (say 3,000), apply the algorithm, and then return only the top 100. So I have fetched 2,900 useless user objects from the database and wasted CPU on the algorithm without doing anything with the result.
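Roughly, the first approach would look something like this (table and column names are only illustrative, and each pair is stored once with user1 < user2):

```sql
-- Solution 1: precomputed rating for every pair of users.
CREATE TABLE user_rating (
    user1  INT UNSIGNED NOT NULL,
    user2  INT UNSIGNED NOT NULL,       -- convention: user1 < user2
    rating TINYINT UNSIGNED NOT NULL,   -- e.g. 0..100
    PRIMARY KEY (user1, user2),
    KEY idx_user2 (user2)
);

-- Search as seen by user 42: users in New York, best matches first.
-- The OR makes the pair usable in both directions; alternatively the
-- table could store every pair twice to keep the join simple.
SELECT u.*, r.rating
FROM users u
JOIN user_rating r
  ON (r.user1 = 42 AND r.user2 = u.id)
  OR (r.user1 = u.id AND r.user2 = 42)
WHERE u.city = 'New York'
ORDER BY r.rating DESC
LIMIT 100;
```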

Any ideas on how I can design this in my MySQL database or web application so that each user can have an individual rating with every other user, and the system still scales to several thousand users?

+7
3 answers

If you need to match each user against every other user, the algorithm is O(N^2) no matter what you do.

If you can use some kind of one-dimensional “metric”, you could try to associate each user with a single synthetic value and sort by that. But this is awkward and may not be possible.

What you can do, though, is note which users change their profiles (i.e., when any of the attributes the matching is based on changes). At that point you only need to recalculate the table for those users, which is O(N) work per changed user: if you have 10,000 users and only 10 need recalculation, you have to examine 100,000 rows instead of 100,000,000.
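A minimal sketch of that incremental recalculation, assuming the pair table (user1, user2, rating) from the question and a rating computed by the application (all names and numbers below are placeholders):

```sql
-- User 42 changed an attribute the rating depends on: drop only that
-- user's pairs and recompute them, roughly O(N) work instead of O(N^2).
DELETE FROM user_rating
WHERE user1 = 42 OR user2 = 42;

-- The application recomputes the ~N-1 affected pairs and bulk-inserts them.
INSERT INTO user_rating (user1, user2, rating)
VALUES (1, 42, 73), (2, 42, 88), (42, 57, 61);   -- ... one row per other user
```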

Another strategy is to run the full algorithm only for pairs that have a better chance of matching: in your example, “same city”. Or, when updating rows (this would require storing (user_1, user_2, ranking, last_calculated)), recalculate only those pairs that have a high rating, are very old, or have never been calculated.
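Under that strategy the pair table carries a timestamp, and a periodic job refreshes only the pairs worth refreshing; for instance (the thresholds are arbitrary examples):

```sql
ALTER TABLE user_rating
    ADD COLUMN last_calculated DATETIME NULL;

-- Candidates for the next recalculation pass: promising, stale, or never computed.
SELECT user1, user2
FROM user_rating
WHERE rating >= 80
   OR last_calculated IS NULL
   OR last_calculated < NOW() - INTERVAL 30 DAY
LIMIT 10000;
```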

UPDATE

The problem also requires O(N^2) disk space.

How do we reduce that space? I can see two approaches. One is not to put every pair in the correspondence table at all: the stricter and more selective the “match” function is, the more meaningful a stored match becomes; having ten thousand “good matches” would mean a match counts for very little. We would still need some recalculation when User1 changes key data, in case some of User1’s previously rejected pairs move back into the acceptable zone. But we would keep a much smaller set of active matches per user.

Storage would still grow quadratically, but much more slowly.

The other strategy would be to recalculate the match on the fly. We would then need some way to quickly select which users are likely to be a good match (thus limiting the number of rows pulled in by the JOIN), and some way to quickly compute the match itself; that may mean approximating the correspondence between User1 and User2 with a very simple function of a subset of DataUser1 and DataUser2 (possibly using additional, precomputed columns).

The trick would be to use MySQL’s capabilities and offload some of the computation to the MySQL engine.

To do that, you could possibly “map” some of the data at insert time (so in O(k)) into spatial values or into strings and use a Levenshtein-style distance.

Storage per user would grow, but linearly rather than quadratically, and MySQL SPATIAL indexes are very efficient.
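A minimal sketch of that idea (MySQL 5.7+), assuming the application can squash the matching attributes into a two-dimensional synthetic point at write time; the attributes used and the search box are purely illustrative:

```sql
-- One synthetic point per user, computed by the application on INSERT/UPDATE.
ALTER TABLE users ADD COLUMN profile_point POINT;

UPDATE users SET profile_point = POINT(age, height_cm);   -- placeholder mapping

-- SPATIAL indexes require a NOT NULL column.
ALTER TABLE users
    MODIFY profile_point POINT NOT NULL,
    ADD SPATIAL INDEX idx_profile_point (profile_point);

-- Candidate matches: users whose point falls inside a box around the
-- searching user's point; the exact rating is computed only for these rows.
SELECT id
FROM users
WHERE city = 'New York'
  AND MBRContains(
        ST_GeomFromText('POLYGON((20 150, 40 150, 40 190, 20 190, 20 150))'),
        profile_point);
```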

+3

If a search should return only the top 100 matches, why not just store those? It sounds like you will never want to browse the lower end of the results, so just don’t calculate them.

That way your storage is only O(n), not O(n^2), and so are the updates. If someone really does want to see matches beyond the first 100 (and you want to allow that), you can always run that query in real time at that moment.
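A rough sketch of that layout, where only each user’s best matches are materialized (the cap of 100 and all names are just examples):

```sql
-- Only the top ~100 matches per user are stored, so space is O(n).
CREATE TABLE top_match (
    user_id  INT UNSIGNED NOT NULL,
    match_id INT UNSIGNED NOT NULL,
    rating   TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, match_id),
    KEY idx_user_rating (user_id, rating)
);

-- Search for user 42, filtered by attributes, best matches first.
SELECT u.*, t.rating
FROM top_match t
JOIN users u ON u.id = t.match_id
WHERE t.user_id = 42
  AND u.city = 'New York'
ORDER BY t.rating DESC
LIMIT 100;
```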

+2

I agree with everything that Easter says.

If you have a web application and users need to “log in”, then you may be able to calculate that user’s ratings at login time and put them into a temporary table (or into rows of an existing table).

This could work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits in memory. The database engine would then have to perform a full table scan and compute all the ratings for that user.
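A sketch of that login-time calculation, assuming the rating can be expressed (or at least approximated) as a SQL expression; the similarity formula below is only a placeholder, not the real algorithm:

```sql
-- At login, compute the logged-in user's ratings against everyone once.
CREATE TEMPORARY TABLE session_rating (
    other_id INT UNSIGNED NOT NULL PRIMARY KEY,
    rating   INT NOT NULL,
    KEY idx_rating (rating)
);

INSERT INTO session_rating (other_id, rating)
SELECT u.id,
       100 - LEAST(100, 2 * ABS(u.age - me.age) + ABS(u.height_cm - me.height_cm))
FROM users u
JOIN users me ON me.id = 42        -- the user who just logged in
WHERE u.id <> 42;

-- Searches during the session then join against this small table.
SELECT u.*, s.rating
FROM session_rating s
JOIN users u ON u.id = s.other_id
WHERE u.city = 'New York'
ORDER BY s.rating DESC
LIMIT 100;
```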

This should work well enough for a single user logging in, and maybe for two, but it won’t scale very well if you have, say, a dozen users logging in within one second.

Fundamentally, your rating does not scale well. You must compare all users against all users to get results. Whether that happens in a batch (overnight) or in real time (when someone makes a request) does not change the nature of the problem: it will use a lot of computing resources, and several users making requests at the same time will become a bottleneck.

0
