Algorithm for counting shared group memberships with big data

I need to write a program that counts the number of times two users are in the same group. Users are identified by username and groups by ID. For example, with this input (stored in a text file):

 john 32
 john 21
 jim 21
 jim 32
 bob 32

I want to get the result:

 john-jim 2
 john-bob 1
 jim-bob 1

That sounds trivial. But the problem is that I have 1.8 million groups and 300,000 users, and a lot of memberships (I expect an average of at least 50 per user, possibly a lot more). This means a HUGE amount of data and processing.

I wrote 5 different programs to do this, none of which could cope with the amount of data: it was too slow as a PostgreSQL query. Holding the counts in a Map in Java's working memory used too much memory (first a heap space error, then, after optimization, the rare "GC overhead limit exceeded"). Writing to the database from Java was too slow (even when optimized with batch queries). Growing desperate, I tried some more exotic things, such as writing all the pairs to an array, sorting it (O(n log n)) and then counting them little by little. But it was still too much data to hold in memory.
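For reference, the straightforward in-memory version looks roughly like this (a simplified sketch, not my actual code; the input file name is just an example). The pairCounts map is exactly what blows up the heap at this scale:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class NaivePairCount {
    public static void main(String[] args) throws IOException {
        // Collect the members of each group.
        Map<String, List<String>> groups = new HashMap<>();
        try (Scanner in = new Scanner(Paths.get("input.txt"), StandardCharsets.UTF_8.name())) {
            while (in.hasNext()) {
                String user = in.next();
                String group = in.next();
                groups.computeIfAbsent(group, g -> new ArrayList<>()).add(user);
            }
        }
        // Count every pair of users that share a group.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> members : groups.values()) {
            Collections.sort(members);  // canonical order for the "a-b" key
            for (int i = 0; i < members.size(); i++) {
                for (int j = i + 1; j < members.size(); j++) {
                    pairCounts.merge(members.get(i) + "-" + members.get(j), 1, Integer::sum);
                }
            }
        }
        pairCounts.forEach((pair, count) -> System.out.println(pair + " " + count));
    }
}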

Any ideas on an algorithm for this? Or is it impossible?

+6
3 answers

An RDBMS specializes in operations like sorting. Doing this outside the database is unlikely to ever come close in performance. Do it with SQL!

This would do the job (simplified version):

 SELECT t1.usr || '-' || t2.usr, count(*) AS ct
 FROM   usr_grp t1
 JOIN   usr_grp t2 USING (grp_id)
 WHERE  t2.usr > t1.usr  -- prevent dupes and get sorted pair
 GROUP  BY t1.usr, t2.usr;

Depending on how much overlap you have, this potentially creates a huge number of rows, as you said. So it will never be fast.

The question is: what is the purpose of creating millions of rows that nobody can process? Are you sure the operation makes sense to begin with?

To make it faster, you could ..

  • Upgrade! PostgreSQL 8.4 is getting old. In particular, PostgreSQL 9.2 had its focus on big data. You can expect much better performance for a job like this.
    And nobody should be running 8.4.0. That's for security reasons alone, but you are also missing out on a lot of bug fixes. The current point release is 8.4.17. Quoting the linked website:

We always recommend that all users run the latest available release for any major version.

  • Use an integer as surrogate key for users, so you deal only with integers in usr_grp . That makes the table and its indexes smaller and faster to process. If the n:m table ( usr_grp ) has a much bigger cardinality than the usr table, this should be faster, even though it means additional joins.

 SELECT u1.usr || '-' || u2.usr, count(*) AS ct
 FROM   usr_grp t1
 JOIN   usr_grp t2 USING (grp_id)
 JOIN   usr u1 ON t1.usr_id = u1.usr_id
 JOIN   usr u2 ON t2.usr_id = u2.usr_id
 WHERE  t2.usr_id > t1.usr_id
 GROUP  BY u1.usr_id, u2.usr_id;
  • Create a multicolumn index (if you don't have one already).
    grp_id must come first. Why does that matter?

  CREATE INDEX usr_grp_gu_idx ON usr_grp(grp_id, usr_id); 

Test case

I took the numbers @OldCurmudgeon reported for his test case and created a comparable test case in PostgreSQL.

-> SQLfiddle demo

~250 ms in this public test database.
The result is not ordered (no ORDER BY ), since that was not specified.
Compared to the 2.5 minutes reported below, that is a factor of 600.

+7

How about using your file system?

For each entry, open a file named after the group ID and append the username. You end up with one file per group.

You now have, for example:

 Group-21.txt
   jim
   john
 Group-32.txt
   bob
   jim
   john
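A minimal single-threaded sketch of this first phase (the input is the two-column "user group" file from the question; the file and directory names are just examples):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;

public class SplitByGroup {
    public static void main(String[] args) throws IOException {
        Path outDir = Paths.get("groups");
        Files.createDirectories(outDir);
        // Input is the "user group" pairs from the question, e.g. "john 32 john 21 ...".
        try (Scanner in = new Scanner(Paths.get("input.txt"), StandardCharsets.UTF_8.name())) {
            while (in.hasNext()) {
                String user = in.next();   // e.g. "john"
                String group = in.next();  // e.g. "32"
                // Append the username to Group-<id>.txt (created on first use).
                Path groupFile = outDir.resolve("Group-" + group + ".txt");
                Files.write(groupFile,
                        (user + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}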

Now run through all the files, generating every pair of user names in each (I would sort the names and perform a standard combination process on them). For each pair, append a "1" to a file with a matching name.

You now have, for example:

 User-jim-john.txt
   11
 User-bob-jim.txt
   1
 User-bob-john.txt
   1

You now have the pairs in the file names and the counts (in unary, so all you really need is the file size in bytes) in the files.
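A matching sketch of the second phase, again single-threaded and with example directory names; the final loop prints each pair's count by reading its file size:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class CountPairsByFileSize {
    public static void main(String[] args) throws IOException {
        Path pairDir = Paths.get("pairs");
        Files.createDirectories(pairDir);
        // For every group file, emit each pair of members by appending one byte to a per-pair file.
        try (Stream<Path> groupFiles = Files.list(Paths.get("groups"))) {
            for (Path groupFile : (Iterable<Path>) groupFiles::iterator) {
                List<String> members = Files.readAllLines(groupFile, StandardCharsets.UTF_8);
                Collections.sort(members);  // sorted names give a canonical "a-b" pair order
                for (int i = 0; i < members.size(); i++) {
                    for (int j = i + 1; j < members.size(); j++) {
                        Path pairFile = pairDir.resolve(
                                "User-" + members.get(i) + "-" + members.get(j) + ".txt");
                        Files.write(pairFile, new byte[]{'1'},  // unary: one byte per shared group
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
            }
        }
        // The count for each pair is simply the size of its file in bytes.
        try (Stream<Path> pairFiles = Files.list(pairDir)) {
            for (Path pairFile : (Iterable<Path>) pairFiles::iterator) {
                System.out.println(pairFile.getFileName() + " " + Files.size(pairFile));
            }
        }
    }
}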

Almost all of this can be done in parallel, although phase 1 must complete before phase 2 starts. To improve speed: add cores, buy a faster disk. There is no memory limit, just disk.

Added: I just ran some simulation tests of this algorithm using only one thread:

1,800 groups, 300 users and 15,000 memberships, all randomly generated, took about 2.5 minutes. 900 groups, 150 users and 7,500 memberships took 54 seconds.

+2

Whatever the solution, the complexity depends on the number of pairs generated, not necessarily on the number of groups or of individuals. For different group sizes:

  • a group with 10 members produces C(10,2) = 45 pairs;
  • a group with 100 members produces C(100,2) = 4,950 pairs;
  • a group with 1,000 members produces C(1000,2) = 499,500 pairs;
  • a single group with 10,000 members produces about 50 million pairs! One such group can thus dwarf the cost of all the rest of the computation (see the sketch below).
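A trivial sketch that just evaluates C(n,2) = n(n-1)/2 for those sizes:

public class PairsPerGroup {
    // Pairs produced by a single group of n members: C(n,2) = n * (n - 1) / 2.
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (long n : new long[]{10, 100, 1_000, 10_000}) {
            System.out.println(n + " members -> " + pairs(n) + " pairs");
        }
    }
}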

So my first suggestion would be to weed out the very large groups from the dataset, if you can. If you cannot omit the large groups, and you find that the data doesn't fit into memory or takes too long to churn through in a single thread, you can use Map-Reduce to automatically parallelize the computation, as follows. If you start with the group memberships, e.g.:

 32 -> john, jim, bob
 21 -> john, jim

you can use the map step to create all pairs:

 john-jim -> 32, john-bob -> 32, jim-bob -> 32
 john-jim -> 21

These will be grouped for you by pair name. Then, in the reduce step, simply count the occurrences of each key. This assumes that you have enough disk space to store all the pairs.
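For illustration, if you used Hadoop (any Map-Reduce framework would do) and the memberships were already collapsed to one "id -> member, member, ..." line per group as above, the mapper and reducer could look roughly like this (job setup omitted):

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: one input line per group, e.g. "32 -> john, jim, bob";
// emit (pair, group id) for every pair of members in that group.
class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("->");
        String groupId = parts[0].trim();
        String[] members = parts[1].split(",");
        for (int i = 0; i < members.length; i++) {
            members[i] = members[i].trim();
        }
        Arrays.sort(members);  // canonical order for the "a-b" pair key
        for (int i = 0; i < members.length; i++) {
            for (int j = i + 1; j < members.length; j++) {
                context.write(new Text(members[i] + "-" + members[j]), new Text(groupId));
            }
        }
    }
}

// Reduce step: the framework groups the emitted records by pair name;
// the number of shared groups is simply the number of values per key.
class PairCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<Text> groupIds, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (Text ignored : groupIds) {
            count++;
        }
        context.write(pair, new IntWritable(count));
    }
}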

+1
