How about using your file system?
For each record, open a file named after the group identifier and append the user name to it. You will end up with one file per group.
Now you have - for example:
Group-21.txt
    jim
    john
Group-32.txt
    bob
    jim
    john
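As a rough sketch of this first phase in Python: the input format (a list of group identifier and user name tuples), the function name `write_group_files`, and the temporary output directory are all assumptions made for illustration, not part of the original description.

```python
import os
import tempfile

def write_group_files(records, out_dir):
    """Append each user name to the file named after its group."""
    for group_id, user in records:
        path = os.path.join(out_dir, f"Group-{group_id}.txt")
        # Append mode creates the file on first use and extends it afterwards.
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(user + "\n")

# Hypothetical sample input: (group identifier, user name) records.
records = [(21, "jim"), (21, "john"), (32, "bob"), (32, "jim"), (32, "john")]
out_dir = tempfile.mkdtemp()
write_group_files(records, out_dir)
```

Opening and closing a file for every record is slow; it is kept here only because it mirrors the description most directly.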
Now run through all the files, generating every pair of user names in each one (I would sort the names and perform the standard combination process on them). For each pair, append a "1" to a file named after that pair.
Now you have - for example:
User-jim-john.txt
    11
User-bob-jim.txt
    1
User-bob-john.txt
    1
Now you have the pairs in the file names and the counts in the files (in unary, so all you really need is the file size in bytes).
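The pairing phase and the final read-out might look something like the following in Python. This is a minimal single-threaded sketch: the function names `count_pairs` and `read_counts` are hypothetical, and it assumes user names are safe to use in file names and that each user should count at most once per group.

```python
import itertools
import os
import tempfile

def count_pairs(group_dir, pair_dir):
    """For every group file, append "1" to one file per pair of users."""
    for name in os.listdir(group_dir):
        if not name.startswith("Group-"):
            continue
        with open(os.path.join(group_dir, name), encoding="utf-8") as fh:
            # Sorting makes (jim, john) and (john, jim) map to the same file.
            users = sorted({line.strip() for line in fh if line.strip()})
        for a, b in itertools.combinations(users, 2):
            pair_path = os.path.join(pair_dir, f"User-{a}-{b}.txt")
            with open(pair_path, "a", encoding="utf-8") as out:
                out.write("1")  # one byte per shared group (unary count)

def read_counts(pair_dir):
    """The count for each pair is simply the size of its file in bytes."""
    return {name: os.path.getsize(os.path.join(pair_dir, name))
            for name in sorted(os.listdir(pair_dir))}

# Rebuild the example group files from above in a temporary directory.
group_dir = tempfile.mkdtemp()
pair_dir = tempfile.mkdtemp()
sample = {"Group-21.txt": ["jim", "john"], "Group-32.txt": ["bob", "jim", "john"]}
for file_name, users in sample.items():
    with open(os.path.join(group_dir, file_name), "w", encoding="utf-8") as fh:
        fh.write("\n".join(users) + "\n")

count_pairs(group_dir, pair_dir)
counts = read_counts(pair_dir)
```

With the sample data, `counts` maps `User-jim-john.txt` to 2 and the other two pair files to 1, matching the example above.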
Almost all of this can be done in parallel, although phase 1 must finish before phase 2 begins. To improve speed - add cores - buy a faster disk. There is no memory limit, only a disk limit.
Added: I just ran some simulation tests on this algorithm, using only one thread.
1,800 groups, 300 users and 15,000 memberships, all randomly generated, took about 2.5 minutes. 900 groups, 150 users and 7,500 memberships took 54 seconds.