Compare 10 million people

I need to write a program that compares 10,000,000+ entities against each other. The objects are basically flat rows in a database / CSV file.

The comparison algorithm has to be quite flexible: it is based on a rule engine in which the end user enters the rules, and any object may need to be matched against any other object.

I'm thinking about how I could divide this task into smaller workloads, but I haven't found anything yet. Because the rules are entered by the end user, pre-sorting the DataSet seems impossible.

What I'm doing right now is loading the whole DataSet into memory and processing each element against the others. But it is not very efficient and requires approx. 20 GB of memory (compressed).
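For reference, a minimal sketch of what that naive approach amounts to (the Entity type and rule delegate are hypothetical placeholders, not my actual code): with 10,000,000 entities the inner loop runs on the order of 5 * 10^13 times, which is the core of the problem.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the real data model and rule engine.
public record Entity(long Id, string[] Attributes);
public delegate bool MatchRule(Entity a, Entity b);

public static class NaiveMatcher
{
    public static IEnumerable<(Entity, Entity)> AllPairs(
        IReadOnlyList<Entity> entities, MatchRule rule)
    {
        // Every entity is compared with every other entity exactly once:
        // n * (n - 1) / 2 rule evaluations, i.e. ~5 * 10^13 for n = 10^7.
        for (int i = 0; i < entities.Count; i++)
            for (int j = i + 1; j < entities.Count; j++)
                if (rule(entities[i], entities[j]))
                    yield return (entities[i], entities[j]);
    }
}
```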

Do you have an idea how I can split the load or reduce its size?

thanks

+8
c# algorithm matching
5 answers

If your rules are at the highest level of abstraction (e.g. an arbitrary, unknown comparison function), you cannot achieve your goal: 10^14 comparison operations would run for many centuries.

If the rules are not completely general, I see three solutions for optimizing different cases:

  • If the comparison is transitive and you can compute a hash (someone has already recommended this), do it. The hash can be as complex as your rules =). Find a good hash function, and it may help in many cases (see the sketch after this list).

  • If the entities can be sorted, sort them. For this I would recommend not sorting in place, but building an array of indices (or IDs) of the elements. If your comparison can be expressed in SQL (as I understand it, your data is in a database), you can do this more efficiently on the DBMS side and read back the sorted indices (e.g. 3, 1, 2, meaning the element with ID = 3 is the smallest, ID = 1 is in the middle, and ID = 2 is the largest). Then you only need to compare neighboring elements.

  • If neither of the above applies directly, I would try heuristic sorting or hashing. I would build a hash that does not necessarily identify equal elements uniquely, but can split your data set into groups between which there are definitely no pairs of equal elements. Then all equal pairs are inside the groups, and you can read the groups one by one and run the expensive rule functions on groups of, say, 100 elements rather than 10,000,000 (see the sketch after this list). The other sub-approach is heuristic sorting with the same goal: ensure that equal elements never end up at opposite ends of the data set. After that you can read elements one at a time and compare each with, say, the 1000 previous elements already read and kept in memory. For example, I would keep 1100 elements in memory and evict the 100 oldest each time 100 new ones arrive; this would also optimize the reads from your database. Another variant is possible if your rules contain clauses such as (Attribute1 = Value1) AND (...) or (Attribute1 < Value2) AND (...) or other simple predicates: you can cluster by those criteria first and then compare elements only within the resulting clusters.
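A minimal sketch of the grouping idea from the first and third bullets, assuming a hypothetical Entity type, a user-supplied key function, and a rule delegate (none of these names come from the question): the key must be coarser than the rule, so that matching elements always land in the same bucket, and the expensive rule is then evaluated only within each bucket.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical placeholders for the real data model and rule engine.
public record Entity(long Id, string[] Attributes);

public static class BucketedMatcher
{
    // keySelector must be "coarser" than the rule: if rule(a, b) is true,
    // then keySelector(a) == keySelector(b). Otherwise matches are lost.
    public static IEnumerable<(Entity, Entity)> FindMatches(
        IEnumerable<Entity> entities,
        Func<Entity, string> keySelector,
        Func<Entity, Entity, bool> rule)
    {
        foreach (var group in entities.GroupBy(keySelector))
        {
            var bucket = group.ToList();
            // All-pairs comparison, but only inside the (small) bucket.
            for (int i = 0; i < bucket.Count; i++)
                for (int j = i + 1; j < bucket.Count; j++)
                    if (rule(bucket[i], bucket[j]))
                        yield return (bucket[i], bucket[j]);
        }
    }
}
```

If the buckets stay small (around 100 elements, as suggested above), the total work drops from roughly 10^14 comparisons to (number of buckets) × (bucket size)², which is tractable.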

By the way, what if your rule considers all 10,000,000 elements equal? Do you really want 10^14 result pairs? This case shows that you cannot solve the problem in the fully general case. Try to introduce some restrictions and assumptions.

+12

I would try to come up with a hierarchy of rules. For example, let's say that rule A is "Color" and rule B is "Shape".

If you first separate the objects by color, then there is no need to compare the red circle with the blue triangle.

This will reduce the number of comparisons you need to perform.
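A minimal sketch of that hierarchy, assuming hypothetical Color and Shape attributes on an Entity type: the cheap attributes partition the data first, so the detailed comparison never sees a red circle next to a blue triangle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Entity(long Id, string Color, string Shape);

public static class HierarchicalMatcher
{
    public static IEnumerable<(Entity, Entity)> FindMatches(
        IEnumerable<Entity> entities,
        Func<Entity, Entity, bool> detailedRule)
    {
        // Level 1 of the hierarchy: split by Color.
        foreach (var colorGroup in entities.GroupBy(e => e.Color))
            // Level 2: within each color, split by Shape.
            foreach (var shapeGroup in colorGroup.GroupBy(e => e.Shape))
            {
                var items = shapeGroup.ToList();
                // Only objects sharing Color and Shape reach the expensive rule.
                for (int i = 0; i < items.Count; i++)
                    for (int j = i + 1; j < items.Count; j++)
                        if (detailedRule(items[i], items[j]))
                            yield return (items[i], items[j]);
            }
    }
}
```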

+4

I would generate a hash code from each object. You will probably have to exclude the ID from the hash generation and then test for equality. Once you have the hashes, you can sort all the hash codes; having all the entities in order makes it fairly easy to detect duplicates.
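A minimal sketch of that idea with a hypothetical Entity record: the hash is computed over every field except the Id, candidate duplicates are grouped by that hash, and a full field comparison confirms each pair (hashes alone can collide).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Entity(long Id, string Name, string Address);

public static class HashDeduplicator
{
    // Hash over everything except the Id, since two rows with different Ids
    // can still represent the same real-world entity.
    private static int ContentHash(Entity e) =>
        HashCode.Combine(e.Name, e.Address);

    public static IEnumerable<(Entity, Entity)> FindDuplicates(IEnumerable<Entity> entities)
    {
        // Group by hash so candidate duplicates end up next to each other,
        // then confirm with a full field-by-field comparison.
        foreach (var group in entities.GroupBy(ContentHash))
        {
            var candidates = group.ToList();
            for (int i = 0; i < candidates.Count; i++)
                for (int j = i + 1; j < candidates.Count; j++)
                    if (candidates[i].Name == candidates[j].Name &&
                        candidates[i].Address == candidates[j].Address)
                        yield return (candidates[i], candidates[j]);
        }
    }
}
```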

+1

Comparing each entity with every other entity is rarely effective; you need to partition the data first. There is very little reason to compare completely unrelated things (comparing a piece of clothing with a person makes no sense), so I think your data will lend itself to clustering.

So you need to group the data; try a clustering algorithm such as K-Means.

Also see Apache Mahout
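A minimal K-Means sketch, under the assumption that each entity can first be reduced to a numeric feature vector (the feature extraction itself is not shown and would depend on the user's rules):

```csharp
using System;
using System.Linq;

public static class KMeansSketch
{
    // points[i] is the feature vector of entity i; returns each point's cluster index.
    public static int[] Cluster(double[][] points, int k, int iterations = 10)
    {
        var rng = new Random(0);
        // Initialize centroids with k randomly chosen points.
        double[][] centroids = points.OrderBy(_ => rng.Next()).Take(k)
                                     .Select(p => (double[])p.Clone()).ToArray();
        int[] assignment = new int[points.Length];

        for (int iter = 0; iter < iterations; iter++)
        {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.Length; i++)
                assignment[i] = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(points[i], centroids[c])).First();

            // Update step: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = points.Where((_, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue;
                centroids[c] = Enumerable.Range(0, points[0].Length)
                    .Select(d => members.Average(p => p[d])).ToArray();
            }
        }
        return assignment;
    }

    private static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```

Once every entity has a cluster index, only pairs inside the same cluster need to go through the full rule engine. Apache Mahout provides distributed implementations of the same kind of clustering if the data does not fit on one machine.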

+1

Are you looking for the most suitable algorithm for this? I think divide and conquer seems like a good fit. If the algorithm fits, you can then apply many other computation techniques; in particular, parallel processing with MPICH or something similar could get you to your final destination.

But before deciding how to execute it, you need to consider whether the algorithm is suitable in the first place.
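A minimal sketch of that divide-and-conquer idea using .NET's built-in Parallel.ForEach rather than MPICH (the Entity type, bucket key, and rule delegate are hypothetical placeholders): the data is divided into independent buckets, and each bucket is conquered on its own thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public record Entity(long Id, string[] Attributes);

public static class ParallelMatcher
{
    public static IReadOnlyCollection<(Entity, Entity)> FindMatches(
        IEnumerable<Entity> entities,
        Func<Entity, string> bucketKey,          // divide: split into independent buckets
        Func<Entity, Entity, bool> rule)         // conquer: match within each bucket
    {
        var results = new ConcurrentBag<(Entity, Entity)>();

        Parallel.ForEach(entities.GroupBy(bucketKey), group =>
        {
            var bucket = group.ToList();
            for (int i = 0; i < bucket.Count; i++)
                for (int j = i + 1; j < bucket.Count; j++)
                    if (rule(bucket[i], bucket[j]))
                        results.Add((bucket[i], bucket[j]));
        });

        return results;
    }
}
```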

-1
