Analyzing and grouping large sets of objects: what is the best approach?

Over the past few days I have been trying to solve the following task of analyzing a set of objects, and the solutions I have come up with either rely heavily on memory (in some cases they throw OutOfMemory exceptions) or take an incredibly long time to process. I am out of ideas, so I think it is a good idea to post it here. I will describe the problem in detail and explain the logic I have followed so far.

Scenario:

First, we have an object, which we will call Individual, with the following properties:

  • Date
  • Latitude
  • Longitude

Secondly, we have another object, which we will call Group, defined as a set of Individuals who, together, meet the following conditions:

  • All Individuals in the set have dates within 10 days of each other. That is, any two Individuals in the set, compared pairwise, differ by at most 10 days.

  • The distance between each pair of Individuals is less than Y meters.

A group can contain N > 1 Individuals, provided that every pair of Individuals in it meets both conditions. All Individuals are stored in a database; all Groups will be stored in the database as well.

The task:

Now consider a new Individual. The system should check whether the new Individual:

  • Belongs to one or more existing groups.

  • Forms one or more new groups with other Individuals.

Notes:

  • A new Individual can belong to several existing groups, or can create several new groups.

  • Subgroups of Individuals are not allowed. For example, if we have a group containing the Individuals {A, B, C}, there cannot be a group containing {A, B}, {A, C} or {B, C}.

Solution (limited by processing time and memory):

First, we filter the database for all Individuals that meet the initial conditions against the new one. This yields a FilteredIndividuals enumeration containing every Individual that, as we already know, forms a group (of 2) with the new one.
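
For illustration, here is a minimal C# sketch of this filtering step (the Individual class, the DistanceMeters helper and the 10-day/maxMeters thresholds are my assumptions, not the actual schema):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class Individual          // hypothetical model; the real class may differ
    {
        public int Id { get; set; }
        public DateTime Date { get; set; }
        public double Latitude { get; set; }
        public double Longitude { get; set; }
    }

    public static class Matching
    {
        // Keep only Individuals that form a group of 2 with the newcomer:
        // dates within 10 days and distance below maxMeters.
        public static IEnumerable<Individual> FilterCandidates(
            IEnumerable<Individual> all, Individual newcomer, double maxMeters) =>
            all.Where(i =>
                Math.Abs((i.Date - newcomer.Date).TotalDays) <= 10 &&
                DistanceMeters(i.Latitude, i.Longitude,
                               newcomer.Latitude, newcomer.Longitude) < maxMeters);

        // Haversine great-circle distance in meters.
        public static double DistanceMeters(double lat1, double lon1,
                                            double lat2, double lon2)
        {
            const double R = 6371000;                       // mean Earth radius, m
            double dLat = (lat2 - lat1) * Math.PI / 180;
            double dLon = (lon2 - lon1) * Math.PI / 180;
            double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                       Math.Cos(lat1 * Math.PI / 180) * Math.Cos(lat2 * Math.PI / 180) *
                       Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
            return 2 * R * Math.Asin(Math.Sqrt(a));
        }
    }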

In short, a power set is a collection containing all possible subsets of a specific collection. For example, the power set of {A, B, C} is: {[empty], A, B, C, AB, AC, BC, ABC}

Note. The power set produces a new set with 2^N combinations, where N is the length of the original set.
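
For illustration, a common way to enumerate a power set in C# is a bit mask running from 0 to 2^N - 1 (this is my sketch, not the code from the question); it also makes the blow-up, and the hard ceiling of a long mask, explicit:

    // Enumerate every subset of 'items' via an integer bit mask.
    // A long mask cannot go past 63 items, and the 2^N subset count
    // becomes unmanageable long before that.
    static IEnumerable<List<T>> PowerSet<T>(IList<T> items)
    {
        long subsets = 1L << items.Count;        // 2^N
        for (long mask = 0; mask < subsets; mask++)
        {
            var subset = new List<T>();
            for (int bit = 0; bit < items.Count; bit++)
                if ((mask & (1L << bit)) != 0)
                    subset.Add(items[bit]);
            yield return subset;
        }
    }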

The idea of using power sets is as follows:

  • First, we create the power set of the FilteredIndividuals list. This yields all possible combinations of groups within the FilteredIndividuals list. For the purposes of this analysis, and by definition, we can discard every combination with fewer than two Individuals.

  • We check whether the members of each combination satisfy the conditions with each other. If they do, all Individuals in that combination form a group with the new Individual. Then, to avoid subgroups, we eliminate every subset of the proven combination. I do this by building the power set of the proven combination and removing that power set from the original one (the sketch after this list condenses this step).

  • At this stage, we have a list of sets that meet the conditions for creating groups.
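
A condensed sketch of the validation and subset-elimination stage described above, reusing the Individual model and DistanceMeters helper from the earlier sketch (the helper names and the HashSet-of-ids representation are mine; instead of subtracting a second power set, it rejects any candidate that is a subset of an already accepted one, which has the same effect with less allocation):

    // True when every pair in 'combo' satisfies both group conditions.
    static bool IsValidGroup(List<Individual> combo, double maxMeters)
    {
        for (int i = 0; i < combo.Count; i++)
            for (int j = i + 1; j < combo.Count; j++)
            {
                if (Math.Abs((combo[i].Date - combo[j].Date).TotalDays) > 10)
                    return false;
                if (Matching.DistanceMeters(combo[i].Latitude, combo[i].Longitude,
                                            combo[j].Latitude, combo[j].Longitude)
                        >= maxMeters)
                    return false;
            }
        return true;
    }

    // Keep only maximal sets: visiting larger sets first guarantees a
    // superset is accepted before any of its subsets is considered.
    static List<HashSet<int>> KeepMaximal(IEnumerable<HashSet<int>> validSets)
    {
        var accepted = new List<HashSet<int>>();
        foreach (var s in validSets.OrderByDescending(s => s.Count))
            if (!accepted.Any(a => s.IsSubsetOf(a)))
                accepted.Add(s);
        return accepted;
    }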

Before formally creating a group, I check the database for existing groups that contain the same members as the new sets. If I find a match, I discard the newly created set and add the new Individual to the old group. If I do not find a match, these are genuinely new groups, so I add the new Individual to each set and finally create the new groups.
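
In code, that reconciliation step might look roughly like this (FindGroupWithMembers, AddToGroup and CreateGroup stand in for the real data-access layer, which I have not shown here):

    foreach (HashSet<int> members in acceptedSets)
    {
        var existing = FindGroupWithMembers(members);   // DB lookup, assumed
        if (existing != null)
            AddToGroup(existing, newcomer);    // same members: reuse the old group
        else
            CreateGroup(members, newcomer);    // no match: this is a new group
    }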

This solution works well when the FilteredIndividuals enumeration contains fewer than 52 individuals. Beyond that, out-of-memory exceptions are thrown. (I know this is due to the maximum size allowed for the data types involved, but incrementing that size does not help with very large sets. For reference, the largest number of individuals meeting the conditions that I have found is 345.)

Note. I have access to the definition of both objects. If a new property would help reduce processing time, we can add it.

I am using the .NET Framework with C#, but if the language is what needs to change, that is acceptable as long as we can later convert the results into an object that our main system understands.

1 answer

All Individuals in the set have dates within 10 days of each other; that is, any two Individuals, compared pairwise, differ by at most 10 days. The distance between each pair of Individuals is less than Y meters.

So your problem is how to partition these points in three-dimensional space, where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is a suitably scaled version of the Manhattan distance. In particular, you scale Z so that the 10-day threshold corresponds to the maximum distance of Y meters.
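
As a concrete illustration of that scaling (my sketch; the answer gives no code), map each point so that the 10-day time threshold spans the same length as the Y-meter distance threshold:

    // Scale the time axis so that the 10-day threshold has the same
    // length as the maxMeters distance threshold; after this, a single
    // cutoff works for all three coordinates.
    static (double X, double Y, double Z) ToScaledPoint(
        double xMeters, double yMeters, double days, double maxMeters) =>
        (xMeters, yMeters, days * (maxMeters / 10.0));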

One possible shortcut is to divide et impera and classify your points (Individuals) into buckets that are Y meters wide and 10 days long. You do this by dividing their coordinates by Y and their dates by 10 days (you can use Julian dates for this). If an Individual is in bucket (X = 5, Y = 3, Z = 71), then nothing in a bucket with X < 4 or X > 6, Y < 2 or Y > 4, or Z < 70 or Z > 72 can be in the same group, because its distance is certainly above the threshold. This means you can quickly select a subset of 27 "buckets" and only worry about the Individuals inside them.
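
A sketch of that bucketing in C# (the meters-based projection and the Julian day input are assumptions):

    // Bucket key: each cell is maxMeters wide in X and Y, 10 days deep in Z.
    static (long BX, long BY, long BZ) BucketOf(
        double xMeters, double yMeters, double julianDay, double maxMeters) =>
        ((long)Math.Floor(xMeters / maxMeters),
         (long)Math.Floor(yMeters / maxMeters),
         (long)Math.Floor(julianDay / 10.0));

    // Only the 3 x 3 x 3 = 27 cells around the newcomer's bucket can hold
    // candidates; everything else is provably beyond the thresholds.
    static IEnumerable<(long, long, long)> NeighborBuckets(
        (long BX, long BY, long BZ) b)
    {
        for (long dx = -1; dx <= 1; dx++)
            for (long dy = -1; dy <= 1; dy++)
                for (long dz = -1; dz <= 1; dz++)
                    yield return (b.BX + dx, b.BY + dy, b.BZ + dz);
    }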

At this point you can enumerate the possible groups your new Individual might belong to (if you use a database back end, something like the following, where iig is presumably an individuals-to-groups link table:

    SELECT groups.*
    FROM groups
    JOIN iig USING (gid)
    JOIN individuals USING (uid)
    WHERE individuals.bucketId IN (@bucketsId)

) and compare them with the groups your Individual could form with the other candidates:

    SELECT individuals.id
    FROM individuals
    WHERE bucketId IN (@bucketsId)
      AND ((x - @newX) * (x - @newX) + (y - @newY) * (y - @newY)) < @YSquared
      AND ABS(z - @newZ) < 10

This approach is not extremely efficient (it depends on the database, and you will want an index on bucketId at the very least), but it has the advantage of using as little memory as possible.

In some databases with geographic extensions, you may be able to use native latitude and longitude functions instead of doing the conversion to meters yourself.
