More efficient algorithm for finding ORs of two sets

Given a matrix with n rows and m columns of 1s and 0s, find the number of pairs of rows that can be selected so that their OR is 111...1 (m ones).

Example:

 1 0 1 0 1
 0 1 0 0 1
 1 1 1 1 0

Answer:

 2 ---> the OR of rows [1,3] and the OR of rows [2,3]

Given that n and m can each be of the order of 3000 (n, m <= 3000), how efficiently can this problem be solved?

PS: I already tried the naive O(n*n*m) method. I was hoping for a better solution.

+7
algorithm binary-operators
5 answers

1. Trivial solution. A trivial algorithm (which you have already found but not posted) is to take all (n choose 2) pairs of the n rows, OR each pair, and see whether the result is all ones. This is O(n^2 * m). The code would look like this:

 for (i = 0; i < n; ++i)
     for (j = i+1; j < n; ++j) {
         try OR of row[i] with row[j] to see if it works; if so, record (i,j)
     }

2. Constant-factor speedup. You can improve the running time by packing the bits into machine words. This still gives the same asymptotics, but in practice roughly a factor-of-64 speedup on a 64-bit machine. This has already been noted in the comments above.
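For concreteness, here is a minimal sketch of that bit-packing idea (my own illustration, not code from this answer), assuming the matrix is given as an int[][] of 0/1 values: each row is packed into ceil(m/64) long words, so testing one pair costs about m/64 word operations instead of m bit operations.

 // Hypothetical helper class for the bit-packing idea; the names are mine.
 public class PackedRows {

     // Pack each 0/1 row into ceil(m/64) machine words.
     static long[][] pack(int[][] rows, int m) {
         int words = (m + 63) / 64;
         long[][] packed = new long[rows.length][words];
         for (int i = 0; i < rows.length; i++)
             for (int c = 0; c < m; c++)
                 if (rows[i][c] == 1) packed[i][c / 64] |= 1L << (c % 64);
         return packed;
     }

     // True if the OR of two packed rows has all m low bits set.
     static boolean orIsAllOnes(long[] a, long[] b, int m) {
         for (int w = 0; w < a.length; w++) {
             int bits = Math.min(64, m - 64 * w);
             long mask = (bits == 64) ? -1L : (1L << bits) - 1;
             if (((a[w] | b[w]) & mask) != mask) return false;
         }
         return true;
     }

     public static void main(String[] args) {
         int[][] rows = { {1, 0, 1, 0, 1}, {0, 1, 0, 0, 1}, {1, 1, 1, 1, 0} };
         long[][] packed = pack(rows, 5);
         int count = 0;
         for (int i = 0; i < packed.length; i++)
             for (int j = i + 1; j < packed.length; j++)
                 if (orIsAllOnes(packed[i], packed[j], 5)) count++;
         System.out.println(count);   // 2 for the question's example
     }
 }

With n = m = 3000 the brute force then needs roughly (3000^2 / 2) * 47 ≈ 2 * 10^8 word operations, which should be manageable in practice.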

3. Heuristic speedup. We can apply heuristics to further improve the time in practice, but without an asymptotic guarantee. Consider sorting your rows by Hamming weight, with the smallest weights at the front and the largest weights at the end (computing the weights costs O(m * n) and sorting O(n * log n)). Then you only need to compare low-weight rows with high-weight rows: specifically, the two weights must sum to >= m. The search would then look something like this:

 for (i = 0; i < n; ++i)
     for (j = n-1; j > i; --j) {   /* go backwards to take advantage of hmwt */
         if ((hmwt(row[i]) + hmwt(row[j])) < m)
             break;
         try OR of row[i] with row[j] to see if it works; if so, record (i,j)
     }

4. A better approach. Another approach that may offer a better payoff is to choose a low-weight column. Then split the rows into two groups: those that have a 1 in this column (group A) and those that have a 0 in it (group B). You then only need to consider pairs in which one row is from group A and the other is from group B, or both are from group A (thanks @ruakh for catching my mistake). Something along these lines should help a lot. Again, this is no better asymptotically in the worst case, but in practice it should be faster (assuming not every pair is an answer).
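Here is a small hypothetical sketch of that split (my code, not the answerer's; rows assumed to be int[] arrays of 0/1): pick the column with the fewest ones, put the rows with a 1 there into group A and the rest into group B, and never compare two rows from group B, since they share a 0 in that column.

 import java.util.ArrayList;
 import java.util.List;

 public class LowWeightColumnSplit {

     static boolean orAllOnes(int[] a, int[] b) {
         for (int c = 0; c < a.length; c++) if (a[c] == 0 && b[c] == 0) return false;
         return true;
     }

     static int countPairs(int[][] rows, int m) {
         // Choose the column with the fewest ones (point 5 below calls this count h).
         int best = 0, bestOnes = Integer.MAX_VALUE;
         for (int c = 0; c < m; c++) {
             int ones = 0;
             for (int[] r : rows) ones += r[c];
             if (ones < bestOnes) { bestOnes = ones; best = c; }
         }
         List<Integer> groupA = new ArrayList<>();   // 1 in the chosen column
         List<Integer> groupB = new ArrayList<>();   // 0 in the chosen column
         for (int i = 0; i < rows.length; i++) (rows[i][best] == 1 ? groupA : groupB).add(i);

         int count = 0;
         // Pairs with one row from each group, then pairs with both rows from group A.
         for (int i : groupA) for (int j : groupB) if (orAllOnes(rows[i], rows[j])) count++;
         for (int x = 0; x < groupA.size(); x++)
             for (int y = x + 1; y < groupA.size(); y++)
                 if (orAllOnes(rows[groupA.get(x)], rows[groupA.get(y)])) count++;
         return count;
     }

     public static void main(String[] args) {
         int[][] rows = { {1, 0, 1, 0, 1}, {0, 1, 0, 0, 1}, {1, 1, 1, 1, 0} };
         System.out.println(countPairs(rows, 5));   // 2 for the question's example
     }
 }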

5. The limits of what can be done. It is easy to construct examples in which the number of pairs that work is O(n^2), so it is very hard to beat O(m*n^2) in the worst case. What we should look for is a solution whose cost is somehow related to the number of pairs that work. The heuristics above go in that direction. If there is a column with a small Hamming weight h, then point 4 above reduces the running time to O(h*n*m + h^2*m). If h is significantly less than n, you get a big improvement.

+3

An extension of TheGreatContini's idea:

First try

Let's view this as searching for pairs in A x B, where A and B are sets of rows. These pairs must satisfy the OR condition, and we also require that the Hamming weight of a is not less than that of b (to avoid some duplicates).

Now split A into A0 (rows starting with 0) and A1 (rows starting with 1). Do the same for B. We have now reduced the problem to three smaller subproblems: A0xB1, A1xB1 and A1xB0. If A and B are the same set, then A0xB1 and A1xB0 are the same, so we only need to do one of them. These three subproblems not only contain fewer pairs in total than the original one, we have also fully checked the first column and can ignore it from now on.

To solve these subproblems, we can recurse with the same approach, now on column 2, 3, ... At some point we will either have checked all the columns, or #A and #B will be 1.

Depending on the implementation, it may be more efficient to stop the recursion earlier and exhaustively check the remaining pairs. Remember that if we have already checked k columns, each remaining pair only costs m-k to check.
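Here is a rough, self-contained sketch of this recursion (my interpretation of the idea, not the answerer's code). It counts ordered pairs over A x B, drops the A0xB0 branch at each column, falls back to a direct check when a subproblem gets small, and handles duplicates by halving the ordered count at the end instead of using the weight ordering described above.

 import java.util.ArrayList;
 import java.util.List;

 public class RecursiveSplit {

     // Counts ordered pairs (x, y) in groupA x groupB whose OR is all ones on columns col..m-1.
     static long countOrdered(List<int[]> groupA, List<int[]> groupB, int col, int m) {
         if (groupA.isEmpty() || groupB.isEmpty()) return 0;
         if (col == m) return (long) groupA.size() * groupB.size();   // every remaining pair works
         if ((long) groupA.size() * groupB.size() <= 8) {             // small subproblem: check directly
             long count = 0;
             for (int[] x : groupA) {
                 for (int[] y : groupB) {
                     boolean ok = true;
                     for (int k = col; k < m && ok; k++) if (x[k] == 0 && y[k] == 0) ok = false;
                     if (ok) count++;
                 }
             }
             return count;
         }
         List<int[]> a0 = new ArrayList<>(), a1 = new ArrayList<>();
         List<int[]> b0 = new ArrayList<>(), b1 = new ArrayList<>();
         for (int[] r : groupA) (r[col] == 0 ? a0 : a1).add(r);
         for (int[] r : groupB) (r[col] == 0 ? b0 : b1).add(r);
         // a0 x b0 can never cover this column, so that branch is dropped.
         return countOrdered(a0, b1, col + 1, m)
              + countOrdered(a1, b0, col + 1, m)
              + countOrdered(a1, b1, col + 1, m);
     }

     // Unordered pairs of distinct rows: the ordered count contains each pair twice,
     // plus one (i, i) "pair" for every row that is already all ones by itself.
     static long countPairs(int[][] rows, int m) {
         List<int[]> all = new ArrayList<>();
         int allOnesRows = 0;
         for (int[] r : rows) {
             all.add(r);
             int ones = 0;
             for (int k = 0; k < m; k++) ones += r[k];
             if (ones == m) allOnesRows++;
         }
         return (countOrdered(all, all, 0, m) - allOnesRows) / 2;
     }

     public static void main(String[] args) {
         int[][] rows = { {1, 0, 1, 0, 1}, {0, 1, 0, 0, 1}, {1, 1, 1, 1, 0} };
         System.out.println(countPairs(rows, 5));   // 2, matching the question's example
     }
 }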

Better Column Selection

As TheGreatContini suggested, instead of splitting on the first column we can split on the column that leads to the smallest subproblems. Finding this column at every step is quite expensive, but the column weights can be computed once at the start and then used as an estimate of the best column. We can then reorder the columns, run the algorithm as usual, and map the results back to the original column order afterwards.

The exact best column would be the one for which the number of zeros in A times the number of zeros in B is maximal.
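As an illustration, a small hypothetical helper for that criterion (the names are mine): score each column by zerosInA * zerosInB, since those are exactly the A0xB0 pairs that splitting on the column eliminates.

 import java.util.List;

 public class ColumnChooser {
     // Returns the column whose split removes the most pairs, i.e. the one
     // maximizing (number of zeros in A) * (number of zeros in B).
     static int bestColumn(List<int[]> groupA, List<int[]> groupB, int m) {
         int best = 0;
         long bestScore = -1;
         for (int c = 0; c < m; c++) {
             long zerosA = 0, zerosB = 0;
             for (int[] r : groupA) if (r[c] == 0) zerosA++;
             for (int[] r : groupB) if (r[c] == 0) zerosB++;
             if (zerosA * zerosB > bestScore) { bestScore = zerosA * zerosB; best = c; }
         }
         return best;
     }
 }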

Hamming Weight Trimming

We know that the sum of the Hamming weights of a and b must be at least m. And since we assumed that a carries the larger weight, we can remove from A all rows whose Hamming weight is less than m/2 (the speedup this gives may be negligible, I'm not sure). Computing all the Hamming weights costs O(m * n).

Efficient splitting

If we sort the rows, the grouping can be done much faster with binary search. This also gives an efficient in-memory representation of the sets: we can simply store the minimum and maximum row index. Sorting can be done in O(n * m * log(n)); each split then takes O(log(n)).

Here is code that will not compile, but should give the right idea.

 private List<Row> rows;

 // Binary search for the first row in [start, end] whose value in the given
 // column is 1; assumes the rows are sorted so that, in this column, all 0s
 // come before all 1s. Returns -1 if there is no such row.
 public int findFirstOne(int column, int start, int end) {
     if (rows.get(start).get(column) == 1) return start;
     if (rows.get(end).get(column) == 0) return -1;
     while (start < end) {
         int mid = (start + end) / 2;
         if (rows.get(mid).get(column) == 0) {
             start = mid + 1;
         } else {
             end = mid;
         }
     }
     return start;
 }

Complexity

In the following calculations the effect of the best-column selection is ignored, since it does not improve the worst case. In the average case, however, it can give a significant improvement by shrinking the search space as early as possible and thereby speeding up the remaining checks.

The running time of the algorithm is bounded by n²m. However, the worst examples I have found all run in O(n * log(n) * m).

First, sorting the rows of the matrix is O(n * log(n) * m), and optionally sorting the columns is O(n * m + m * log(m)).

Then the subproblems are created. First a rough overestimate: we split at most m times, and the cost of a full level of splits at depth i can be overestimated as log(n) * 3^i (the cost of one split times the number of subproblems at that depth). This leads to a sum of O(log(n) * 3^m).

Also, 3^i <= n²/2 must hold, since that is the maximum possible number of pairs, so for large m this caps out at O(n² * log(n) * m). I am struggling to find an example that actually behaves this way.

I think it is reasonable to assume that many of the subproblems become trivial very early, which would bring this down to O(log(n) * m * n) (if anyone wants to verify this, I am not sure about it).

+2

Here's an off-the-wall idea that may have even worse asymptotic (or average) behavior, but it generalizes in an interesting way and at least offers a different approach. The problem can be viewed as an exact cover problem. Each of the n rows defines a set S of values from {1, 2, ..., m}: the column indices in which the row has a 1. The task is then to find a collection of rows whose sets form a disjoint partition of {1, 2, ..., m}. When an exact cover consists of only two rows, those rows are binary complements of the kind you are looking for. However, more complex exact covers are possible, such as this one with three rows:

 0 1 0 0 1
 1 0 0 0 0
 0 0 1 1 0

(These three rows correspond to the sets {2, 5}, {1} and {3, 4}, which together partition {1, ..., 5}.) The exact cover problem asks for all such exact covers and is NP-complete. The canonical solution is Algorithm X, created by Donald Knuth.

+2

If I'm not mistaken, this should be O(n * m):

  • For each column, compute the set of row indices that have a "1" in that column, and store this as a mapping from the column index to the set of row indices
  • For each row, compute the set of row indices that could "complete" the row (by adding a "1" in the columns where the row has a "0"). This can be done by intersecting the sets that were computed in step 1 for the corresponding columns
  • Count the resulting row indices

In your example:

 1 0 1 0 1
 0 1 0 0 1
 1 1 1 1 0
  • The row indices that have a "1" in each column are

    • Column 0: [0, 2]
    • Column 1: [1, 2]
    • Column 2: [0, 2]
    • Column 3: [2]
    • Column 4: [0, 1]
  • The sets of row indices that can "complete" each row (after removing rows that have already been counted) are

    • Row 0: [2]
    • Row 1: [2]
    • Row 2: []

Giving a total of 2.

The main reason one can argue about the running time is that computing the intersections of up to m sets, each of size at most n, could be considered O(m * n) per row; but I think the sizes of these sets will be limited: the entries are either 1 or 0, and when there are many 1s (and the sets are large) there are fewer sets to intersect, and vice versa - but I have not done a rigorous proof here...


The Java implementation I used to play with this (and for some basic "tests"):

 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.LinkedHashMap;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Random;
 import java.util.Set;

 public class SetOrCombinations {

     public static void main(String[] args) {
         List<Integer> row0 = Arrays.asList(1, 0, 1, 0, 1);
         List<Integer> row1 = Arrays.asList(1, 1, 0, 0, 1);
         List<Integer> row2 = Arrays.asList(1, 1, 1, 1, 0);
         List<Integer> row3 = Arrays.asList(0, 0, 1, 1, 1);
         List<List<Integer>> rows = Arrays.asList(row0, row1, row2, row3);
         run(rows);

         for (int m = 2; m < 10; m++) {
             for (int n = 2; n < 10; n++) {
                 run(generateRandomInput(m, n));
             }
         }
     }

     private static void run(List<List<Integer>> rows) {
         int m = rows.get(0).size();
         int n = rows.size();

         // For each column i:
         // Compute the set of rows that "fill" this column with a "1"
         Map<Integer, List<Integer>> fillers = new LinkedHashMap<Integer, List<Integer>>();
         for (int i = 0; i < m; i++) {
             for (int j = 0; j < n; j++) {
                 List<Integer> row = rows.get(j);
                 List<Integer> list = fillers.computeIfAbsent(i, k -> new ArrayList<Integer>());
                 if (row.get(i) == 1) {
                     list.add(j);
                 }
             }
         }

         // For each row, compute the set of rows that could "complete"
         // the row (by adding "1"s in the columns where the row has
         // a "0").
         int count = 0;
         Set<Integer> processedRows = new LinkedHashSet<Integer>();
         for (int j = 0; j < n; j++) {
             processedRows.add(j);
             List<Integer> row = rows.get(j);
             Set<Integer> completers = new LinkedHashSet<Integer>();
             for (int i = 0; i < n; i++) {
                 completers.add(i);
             }
             for (int i = 0; i < m; i++) {
                 if (row.get(i) == 0) {
                     completers.retainAll(fillers.get(i));
                 }
             }
             completers.removeAll(processedRows);
             count += completers.size();
         }

         System.out.println("Count " + count);
         System.out.println("Ref.  " + bruteForceCount(rows));
     }

     // Brute force
     private static int bruteForceCount(List<List<Integer>> lists) {
         int count = 0;
         int n = lists.size();
         for (int i = 0; i < n; i++) {
             for (int j = i + 1; j < n; j++) {
                 List<Integer> list0 = lists.get(i);
                 List<Integer> list1 = lists.get(j);
                 if (areOne(list0, list1)) {
                     count++;
                 }
             }
         }
         return count;
     }

     private static boolean areOne(List<Integer> list0, List<Integer> list1) {
         int n = list0.size();
         for (int i = 0; i < n; i++) {
             int v0 = list0.get(i);
             int v1 = list1.get(i);
             if (v0 == 0 && v1 == 0) {
                 return false;
             }
         }
         return true;
     }

     // For testing
     private static Random random = new Random(0);

     private static List<List<Integer>> generateRandomInput(int m, int n) {
         List<List<Integer>> rows = new ArrayList<List<Integer>>();
         for (int i = 0; i < n; i++) {
             List<Integer> row = new ArrayList<Integer>();
             for (int j = 0; j < m; j++) {
                 row.add(random.nextInt(2));
             }
             rows.add(row);
         }
         return rows;
     }
 }
+2

Here is an algorithm that exploits the fact that two rows with a zero in the same column are automatically disqualified as partners. The fewer zeros the current row has, the fewer columns we scan for it; and the more zeros there are overall, the faster the candidate set shrinks and the fewer other rows we end up visiting.

 create two sets, one containing the indexes of all rows, and the other empty
 assign a variable, total = 0

Iterate through each row from right to left, from the bottom row to the top (it could be in a different order as well, I just pictured it that way).

 while row i is not the first row:
     call the non-empty set A and the empty set dont_match
     remove i, the index of the current row, from A
     traverse row i:
         if A is empty:
             stop the traversal
         if a zero is encountered:
             traverse up that column, visiting only rows listed in A:
                 if a zero is encountered:
                     move that row index from A to dont_match
     the remaining indexes in A point to the partner rows of row i
     add their count to total and move the elements from the shorter of A and dont_match to the other set

 return total
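A possible Java transcription of that pseudocode (my reading of it, so the details may not match the author's exact intent): candidates plays the role of the non-empty set A, dontMatch collects the rows disqualified by a shared zero column, and the two sets are merged again before the next row is processed.

 import java.util.HashSet;
 import java.util.Iterator;
 import java.util.Set;

 public class ZeroColumnScan {

     static int countPairs(int[][] rows, int m) {
         int n = rows.length;
         Set<Integer> candidates = new HashSet<>();   // the non-empty set A
         Set<Integer> dontMatch = new HashSet<>();    // disqualified for the current row
         for (int i = 0; i < n; i++) candidates.add(i);

         int total = 0;
         for (int i = n - 1; i >= 1; i--) {           // bottom row up to (not including) the first
             candidates.remove(i);                    // a row is never paired with itself
             for (int c = m - 1; c >= 0 && !candidates.isEmpty(); c--) {   // right to left
                 if (rows[i][c] != 0) continue;
                 // Every remaining candidate that also has a 0 here is disqualified.
                 Iterator<Integer> it = candidates.iterator();
                 while (it.hasNext()) {
                     int j = it.next();
                     if (rows[j][c] == 0) {
                         it.remove();
                         dontMatch.add(j);
                     }
                 }
             }
             total += candidates.size();              // the survivors are partners of row i
             // Merge the smaller set into the larger one before the next iteration.
             if (candidates.size() < dontMatch.size()) {
                 dontMatch.addAll(candidates);
                 candidates.clear();
                 Set<Integer> tmp = candidates;
                 candidates = dontMatch;
                 dontMatch = tmp;
             } else {
                 candidates.addAll(dontMatch);
                 dontMatch.clear();
             }
         }
         return total;
     }

     public static void main(String[] args) {
         int[][] rows = { {1, 0, 1, 0, 1}, {0, 1, 0, 0, 1}, {1, 1, 1, 1, 0} };
         System.out.println(countPairs(rows, 5));     // 2 for the question's example
     }
 }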
0
