Find many lines of maximum size that are far apart

Given the set S of binary strings of the same length, what is a quick way to find a subset S 'of maximum size with the property that the Hamming distance between each pair of strings in S' is at least d?

The following function calculates the Hamming distance between two lines.

def hamdist(str1, str2): """Count the # of differences between equal length strings str1 and str2""" if (len(str1) != len(str2)): print str1, str2, "Length mismatch!" quit() diffs = 0 for ch1, ch2 in itertools.izip(str1, str2): if ch1 != ch2: diffs += 1 return diffs 
+4
source share
2 answers

As Falk notes, to prove that this problem is NP-hard, we need a reduction from NP-hard problem. I am going to use the problem of finding an independent set (i.e., finding a clique in the complement graph).

My reduction has two stages. The output of the first stage is a generic instance with a non-binary alphabet. At the output of the second stage, the binary alphabet is used. Let G = (V, E) be the graph in which we try to find a large independent set. The output signal of the first stage is | V | words of length | E |. Let e ​​= (v, w) be the edge i th . The letters in i th position of each word are different, with the exception of words corresponding to v and w, which contain a letter there. Thus, the size of the alphabet is | V | - 1. Two vertices are not adjacent if and only if their words are at the maximum distance, so we set the distance threshold to | V |.

In the second stage, we replace each letter with one of | V | - 1 word of length | V | - 1, which has 1 "1" and (| V | - 2) "0" s and doubles the distance threshold to 2 | V |.

To allow small instances, I would recommend using Sam's shortcut to the maximum click problem and run the exponential-time Bron - Kerbosch algorithm to list all the maximum clicks from which you can choose the maximum. (Why B - K, and not faster algorithms? The latter are more complex and will not reach you very far.)

+2
source

To expand Yegor’s comment:

Imagine you have a graph G that has one vertex for each row in S Now, for each pair of vertices v1 , v2 find the Hamming distance between the corresponding rows s1 , s2 . If it is greater than d , add an edge to G between v1 and v2 .

Now you want to find the maximum subset S such that each pair of rows in this maximum subset has a Hamming distance of at least d between them. The corresponding task on the constructed graph is to find the maximum set of vertices such that every vertex in this set is connected to every other vertex in this set.

This is a maximum click problem . If you follow this link to the wikipedia article, you will find that the most famous algorithm for solving it works in O(1.2599^n) time, which is exponential and therefore not fast. If you could quickly solve your problem (that is, in polynomial time), then you could solve the problem of maximum click in polynomial time using this correspondence, so there is no quick solution to your problem.

+3
source

All Articles