This is where the dynamic programming approach should work.
Suppose first that cost(clustering) is simply the sum of cost(cluster) over all the clusters that make up the clustering.
Then a simple DP function is defined as follows:
F[i] = minimal cost of clustering the substring s[0:i]
and calculated as follows:
for i = 0..length(s)-1:
    for j = a..b:
        last_cluster = s[i-j+1..i]
        F[i] = min(F[i], F[i-j] + cost(last_cluster))
Of course, you must first initialize the values of F with infinity (or zero where appropriate) so that the min function works correctly.
To actually restore the answer, you can additionally store values P[i] containing the length of the last cluster in the optimal clustering of the string s[0..i]. Whenever you update F[i], update P[i] as well.
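The simple DP above can be sketched in Python as follows. The function name, the cost parameter, and the size bounds a and b are stand-ins matching the pseudocode, not part of the original problem statement:

```python
# Sketch of the simple DP. cost() is an arbitrary per-cluster cost function;
# a..b are the allowed cluster sizes.

def simple_cluster_dp(s, a, b, cost):
    INF = float('inf')
    n = len(s)
    F = [INF] * n   # F[i] = minimal cost of clustering s[0..i]
    P = [0] * n     # P[i] = length of the last cluster in that optimum
    for i in range(n):
        for j in range(a, b + 1):
            if i - j + 1 < 0:       # cluster would start before the string
                continue
            last_cluster = s[i - j + 1:i + 1]
            prev = F[i - j] if i - j >= 0 else 0  # i-j == -1: first cluster
            if prev + cost(last_cluster) < F[i]:
                F[i] = prev + cost(last_cluster)
                P[i] = j
    return F, P

# With a toy cost of 1 per cluster, F[i] counts the minimum number of clusters:
F, P = simple_cluster_dp("abcde", 2, 3, lambda c: 1)
# F[4] == 2, e.g. "abc" + "de"
```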
Restoring the answer is then straightforward:
current_pos = length(s) - 1
while (current_pos >= 0):
    current_cluster_length = P[current_pos]
    current_cluster = s[(current_pos - current_cluster_length + 1)..current_pos]
    current_pos -= current_cluster_length
Note that this approach yields the clusters in reverse order, i.e. from the last cluster to the first.
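A minimal restore sketch in Python, assuming P was filled as described; the P values below are hard-coded for illustration (they correspond to the string "abcde" with cluster sizes in [2, 3]):

```python
def restore_clusters(s, P):
    # P[i] = length of the last cluster in the optimal clustering of s[0..i]
    clusters = []
    pos = len(s) - 1
    while pos >= 0:
        j = P[pos]
        clusters.append(s[pos - j + 1:pos + 1])
        pos -= j
    clusters.reverse()   # clusters come out last-to-first, so flip them
    return clusters

# Hypothetical P for s = "abcde" with cluster sizes in [2, 3]:
print(restore_clusters("abcde", [0, 2, 3, 2, 2]))  # ['abc', 'de']
```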
Now let us apply this idea to the original problem. We would like cost(clustering) to be more or less linear, so that we can calculate it cluster by cluster rather than having to evaluate it for the entire clustering at once.
The first parameter of our DP function F will be, as before, i: the length of the prefix s[0:i] for which we have already found the optimal answer. The meaning of F is, as usual, the minimum cost achievable with the given parameters.
The term e = mean(cluster_entropies) of the cost function is already linear and can be calculated cluster by cluster, so it poses no problem.
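For concreteness, here is a sketch of a per-cluster entropy function, assuming e uses the Shannon entropy of the character frequencies inside a cluster (the exact definition depends on the original problem statement):

```python
from collections import Counter
from math import log2

def cluster_entropy(cluster):
    # Shannon entropy of the character distribution within one cluster.
    # The e term of the cost is the mean of these values, so each cluster
    # contributes cluster_entropy(cluster) / n to the total.
    counts = Counter(cluster)
    total = len(cluster)
    return -sum(c / total * log2(c / total) for c in counts.values())

print(cluster_entropy("aabb"))  # 1.0: two symbols, each with frequency 0.5
```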
The term l = variance(cluster_lengths) is a bit more complicated. The variance of n values is defined as Sum[(x[i] - mean)^2] / n, where mean is the expected value, mean = Sum[x[i]] / n. Note that Sum[x[i]] is the sum of the lengths of all clusters, which in our case is always fixed and equal to length(s). Therefore mean = length(s) / n. This makes the l term of the cost function more or less linear as well, except for the parameter n. So we add this parameter, the number of clusters in the desired clustering, as a parameter of our function F. We will also need a parameter cur, indicating the number of clusters collected so far in the current state.
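A quick numeric check of this observation: once mean is fixed to length(s) / n, the variance splits into independent per-cluster contributions of (sz - length(s)/n)^2 / n. The cluster sizes below are hypothetical:

```python
lengths = [2, 3, 2]                 # cluster sizes; length(s) = 7, n = 3
L, n = sum(lengths), len(lengths)
# each cluster contributes (sz - L/n)^2 / n, independently of the others
per_cluster = [(sz - L / n) ** 2 / n for sz in lengths]
mean = L / n
variance = sum((sz - mean) ** 2 for sz in lengths) / n
assert abs(sum(per_cluster) - variance) < 1e-12
```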
The term d of the cost function also requires an additional parameter in our DP function F, namely sz, the size of the last cluster in the current partition.
Altogether we arrive at the DP function F[i][n][cur][sz], which gives the minimum cost of splitting the prefix s[0:i] into n clusters, of which cur are currently built, with the size of the last cluster equal to sz. Of course, it is our responsibility to ensure that a <= sz <= b. The final answer is the minimum of F[length(s)-1][n][n][sz] over all possible values of n and all a <= sz <= b.

Notice that this time we don't even need the companion array P to store the length of the last cluster, since that information is already included as the last parameter sz of F. However, we will store in P[i][n][cur][sz] the length of the next-to-last cluster in the optimal clustering with the given parameters, and use it to restore the solution. Assuming the minimum of F is achieved at n = n0 and sz = sz0, the answer is restored as follows:
current_pos = length(s) - 1
current_n = n0
current_cluster_size = sz0
while (current_n > 0):
    current_cluster = s[(current_pos - current_cluster_size + 1)..current_pos]
    next_cluster_size = P[current_pos][n0][current_n][current_cluster_size]
    current_n--
    current_pos -= current_cluster_size
    current_cluster_size = next_cluster_size
Now we proceed to the calculation of F. I will omit corner cases and range checks; for those, it is enough to initialize F with infinite values.
// initialize for the case of one cluster
// d = 0, l = 0, only have to calculate entropy
for i=0..length(s)-1:
    for n=1..length(s):
        F[i][n][1][i+1] = cluster_entropy(s[0..i])
        P[i][n][1][i+1] = -1  // fake value: there is no previous cluster

// general case computation
for i=0..length(s)-1:
    for n=1..length(s):
        for cur=2..n:
            for sz=a..b:
                for prev_sz=a..b:
                    cur_cluster = s[i-sz+1..i]
                    prev_cluster = s[i-sz-prev_sz+1..i-sz]
                    F[i][n][cur][sz] = min(F[i][n][cur][sz],
                        F[i-sz][n][cur-1][prev_sz]
                        + gamma * calc_d(prev_cluster, cur_cluster)
                        + beta * cluster_entropy(cur_cluster) / n
                        + alpha * (sz - length(s)/n)^2)
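Putting it all together, here is a runnable Python sketch of the full DP. cluster_entropy and calc_d are stand-ins (Shannon entropy and a made-up set-difference distance, since the real definitions come from the original problem), the weights alpha, beta, gamma are assumed parameters, and dictionaries keyed by (i, n, cur, sz) keep the sketch short:

```python
from collections import Counter
from math import log2, inf

def cluster_entropy(cluster):
    # Shannon entropy of character frequencies (assumed definition of e)
    counts = Counter(cluster)
    total = len(cluster)
    return -sum(c / total * log2(c / total) for c in counts.values())

def calc_d(prev_cluster, cur_cluster):
    # hypothetical inter-cluster distance: size of the symmetric
    # difference of the two character sets (stand-in for the real d)
    return len(set(prev_cluster) ^ set(cur_cluster))

def cluster_dp_full(s, a, b, alpha=1.0, beta=1.0, gamma=1.0):
    L = len(s)
    F, P = {}, {}   # both keyed by (i, n, cur, sz)
    # base case: the whole prefix s[0..i] is the single cluster built so far
    for i in range(a - 1, min(b, L)):
        for n in range(1, L + 1):
            F[(i, n, 1, i + 1)] = cluster_entropy(s[:i + 1])
            P[(i, n, 1, i + 1)] = -1      # fake value: no previous cluster
    # general case, following the pseudocode above
    for i in range(L):
        for n in range(2, L + 1):
            for cur in range(2, n + 1):
                for sz in range(a, b + 1):
                    for prev_sz in range(a, b + 1):
                        prev_state = (i - sz, n, cur - 1, prev_sz)
                        if prev_state not in F:
                            continue
                        cur_cluster = s[i - sz + 1:i + 1]
                        prev_cluster = s[i - sz - prev_sz + 1:i - sz + 1]
                        cost = (F[prev_state]
                                + gamma * calc_d(prev_cluster, cur_cluster)
                                + beta * cluster_entropy(cur_cluster) / n
                                + alpha * (sz - L / n) ** 2)
                        key = (i, n, cur, sz)
                        if cost < F.get(key, inf):
                            F[key] = cost
                            P[key] = prev_sz
    # the answer: minimum over all n and all a <= sz <= b with cur == n
    best, best_n, best_sz = inf, None, None
    for n in range(1, L + 1):
        for sz in range(a, b + 1):
            v = F.get((L - 1, n, n, sz), inf)
            if v < best:
                best, best_n, best_sz = v, n, sz
    # restore the clustering from P (clusters come out last-to-first)
    clusters, pos, cur, sz = [], L - 1, best_n, best_sz
    while cur > 0:
        clusters.append(s[pos - sz + 1:pos + 1])
        sz, pos, cur = P[(pos, best_n, cur, sz)], pos - sz, cur - 1
    return best, clusters[::-1]
```

Note that the restore loop uses P exactly as described above: at each step it reads the previous cluster's size from the current state before stepping back.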