This is where the dynamic programming approach should work.
Suppose first that cost(clustering) is simply the sum of cost(cluster) over all the clusters that make up the clustering.
Then a simple DP function is defined as follows:
F[i] = minimal cost of clustering the substring s[0:i]
and calculated as follows:
for i = 0..length(s)-1:
    for j = a..b:
        last_cluster = s[i-j+1..i]
        F[i] = min(F[i], F[i-j] + cost(last_cluster))
Of course, you must first initialize the values of F with infinity (or zero where appropriate) so that the min function works correctly.
To actually restore the answer, you can additionally store values P[i] containing the length of the last cluster in the optimal clustering of the string s[0..i]. Whenever you update F[i], update P[i] as well.
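The simple DP above can be sketched in Python as follows. The function name, the cost parameter, and the size bounds a and b are stand-ins matching the pseudocode, not part of the original problem statement:

```python
# Sketch of the simple DP. cost() is an arbitrary per-cluster cost function;
# a..b are the allowed cluster sizes.

def simple_cluster_dp(s, a, b, cost):
    INF = float('inf')
    n = len(s)
    F = [INF] * n   # F[i] = minimal cost of clustering s[0..i]
    P = [0] * n     # P[i] = length of the last cluster in that optimum
    for i in range(n):
        for j in range(a, b + 1):
            if i - j + 1 < 0:       # cluster would start before the string
                continue
            last_cluster = s[i - j + 1:i + 1]
            prev = F[i - j] if i - j >= 0 else 0  # i-j == -1: first cluster
            if prev + cost(last_cluster) < F[i]:
                F[i] = prev + cost(last_cluster)
                P[i] = j
    return F, P

# With a toy cost of 1 per cluster, F[i] counts the minimum number of clusters:
F, P = simple_cluster_dp("abcde", 2, 3, lambda c: 1)
# F[4] == 2, e.g. "abc" + "de"
```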
Restoring the answer is then straightforward:
current_pos = length(s) - 1
while (current_pos >= 0):
    current_cluster_length = P[current_pos]
    current_cluster = s[(current_pos - current_cluster_length + 1)..current_pos]
    current_pos -= current_cluster_length
Note that this approach yields the clusters in reverse order, i.e. from the last cluster to the first.
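A minimal restore sketch in Python, assuming P was filled as described; the P values below are hard-coded for illustration (they correspond to the string "abcde" with cluster sizes in [2, 3]):

```python
def restore_clusters(s, P):
    # P[i] = length of the last cluster in the optimal clustering of s[0..i]
    clusters = []
    pos = len(s) - 1
    while pos >= 0:
        j = P[pos]
        clusters.append(s[pos - j + 1:pos + 1])
        pos -= j
    clusters.reverse()   # clusters come out last-to-first, so flip them
    return clusters

# Hypothetical P for s = "abcde" with cluster sizes in [2, 3]:
print(restore_clusters("abcde", [0, 2, 3, 2, 2]))  # ['abc', 'de']
```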
Now let us apply this idea to the original problem. We would like cost(clustering) to be more or less linear, so that we can calculate it cluster by cluster rather than having to evaluate it for the entire clustering at once.
The first parameter of our DP function F will be, as before, i: the length of the prefix s[0:i] for which we have already found the optimal answer. The meaning of F is, as usual, the minimum cost achievable with the given parameters.
The term e = mean(cluster_entropies) of the cost function is already linear and can be calculated cluster by cluster, so it poses no problem.
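For concreteness, here is a sketch of a per-cluster entropy function, assuming e uses the Shannon entropy of the character frequencies inside a cluster (the exact definition depends on the original problem statement):

```python
from collections import Counter
from math import log2

def cluster_entropy(cluster):
    # Shannon entropy of the character distribution within one cluster.
    # The e term of the cost is the mean of these values, so each cluster
    # contributes cluster_entropy(cluster) / n to the total.
    counts = Counter(cluster)
    total = len(cluster)
    return -sum(c / total * log2(c / total) for c in counts.values())

print(cluster_entropy("aabb"))  # 1.0: two symbols, each with frequency 0.5
```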
The term l = variance(cluster_lengths) is a bit more complicated. The variance of n values is defined as Sum[(x[i] - mean)^2] / n, where mean is the expected value, mean = Sum[x[i]] / n. Note that Sum[x[i]] is the sum of the lengths of all clusters, which in our case is always fixed and equal to length(s). Therefore mean = length(s) / n. This makes the l term of the cost function more or less linear as well, except for the parameter n. So we add this parameter, the number of clusters in the desired clustering, as a parameter of our function F. We will also need a parameter cur, indicating the number of clusters collected so far in the current state.
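A quick numeric check of this observation: once mean is fixed to length(s) / n, the variance splits into independent per-cluster contributions of (sz - length(s)/n)^2 / n. The cluster sizes below are hypothetical:

```python
lengths = [2, 3, 2]                 # cluster sizes; length(s) = 7, n = 3
L, n = sum(lengths), len(lengths)
# each cluster contributes (sz - L/n)^2 / n, independently of the others
per_cluster = [(sz - L / n) ** 2 / n for sz in lengths]
mean = L / n
variance = sum((sz - mean) ** 2 for sz in lengths) / n
assert abs(sum(per_cluster) - variance) < 1e-12
```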
The term d of the cost function also requires an additional parameter in our DP function F, namely sz, the size of the last cluster in the current partition.
Altogether we arrive at the DP function F[i][n][cur][sz], which gives the minimum cost of splitting the prefix s[0:i] into n clusters, of which cur are currently built, with the size of the last cluster equal to sz. Of course, it is our responsibility to ensure that a <= sz <= b. The final answer is the minimum of F[length(s)-1][n][n][sz] over all possible values of n and all a <= sz <= b.

Notice that this time we don't even need the companion array P to store the length of the last cluster, since that information is already included as the last parameter sz of F. However, we will store in P[i][n][cur][sz] the length of the next-to-last cluster in the optimal clustering with the given parameters, and use it to restore the solution. Assuming the minimum of F is achieved at n = n0 and sz = sz0, the answer is restored as follows:
current_pos = length(s) - 1
current_n = n0
current_cluster_size = sz0
while (current_n > 0):
    current_cluster = s[(current_pos - current_cluster_size + 1)..current_pos]
    next_cluster_size = P[current_pos][n0][current_n][current_cluster_size]
    current_n--
    current_pos -= current_cluster_size
    current_cluster_size = next_cluster_size
Now we proceed to the calculation of F. I will omit corner cases and range checks; for those, it is enough to initialize F with infinite values.
// initialize for the case of one cluster
// d = 0, l = 0, only have to calculate entropy
for i=0..length(s)-1:
    for n=1..length(s):
        F[i][n][1][i+1] = cluster_entropy(s[0..i])
        P[i][n][1][i+1] = -1  // fake value: there is no previous cluster

// general case computation
for i=0..length(s)-1:
    for n=1..length(s):
        for cur=2..n:
            for sz=a..b:
                for prev_sz=a..b:
                    cur_cluster = s[i-sz+1..i]
                    prev_cluster = s[i-sz-prev_sz+1..i-sz]
                    F[i][n][cur][sz] = min(F[i][n][cur][sz],
                        F[i-sz][n][cur-1][prev_sz]
                        + gamma * calc_d(prev_cluster, cur_cluster)
                        + beta * cluster_entropy(cur_cluster) / n
                        + alpha * (sz - length(s)/n)^2)
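Putting it all together, here is a runnable Python sketch of the full DP. cluster_entropy and calc_d are stand-ins (Shannon entropy and a made-up set-difference distance, since the real definitions come from the original problem), the weights alpha, beta, gamma are assumed parameters, and dictionaries keyed by (i, n, cur, sz) keep the sketch short:

```python
from collections import Counter
from math import log2, inf

def cluster_entropy(cluster):
    # Shannon entropy of character frequencies (assumed definition of e)
    counts = Counter(cluster)
    total = len(cluster)
    return -sum(c / total * log2(c / total) for c in counts.values())

def calc_d(prev_cluster, cur_cluster):
    # hypothetical inter-cluster distance: size of the symmetric
    # difference of the two character sets (stand-in for the real d)
    return len(set(prev_cluster) ^ set(cur_cluster))

def cluster_dp_full(s, a, b, alpha=1.0, beta=1.0, gamma=1.0):
    L = len(s)
    F, P = {}, {}   # both keyed by (i, n, cur, sz)
    # base case: the whole prefix s[0..i] is the single cluster built so far
    for i in range(a - 1, min(b, L)):
        for n in range(1, L + 1):
            F[(i, n, 1, i + 1)] = cluster_entropy(s[:i + 1])
            P[(i, n, 1, i + 1)] = -1      # fake value: no previous cluster
    # general case, following the pseudocode above
    for i in range(L):
        for n in range(2, L + 1):
            for cur in range(2, n + 1):
                for sz in range(a, b + 1):
                    for prev_sz in range(a, b + 1):
                        prev_state = (i - sz, n, cur - 1, prev_sz)
                        if prev_state not in F:
                            continue
                        cur_cluster = s[i - sz + 1:i + 1]
                        prev_cluster = s[i - sz - prev_sz + 1:i - sz + 1]
                        cost = (F[prev_state]
                                + gamma * calc_d(prev_cluster, cur_cluster)
                                + beta * cluster_entropy(cur_cluster) / n
                                + alpha * (sz - L / n) ** 2)
                        key = (i, n, cur, sz)
                        if cost < F.get(key, inf):
                            F[key] = cost
                            P[key] = prev_sz
    # the answer: minimum over all n and all a <= sz <= b with cur == n
    best, best_n, best_sz = inf, None, None
    for n in range(1, L + 1):
        for sz in range(a, b + 1):
            v = F.get((L - 1, n, n, sz), inf)
            if v < best:
                best, best_n, best_sz = v, n, sz
    # restore the clustering from P (clusters come out last-to-first)
    clusters, pos, cur, sz = [], L - 1, best_n, best_sz
    while cur > 0:
        clusters.append(s[pos - sz + 1:pos + 1])
        sz, pos, cur = P[(pos, best_n, cur, sz)], pos - sz, cur - 1
    return best, clusters[::-1]
```

Note that the restore loop uses P exactly as described above: at each step it reads the previous cluster's size from the current state before stepping back.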