Segmented Least Squares

Give an algorithm that takes as input a sequence of points in the plane (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and an integer k, and returns the best piecewise-linear function f, consisting of at most k pieces, that minimizes the sum of squared errors. You may assume access to an algorithm that computes the squared error of a single segment through a set of n points in Θ(n) time. Your solution should run in O(n^2 k) time and use O(nk) space.
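For concreteness, here is how I read the quantity to be minimized (my own formalization, not wording from the assignment): with the points sorted by x, the function made of k' <= k segments, segment p covering the consecutive points i_p .. j_p and having least-squares slope a_p and intercept b_p,

\min_{k' \le k} \; \sum_{p=1}^{k'} \; \sum_{t=i_p}^{j_p} \bigl( y_t - (a_p x_t + b_p) \bigr)^2 .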

Can someone help me with this problem? Thank you very much!

+6
2 answers

(This is too late for your homework, but I hope it helps all the same.)
First, here is dynamic programming in Python/NumPy for k = 4 only, to help you see how dynamic programming works; once you understand that, writing a loop for any k should be easy (a sketch of such a loop follows the code below).
Also, Cost[] here is an n x n matrix, so the space is O(n^2); see the notes at the end on getting down to O(nk) space.

#!/usr/bin/env python
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4 """

from __future__ import division
import numpy as np

__version__ = "2014-03-09 mar denis"

#...............................................................................
def split4( Cost, verbose=1 ):
    """ split4.py: min-cost split into 4 pieces, dynamic programming k=4
        min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
        Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
            or error in lsq horizontal lines, sum (y_j - av y)^2 for each piece --
                o--  o-  o---  o----
                |    |   |     |
                0    2   5     9
        (Why 4 ? to walk through step by step, then put it in a loop)
    """
    # speedup: maxlen 2 n/k or so
    Cost = np.asanyarray(Cost)
    n = Cost.shape[1]
    # C2 C3 ... costs, J2 J3 ... indices of best splits
    J2 = - np.ones(n, dtype=int)  # -1, NaN mark undefined / bug
    C2 = np.ones(n) * np.NaN
    J3 = - np.ones(n, dtype=int)
    C3 = np.ones(n) * np.NaN

    # best 2-splits of the left 2 3 4 ...
    for nleft in range( 1, n ):
        J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft] for j in range( 1, nleft+1 )]) + 1
        C2[nleft] = Cost[0,j-1] + Cost[j,nleft]
            # an idiom for argmin j, min value c together

    # best 3-splits of the left 3 4 5 ...
    for nleft in range( 2, n ):
        J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft] for j in range( 2, nleft+1 )]) + 2
        C3[nleft] = C2[j-1] + Cost[j,nleft]

    # best 4-split of all n --
    j4 = np.argmin([ C3[j-1] + Cost[j,n-1] for j in range( 3, n )]) + 3
    c4 = C3[j4-1] + Cost[j4,n-1]
    j3 = J3[j4]
    j2 = J2[j3]
    jsplit = np.array([ 0, j2, j3, j4, n ])
    if verbose:
        print "split4: len %s  pos %s  cost %.3g" % (np.diff(jsplit), jsplit, c4)
        print "split4: J2 %s  C2 %s" % (J2, C2)
        print "split4: J3 %s  C3 %s" % (J3, C3)
    return jsplit

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    import spread

    n = 10
    ncycle = 2
    plot = 0
    seed = 0
    # run this.py a=1 b=None c=[3] 'd = expr' ... in sh or ipython
    for arg in sys.argv[1:]:
        exec( arg )
    np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "\n", 80 * "-"
    title = "Dynamic programming least-square horizontal lines  %s  n %d  seed %d" % (
        __file__, n, seed )
    print title
    x = np.arange( n + 0. )
    y = np.sin( 2*np.pi * x * ncycle / n )  # synthetic time series ?
    print "y: %s  av %.3g  variance %.3g" % (y, y.mean(), np.var(y))
    print "Cost[j,k] = sum (y - av y)^2 --"  # len * var y[j:k+1]
    Cost = spread.spreads_allij( y )
    print Cost  # .round().astype(int)
    jsplit = split4( Cost )  # split4: len [3 2 3 2]  pos [ 0  3  5  8 10]

    if plot:
        import matplotlib.pyplot as pl
        title += "\n lengths: %s" % np.diff(jsplit)
        pl.title( title )
        pl.plot( y )
        for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
            if js1 <= js:  continue
            yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
            pl.plot( np.arange( js, js1 + 1 ), yav )
        # pl.legend()
        pl.show()
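Below is a minimal sketch, my addition rather than part of split4.py above, of what the loop for general k might look like. It assumes the same precomputed Cost[i,j] (error of one segment over points i .. j inclusive, i <= j), splits into exactly k pieces as split4 does, and assumes n >= k. The tables C and J are (k+1) x n, so the DP itself is O(n^2 k) time and O(nk) space.

import numpy as np

def split_k( Cost, k ):
    """ best split of points 0 .. n-1 into exactly k pieces; returns breakpoints [0, ..., n] """
    Cost = np.asanyarray(Cost)
    n = Cost.shape[1]
    C = np.full( (k+1, n), np.inf )      # C[m,i] = best cost of covering points 0..i with m pieces
    J = np.zeros( (k+1, n), dtype=int )  # J[m,i] = start index of the last of those m pieces
    C[1,:] = Cost[0,:]                   # one piece covering 0..i
    for m in range( 2, k+1 ):
        for i in range( m-1, n ):        # need at least m points for m pieces
            # last piece is j..i, the previous m-1 pieces cover 0..j-1
            cands = [ C[m-1, j-1] + Cost[j,i] for j in range( m-1, i+1 )]
            jbest = int(np.argmin( cands )) + (m-1)
            C[m,i] = cands[jbest - (m-1)]
            J[m,i] = jbest
    # trace the breakpoints back from the full sequence
    splits = [n]
    i, m = n-1, k
    while m > 1:
        j = J[m,i]
        splits.append(j)
        i, m = j-1, m-1
    splits.append(0)
    return np.array( splits[::-1] )

On the n = 10 example in the __main__ block above, split_k(Cost, 4) should reproduce split4's pos [ 0 3 5 8 10]; the inner list comprehension is the O(n) minimization that the O(n^2 k) budget allows for each of the roughly nk table entries.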

The following code (save it as spread.py, so that split4.py above can import it) computes Cost[] for horizontal lines only, slope 0; extending it to least-squares segments of any slope, in O(n) time, is left as an exercise (one possible approach is sketched right after the code).

 """ spreads( all y[:j] ) in time O(n) define spread( y[] ) = sum (y - average y)^2 eg spread of 24 hourly temperatures y[0:24] ie y[0] .. y[23] around a horizontal line at the average temperature (spread = 0 for constant temperature, 24 c^2 for constant + [c -cc -c ...], 24 * variance(y) ) How fast can one compute all 24 spreads 1 hour (midnight to 1 am), 2 hours ... all 24 ? A simpler problem: compute all 24 averages in time O(n): N = np.arange( 1, len(y)+1 ) allav = np.cumsum(y) / N = [ y0, (y0 + y1) / 2, (y0 + y1 + y2) / 3 ...] An identity: spread(y) = sum(y^2) - n * (av y)^2 Voila: the code below, all spreads() in time O(n). Exercise: extend this to spreads around least-squares lines fit to [ y0, [y0 y1], [y0 y1 y2] ... ], not just horizontal lines. """ from __future__ import division import sys import numpy as np #............................................................................... def spreads( y ): """ [ spread y[:1], spread y[:2] ... spread y ] in time O(n) where spread( y[] ) = sum (y - average y )^2 = n * variance(y) """ N = np.arange( 1, len(y)+1 ) return np.cumsum( y**2 ) - np.cumsum( y )**2 / N def spreads_allij( y ): """ -> A[i,j] = sum (y - av y)^2, spread of y around its average for all y[i:j+1] time, space O(n^2) """ y = np.asanyarray( y, dtype=float ) n = len(y) A = np.zeros((n,n)) for i in range(n): A[i,i:] = spreads( y[i:] ) return A 

So far we have had an n x n cost matrix, i.e. O(n^2) space. To get down to O(nk) space, look carefully at the pattern in which the dynamic-programming code accesses Cost[i,j]:

for nleft .. to n:
    Cost_nleft = Cost[j,nleft]            -- time nleft, or nleft^2
    for k in 3 4 5 ...:
        min [ C[k-1, j-1] + Cost_nleft[j] for j .. to nleft ]

Here Cost_nleft is one row of the full n x n cost matrix, about n segment costs generated as needed; for line segments this can be done in O(n) time. But if, as the problem statement says, "the error of one segment through a set of n points takes Θ(n) time", it looks as though we are up at O(n^3) time. Comments, anyone?
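To make that access pattern concrete, here is a small sketch of my own (horizontal lines only, reusing the spreads identity from spread.py) that generates one row Cost_nleft in O(nleft) time, so the full n x n matrix never needs to be stored:

import numpy as np

def cost_row( y, nleft ):
    """ Cost_nleft[j] = spread of y[j : nleft+1], for j = 0 .. nleft, in O(nleft) time """
    seg = np.asarray( y, dtype=float )[:nleft+1][::-1]   # reversed, so cumsum runs right-to-left
    N = np.arange( 1, len(seg) + 1 )
    sp = np.cumsum( seg**2 ) - np.cumsum( seg )**2 / N   # same identity as spreads()
    return sp[::-1]                                      # index j = left endpoint of the segment

With rows generated like this, the DP tables are the only O(nk) storage and the total time stays O(n^2 k) for horizontal lines. The O(n^3) worry seems to arise only if every candidate segment is really handed to the Θ(n) black-box oracle of the problem statement; with an O(1)-per-segment cost such as the prefix-sum sketches above, it goes away.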

+3

If you can compute the least-squares fit for a single segment in n^2 time, it is easy to do what you want in n^2 k^2 with dynamic programming. You may be able to optimize this down to a single factor of k.

-1
