Fit a curve to the border of a scatter plot

I am trying to fit a curve to the border of the scatter plot of a diffused screen. See this image for reference.

I have already performed the fit with the following (simplified) code. It cuts the data into narrow vertical stripes of width width and then finds the minimum value within each stripe, ignoring NaNs. (The function decreases monotonically.)

    def func(val):
        """ returns some function of 'val' """
        return val * 2

    for i in range(0, max_val, width):
        _df = df[(df.val > i) & (df.val < i + width)]   # vertical slice
        if np.isnan(np.min(func(_df.val))):             # skip stripes that are all NaN
            continue
        xs.append(i + width)
        ys.append(np.min(func(_df.val)))

Then I fit a model to the extracted points with scipy.optimize.curve_fit. My question is: is there a more natural or Pythonic way to do this, and is there any way to improve the accuracy (for example, by giving more weight to the regions of the scatter plot with a higher density of points)?
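For the weighting idea, here is a minimal sketch of what that last step could look like. It assumes the xs/ys lists from the loop above plus a counts list holding the number of points per stripe (e.g. len(_df)); the edge_model shape, the p0 values and the toy arrays are illustrative placeholders, not part of the original post:

    import numpy as np
    from scipy.optimize import curve_fit

    def edge_model(x, a, b, c):
        # hypothetical edge shape -- substitute whatever describes your border
        return a / np.sqrt(x + b) + c

    # xs, ys: stripe positions and stripe minima from the loop above;
    # counts: how many points fell into each stripe, e.g. len(_df)
    xs = np.linspace(100, 1900, 10)
    ys = 70.0 / np.sqrt(xs + 50.0) - 4.0                  # stand-in for real stripe minima
    counts = np.array([30, 80, 200, 450, 500, 400, 250, 120, 60, 20])

    # denser stripes pin the edge down better, so give them smaller
    # uncertainties: curve_fit weights each residual by 1 / sigma
    sigma = 1.0 / np.sqrt(counts)

    popt, pcov = curve_fit(edge_model, xs, ys, p0=[50.0, 10.0, 0.0], sigma=sigma)
    print(popt)   # recovers roughly a=70, b=50, c=-4 for the toy data

Passing sigma = 1/sqrt(counts) simply tells curve_fit to trust well-populated stripes more; whether this actually improves the edge estimate depends on your data.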

2 answers

I found the problem really interesting, so I decided to give it a try. I don't know about pythonic or natural, but I think I found a more accurate way to fit a dataset like yours using information from every point.

First of all, let's generate random data that looks like the one shown. This part can easily be skipped; I include it only so that the code is complete and reproducible. I used two two-dimensional normal distributions to simulate the overdensities and sprinkled them with a layer of uniformly distributed random points. These were then put on top of an edge curve similar to yours, everything below the curve was cut off, and the final result looks like this (figure: the generated mock data).

Here is a snippet of code to do this:

    import numpy as np

    x_res = 1000
    x_data = np.linspace(0, 2000, x_res)

    # true parameters and a function that takes them
    true_pars = [80, 70, -5]
    model = lambda x, a, b, c: (a / np.sqrt(x + b) + c)
    y_truth = model(x_data, *true_pars)

    mu_prim, mu_sec = [1750, 0], [450, 1.5]
    cov_prim = [[300**2, 0], [0, 0.2**2]]
    # covariance matrix of the second dist is trickier
    cov_sec = [[200**2, -1], [-1, 1.0**2]]

    prim = np.random.multivariate_normal(mu_prim, cov_prim, x_res*10).T
    sec = np.random.multivariate_normal(mu_sec, cov_sec, x_res*1).T
    uni = np.vstack([x_data, np.random.rand(x_res) * 7])

    # censoring points that will end up below the curve
    prim = prim[np.vstack([[prim[1] > 0], [prim[1] > 0]])].reshape(2, -1)
    sec = sec[np.vstack([[sec[1] > 0], [sec[1] > 0]])].reshape(2, -1)

    # rescaling to data
    for dset in [uni, sec, prim]:
        dset[1] += model(dset[0], *true_pars)

    # this code block generates the figure above:
    import matplotlib.pylab as plt
    plt.figure()
    plt.plot(prim[0], prim[1], '.', alpha=0.1, label='2D Gaussian #1')
    plt.plot(sec[0], sec[1], '.', alpha=0.5, label='2D Gaussian #2')
    plt.plot(uni[0], uni[1], '.', alpha=0.5, label='Uniform')
    plt.plot(x_data, y_truth, 'k:', lw=3, zorder=1.0, label='True edge')
    plt.xlim(0, 2000)
    plt.ylim(-8, 6)
    plt.legend(loc='lower left')
    plt.show()

    # mashing it all together
    dset = np.concatenate([prim, sec, uni], axis=1)

Now that we have the data and the model, we can brainstorm how to fit the edge of the point distribution. Commonly used regression methods, such as the non-linear least squares in scipy.optimize.curve_fit, take the data values y and optimize the free parameters of the model so that the residuals between y and model(x) are minimal. Non-linear least squares is an iterative process that tries to tweak the curve parameters at every step to improve the fit. Clearly, that is exactly what we do not want here, since we want the minimization procedure to take us as far away from the best-fit curve as possible (but not too far away).

Therefore, instead, consider the following function. Rather than simply returning the residual, it also "flips" the points that end up above the currently proposed curve at every iteration step and factors them in as well. This way there are effectively always more points below the curve than above it, which shifts the curve downwards with every iteration! Once the lowest points are reached, the minimum of the function has been found, and with it the edge of the scatter. Of course, this method assumes that you have no outliers below the edge; judging by your figure, that does not seem to be much of a problem.

Here are the functions that implement this idea:

    def get_flipped(y_data, y_model):
        flipped = y_model - y_data
        flipped[flipped > 0] = 0
        return flipped

    def flipped_resid(pars, x, y):
        """
        For every iteration, everything above the currently proposed
        curve is mirrored down, so that the next iteration is going
        to progressively shift downwards.
        """
        y_model = model(x, *pars)
        flipped = get_flipped(y, y_model)
        resid = np.square(y + flipped - y_model)
        # print(pars, resid.sum())  # uncomment to check the iteration parameters
        return np.nan_to_num(resid)

Let's see how this performs on the data above:

    # plotting the mock data
    plt.plot(dset[0], dset[1], '.', alpha=0.2, label='Test data')

    # mask bad data (we accidentally generated some NaN values)
    gmask = np.isfinite(dset[1])
    dset = dset[np.vstack([gmask, gmask])].reshape((2, -1))

    from scipy.optimize import leastsq
    guesses = [100, 100, 0]
    fit_pars, flag = leastsq(func=flipped_resid, x0=guesses,
                             args=(dset[0], dset[1]))

    # plot the fit:
    y_fit = model(x_data, *fit_pars)
    y_guess = model(x_data, *guesses)
    plt.plot(x_data, y_fit, 'r-', zorder=0.9, label='Edge')
    plt.plot(x_data, y_guess, 'g-', zorder=0.9, label='Guess')
    plt.legend(loc='lower left')
    plt.show()

The most important part above is the call to the leastsq function. Be careful with the initial guesses: if a guess does not fall inside the scatter, the model may not converge properly. With an appropriate guess put in...

(figure: the fitted edge and the initial guess plotted over the mock data)

Voila! The edge perfectly matches the real one.
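Not part of the original answer, but a quick sanity check that follows from the method's own assumption (no outliers below the edge): after the fit, hardly any points should lie below the fitted curve. A sketch using the dset, model and fit_pars from above:

    # points sitting below the fitted edge (should be a small fraction)
    below = dset[1] < model(dset[0], *fit_pars)
    print("fraction below the fitted edge: %.3f" % below.mean())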


This is an interesting problem that I am also trying to solve (and implement in Python).

I think that instead of taking the min, it is more robust to take the average of the k lowest (or k highest, depending on the problem) data points in each stripe, and to fit against that average (you should also check that the fitted parameters are stable with respect to the choice of k). You can find this idea used, for example, in the supplementary material of this PNAS paper.
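A minimal sketch of that idea, written against the stripe loop from the question (it reuses the question's df, func, width and max_val; the helper name edge_points_kmean and the default k are illustrative assumptions):

    import numpy as np

    def edge_points_kmean(df, width, max_val, k=10):
        """Stripe loop as in the question, but using the mean of the
        k lowest values per stripe instead of the bare minimum."""
        xs, ys = [], []
        for i in range(0, max_val, width):
            _df = df[(df.val > i) & (df.val < i + width)]   # vertical slice
            vals = np.sort(func(_df.val).dropna().values)   # drop NaNs, sort ascending
            if len(vals) < k:                               # stripe too sparse to be useful
                continue
            xs.append(i + width)
            ys.append(vals[:k].mean())                      # mean of the k lowest points
        return np.array(xs), np.array(ys)

The resulting xs, ys can then go into scipy.optimize.curve_fit exactly as before; repeating the fit for a few values of k (say 5, 10, 20) and comparing the parameters is a cheap way to check that the result is robust.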

