Extrapolate Values ​​to Pandas DataFrame

It is very easy to interpolate NaN cells in a Pandas DataFrame:

In [98]: df Out[98]: neg neu pos avg 250 0.508475 0.527027 0.641292 0.558931 500 NaN NaN NaN NaN 1000 0.650000 0.571429 0.653983 0.625137 2000 NaN NaN NaN NaN 3000 0.619718 0.663158 0.665468 0.649448 4000 NaN NaN NaN NaN 6000 NaN NaN NaN NaN 8000 NaN NaN NaN NaN 10000 NaN NaN NaN NaN 20000 NaN NaN NaN NaN 30000 NaN NaN NaN NaN 50000 NaN NaN NaN NaN [12 rows x 4 columns] In [99]: df.interpolate(method='nearest', axis=0) Out[99]: neg neu pos avg 250 0.508475 0.527027 0.641292 0.558931 500 0.508475 0.527027 0.641292 0.558931 1000 0.650000 0.571429 0.653983 0.625137 2000 0.650000 0.571429 0.653983 0.625137 3000 0.619718 0.663158 0.665468 0.649448 4000 NaN NaN NaN NaN 6000 NaN NaN NaN NaN 8000 NaN NaN NaN NaN 10000 NaN NaN NaN NaN 20000 NaN NaN NaN NaN 30000 NaN NaN NaN NaN 50000 NaN NaN NaN NaN [12 rows x 4 columns] 

I would also like for him to extrapolate NaN values ​​that are outside the interpolation region using this method. How can i do this?

+8
python pandas extrapolation
source share
2 answers

Extrapolating Pandas DataFrame s

DataFrame can be extrapolated, however in Pandas there is no simple method call and another library is required (for example, scipy.optimize ).

Extrapolation

Extrapolation generally requires some extrapolations to assume data assumptions . One way is to bind a curve to a regular parametrized equation to data to find parameter values ​​that best describe existing data, which are then used to calculate values ​​that go beyond that data. The difficult and limiting problem with this approach is that when choosing a parametrized equation, some assumption about the trend should be made. This can be found through trial and error with different equations to give the desired result, or sometimes it can be inferred from the data source. The data provided in the question are actually not large enough for the data set to obtain a curve of a suitable well; however, this is good enough to illustrate.

Below is an example of extrapolating a DataFrame using a polynomial of order 3 rd

f (x) = ax 3 + bx 2 + cx + d (Eq. 1)

This common function ( func() ) corresponds to a curve for each column to obtain unique column-specific parameters (i.e. a, b, c, d). These parameterized equations are then used to extrapolate the data in each column for all indices with NaN s.

 import pandas as pd from cStringIO import StringIO from scipy.optimize import curve_fit df = pd.read_table(StringIO(''' neg neu pos avg 0 NaN NaN NaN NaN 250 0.508475 0.527027 0.641292 0.558931 500 NaN NaN NaN NaN 1000 0.650000 0.571429 0.653983 0.625137 2000 NaN NaN NaN NaN 3000 0.619718 0.663158 0.665468 0.649448 4000 NaN NaN NaN NaN 6000 NaN NaN NaN NaN 8000 NaN NaN NaN NaN 10000 NaN NaN NaN NaN 20000 NaN NaN NaN NaN 30000 NaN NaN NaN NaN 50000 NaN NaN NaN NaN'''), sep='\s+') # Do the original interpolation df.interpolate(method='nearest', xis=0, inplace=True) # Display result print 'Interpolated data:' print df print # Function to curve fit to the data def func(x, a, b, c, d): return a * (x ** 3) + b * (x ** 2) + c * x + d # Initial parameter guess, just to kick off the optimization guess = (0.5, 0.5, 0.5, 0.5) # Create copy of data to remove NaNs for curve fitting fit_df = df.dropna() # Place to store function parameters for each column col_params = {} # Curve fit each column for col in fit_df.columns: # Get x & y x = fit_df.index.astype(float).values y = fit_df[col].values # Curve fit column and get curve parameters params = curve_fit(func, x, y, guess) # Store optimized parameters col_params[col] = params[0] # Extrapolate each column for col in df.columns: # Get the index values for NaNs in the column x = df[pd.isnull(df[col])].index.astype(float).values # Extrapolate those points with the fitted function df[col][x] = func(x, *col_params[col]) # Display result print 'Extrapolated data:' print df print print 'Data was extrapolated with these column functions:' for col in col_params: print 'f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]) 

Extrapolating Results

 Interpolated data: neg neu pos avg 0 NaN NaN NaN NaN 250 0.508475 0.527027 0.641292 0.558931 500 0.508475 0.527027 0.641292 0.558931 1000 0.650000 0.571429 0.653983 0.625137 2000 0.650000 0.571429 0.653983 0.625137 3000 0.619718 0.663158 0.665468 0.649448 4000 NaN NaN NaN NaN 6000 NaN NaN NaN NaN 8000 NaN NaN NaN NaN 10000 NaN NaN NaN NaN 20000 NaN NaN NaN NaN 30000 NaN NaN NaN NaN 50000 NaN NaN NaN NaN Extrapolated data: neg neu pos avg 0 0.411206 0.486983 0.631233 0.509807 250 0.508475 0.527027 0.641292 0.558931 500 0.508475 0.527027 0.641292 0.558931 1000 0.650000 0.571429 0.653983 0.625137 2000 0.650000 0.571429 0.653983 0.625137 3000 0.619718 0.663158 0.665468 0.649448 4000 0.621036 0.969232 0.708464 0.766245 6000 1.197762 2.799529 0.991552 1.662954 8000 3.281869 7.191776 1.702860 4.058855 10000 7.767992 15.272849 3.041316 8.694096 20000 97.540944 150.451269 26.103320 91.365599 30000 381.559069 546.881749 94.683310 341.042883 50000 1979.646859 2686.936912 467.861511 1711.489069 Data was extrapolated with these column functions: f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112 f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870 f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098 f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312 

Plot for avg column

Extrapolated data

Without a large dataset or knowing the data source, this result may be completely wrong, but it should demonstrate the process of extrapolating a DataFrame . The proposed equation in func() will probably need to be played in order to get the correct extrapolation. In addition, no attempt was made to make the code efficient.

Update:

If your index is not numeric, like DatetimeIndex , see this answer for how to extrapolate it.

+10
source share
 import pandas as pd try: # for Python2 from cStringIO import StringIO except ImportError: # for Python3 from io import StringIO df = pd.read_table(StringIO(''' neg neu pos avg 0 NaN NaN NaN NaN 250 0.508475 0.527027 0.641292 0.558931 999 NaN NaN NaN NaN 1000 0.650000 0.571429 0.653983 0.625137 2000 NaN NaN NaN NaN 3000 0.619718 0.663158 0.665468 0.649448 4000 NaN NaN NaN NaN 6000 NaN NaN NaN NaN 8000 NaN NaN NaN NaN 10000 NaN NaN NaN NaN 20000 NaN NaN NaN NaN 30000 NaN NaN NaN NaN 50000 NaN NaN NaN NaN'''), sep='\s+') print(df.interpolate(method='nearest', axis=0).ffill().bfill()) 

gives

  neg neu pos avg 0 0.508475 0.527027 0.641292 0.558931 250 0.508475 0.527027 0.641292 0.558931 999 0.650000 0.571429 0.653983 0.625137 1000 0.650000 0.571429 0.653983 0.625137 2000 0.650000 0.571429 0.653983 0.625137 3000 0.619718 0.663158 0.665468 0.649448 4000 0.619718 0.663158 0.665468 0.649448 6000 0.619718 0.663158 0.665468 0.649448 8000 0.619718 0.663158 0.665468 0.649448 10000 0.619718 0.663158 0.665468 0.649448 20000 0.619718 0.663158 0.665468 0.649448 30000 0.619718 0.663158 0.665468 0.649448 50000 0.619718 0.663158 0.665468 0.649448 

Note. I modified df bit to show how interpolating with nearest is different from df.fillna . (See Line 999.)

I also added a NaN string with index 0 to show that bfill() might also be necessary.

+5
source share

All Articles