Predicting future values from past attendance data

I have several data sets for similar time periods. Each records the attendance of people per day over a period of about a year. The data is not collected at regular intervals, but at random ones: 15-30 records per year, over the past 5 years.

A graph of one year's data looks something like this: [graph made using matplotlib]. I have the data as (datetime.datetime, int) pairs.

Is there any reasonable way to predict how things will look in the future? My initial thought was to take the average of all previous years and predict that. This, however, does not take any of the current year's data into account (if the current year has been above average so far, the prediction should be slightly higher too).
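To make that naive idea concrete, here is roughly what I have in mind (a minimal sketch with made-up numbers; year_data, current_year, and the 0.7/0.3 blend are just placeholders):

    # Naive baseline: average the past years, then nudge the estimate
    # toward the current year's (partial) level.
    year_data = {
        2009: [12, 15, 18, 14],   # made-up attendance counts
        2010: [13, 16, 20, 15],
        2011: [14, 18, 22, 16],
    }
    current_year = [15, 19]       # partial data for the year in progress

    past_avg = sum(sum(v) / len(v) for v in year_data.values()) / len(year_data)
    current_avg = sum(current_year) / len(current_year)

    # Mostly history, nudged toward this year's level; the weights are arbitrary.
    prediction = 0.7 * past_avg + 0.3 * current_avg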

Both my data set and my knowledge of statistics are limited, so any insight is helpful.

My plan is to first build a prototype solution to check whether my data is sufficient for what I'm trying to do, and if that check (potentially) passes, to try a more refined approach.

Edit: Unfortunately, I never got the opportunity to try out the answers I received! I am still interested in whether this kind of data would be sufficient, and I will keep these answers in mind if I get the chance. Thanks for all the answers.

+8
python algorithm statistics prediction
2 answers

In your case, the data changes rapidly and you get immediate observations of new data, so a quick prediction can be made using Holt's exponential smoothing (the Holt-Winters method without the seasonal component).

Update equations:

$$\tilde{m}_{t+h} = \alpha\, m_t + (1-\alpha)\left(\tilde{m}_t + h\,\tilde{v}_t\right)$$

$$\tilde{v}_{t+h} = \beta\,\frac{\tilde{m}_{t+h} - \tilde{m}_t}{h} + (1-\beta)\,\tilde{v}_t$$

m_t is your observed data, for example the number of people at time t; v_t is its first derivative, i.e. the trend of m; h is the gap to the next time point; alpha and beta are the two decay parameters. A tilde on top marks a predicted value. Check the exponential smoothing article on Wikipedia for the details of the algorithm.
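To see one update step with concrete numbers (a made-up example, assuming alpha = 0.8, beta = 0.5, and a unit time step h = 1):

    alpha, beta, h = 0.8, 0.5, 1
    m_last = 26                  # newest observation
    m_pred, v_pred = 23.0, 2.0   # previous prediction and previous trend

    m_new = alpha * m_last + (1 - alpha) * (m_pred + v_pred * h)
    # = 0.8 * 26 + 0.2 * 25.0 = 25.8
    v_new = beta * (m_new - m_pred) / h + (1 - beta) * v_pred
    # = 0.5 * 2.8 + 0.5 * 2.0 = 2.4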

Since you are using python, I can show you sample code to get you started. BTW, I use some synthetic data, as shown below:

    data_t = list(range(15))
    data_y = [5, 6, 15, 20, 21, 22, 26, 42, 45, 60, 55, 58, 55, 50, 49]

Above, data_t is a sequence of consecutive time points starting at time 0, and data_y is the observed number of people at each of those times.
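If your real data is in (datetime.datetime, int) format, one way to produce such sequences is to measure time in days since the first record (a sketch; the records list is made up for illustration):

    import datetime

    # Made-up (datetime.datetime, int) records like the ones in the question.
    records = [(datetime.datetime(2013, 1, 3), 5),
               (datetime.datetime(2013, 1, 20), 15),
               (datetime.datetime(2013, 2, 14), 22)]
    records.sort()  # ensure chronological order

    t0 = records[0][0]
    data_t = [(d - t0).days for d, _ in records]   # days since the first record
    data_y = [count for _, count in records]

Note that the smoothing code below divides by the gap h between consecutive time points, so two records with the same timestamp (h = 0) would need to be merged first.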

The data looks like this (I tried to make it resemble your data): [plot of the synthetic data]

The code for the algorithm is simple:

    def holt_alg(h, y_last, y_pred, T_pred, alpha, beta):
        # One update step: blend the newest observation with the
        # trend-extrapolated previous prediction, then update the trend.
        pred_y_new = alpha * y_last + (1 - alpha) * (y_pred + T_pred * h)
        pred_T_new = beta * (pred_y_new - y_pred) / h + (1 - beta) * T_pred
        return (pred_y_new, pred_T_new)

    def smoothing(t, y, alpha, beta):
        # Initialization using the first two observations.
        pred_y = y[1]
        pred_T = (y[1] - y[0]) / (t[1] - t[0])
        y_hat = [y[0], y[1]]
        # Add the next unit time point, so the loop also emits one
        # forecast beyond the last observation (copy, don't mutate t).
        t = t + [t[-1] + 1]
        for i in range(2, len(t)):
            h = t[i] - t[i - 1]
            pred_y, pred_T = holt_alg(h, y[i - 1], pred_y, pred_T, alpha, beta)
            y_hat.append(pred_y)
        return y_hat

Now let's run our predictor and plot the predicted results against the observations:

    import matplotlib.pyplot as plt

    plt.plot(data_t, data_y, 'x-')  # observations
    pred_y = smoothing(data_t, data_y, alpha=.8, beta=.5)
    # smoothing() returns one extra value: the forecast one step past the data.
    plt.plot(data_t + [data_t[-1] + 1], pred_y, 'rx-')  # predictions
    plt.show()

Red shows the prediction at each time point. I set alpha to 0.8, so the most recent observation strongly influences the next prediction. If you want the historical data to carry more weight, play with the alpha and beta parameters. Also note that the last point on the red line, at t=15, is the final prediction, for which we have no observation yet.
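If you would rather pick alpha and beta from the data than eyeball them, here is a minimal sketch of a grid search over the in-sample one-step-ahead error, reusing the smoothing function above (the grid values are arbitrary):

    import itertools

    def one_step_error(t, y, alpha, beta):
        # Mean absolute error of the in-sample one-step-ahead predictions.
        y_hat = smoothing(t, y, alpha, beta)
        return sum(abs(p - o) for p, o in zip(y_hat[2:], y[2:])) / (len(y) - 2)

    grid = [0.2, 0.5, 0.8]
    best = min(itertools.product(grid, grid),
               key=lambda ab: one_step_error(data_t, data_y, ab[0], ab[1]))
    print("best (alpha, beta):", best)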

BTW, this is far from an ideal predictor, but it is a quick place to start. One drawback of this approach is that you have to keep receiving observations; otherwise the predictions will drift further and further off (this is probably true of all real-time prediction methods). Hope it helps.

[plot: the observations with the predictions overlaid in red]

+12

Forecasting is difficult. You might want to try polynomial extrapolation, but the error will grow significantly as you get farther away from the "known" region.
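For concreteness, a minimal polynomial-extrapolation sketch with numpy (the data and the degree are placeholders; as said above, expect it to degrade quickly outside the observed range):

    import numpy as np

    t = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([5, 6, 15, 20, 21, 22], dtype=float)

    coeffs = np.polyfit(t, y, deg=2)    # fit a quadratic to the observations
    future = np.polyval(coeffs, 7.0)    # extrapolate two steps past the data
    print(future)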

Another possible solution is to use machine learning algorithms, but that requires collecting a large amount of data.

Extract features from your data (for example, one feature could be the number of records in a single day), and train an algorithm on them (feed it data from the more distant past as input, with, for example, the present value as the field to be predicted).

I do not know about python, but in java there is an open-source library called weka that implements most of the common machine learning algorithms.

You can later evaluate how accurate the method is using cross-validation; a Python sketch of this feature-based setup follows below.
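Since the question is about Python, here is a sketch of the same feature-based idea using scikit-learn instead of weka (the choice of model, the feature rows, and the target values are made up purely for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Made-up feature rows: [day of year, number of records so far this year].
    X = np.array([[10, 2], [40, 5], [90, 9], [150, 14],
                  [200, 18], [250, 22], [300, 26], [330, 28]], dtype=float)
    y = np.array([5, 8, 14, 20, 24, 27, 30, 31], dtype=float)  # counts to predict

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = cross_val_score(model, X, y, cv=4)  # 4-fold cross-validation
    print("R^2 per fold:", scores)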


With that said, this problem is usually referred to as trend estimation, and it is currently a hot research area, so there is no silver bullet.

+4
source share
