Machine Learning: normalize target var based on the influence of independent var

I have a data set that contains information about drivers' trips, as shown below. My goal is to come up with a new, adjusted mileage that takes into account the load the driver is carrying and the vehicle he is driving. We found a negative correlation between mileage and load: the more load you carry, the less mileage you get. The type of vehicle can also affect performance. In a sense, we are trying to normalize the mileage so that a driver who was given a heavy load, and so got lower mileage, is not penalized for it. So far, I have used linear regression and correlation to look at the relationship between mileage and the load the driver carries. The correlation was -0.6. The dependent variable is Miles per Gal, and the independent variables are Load and Vehicle.

Drv  Miles per Gal  Load (lbs)  Vehicle
A    7              1500        2016 Tundra
B    8              1300        2016 Tundra
C    8              1400        2016 Tundra
D    9              1200        2016 Tundra
E    10             1000        2016 Tundra
F    6              1500        2017 F150
G    6              1300        2017 F150
H    7              1400        2017 F150
I    9              1300        2017 F150
J    10             1100        2017 F150

The result might look like this.

Drv  Result (New Mileage)
A    7.8
B    8.1
C    8.3
D    8.9
E    9.1
F    8.3
G    7.8
H    8
I    8.5
J    9

So far, I am unsure how to use the slopes from the linear regression to normalize these figures. Any other feedback on the approach would be helpful.

Our ultimate goal is to rank drivers based on miles per gallon, taking into account the effects of load and vehicle.

Thanks, Jay

+7
python statistics machine-learning correlation linear-regression
2 answers

There are many ways to "normalize" figures like these, and the best one depends heavily on what exactly you are trying to achieve (which is not clear from the question). That said, I would like to suggest a simple, practical approach.

Starting with the ideal case: say you had a lot of data and it was all perfectly linear, i.e. showing a neat linear relationship between load and MPG for each vehicle type. In that case, you would have a strong prediction of the expected MPG for each vehicle type, given some load. You could then compare the actual MPG with the expected value and "score" the driver by the ratio, e.g. actual MPG / expected MPG.
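A minimal Python sketch of this ratio score, using the ten trips from the question. It assumes a separate straight-line fit of MPG on load for each vehicle, which is a simplification; `fit_line` and `ratio_scores` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict

# Trips from the question: (driver, mpg, load_lbs, vehicle)
TRIPS = [
    ("A", 7, 1500, "2016 Tundra"), ("B", 8, 1300, "2016 Tundra"),
    ("C", 8, 1400, "2016 Tundra"), ("D", 9, 1200, "2016 Tundra"),
    ("E", 10, 1000, "2016 Tundra"), ("F", 6, 1500, "2017 F150"),
    ("G", 6, 1300, "2017 F150"), ("H", 7, 1400, "2017 F150"),
    ("I", 9, 1300, "2017 F150"), ("J", 10, 1100, "2017 F150"),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def ratio_scores(trips):
    """Score each driver as actual MPG / expected MPG for their vehicle and load."""
    by_vehicle = defaultdict(list)
    for driver, mpg, load, vehicle in trips:
        by_vehicle[vehicle].append((driver, mpg, load))

    scores = {}
    for rows in by_vehicle.values():
        intercept, slope = fit_line([r[2] for r in rows], [r[1] for r in rows])
        for driver, mpg, load in rows:
            scores[driver] = mpg / (intercept + slope * load)
    return scores


scores = ratio_scores(TRIPS)
ranking = sorted(scores, key=scores.get, reverse=True)  # best driver first
for driver in ranking:
    print(driver, round(scores[driver], 3))
```

A score above 1 means the driver did better than the line predicts for that load and vehicle; a score below 1 means worse.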

In practice, however, the data is never perfect. So you can build a model from the available data and get a prediction, but instead of using the point estimate as the basis for scoring, you can use a confidence interval. For example: for a given vehicle model and load, the expected MPG is between 9 and 11 with 95% confidence. In some cases (when more data is available, or it is more linear) the confidence interval will be narrow; in others, it will be wider.

Then you can take action (for example, "punish", as you put it) only if, say, the MPG falls outside the expected range.
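The range check can be sketched in Python as follows. This is only an illustration under stated assumptions: each vehicle gets its own simple linear fit, the interval is the 95% confidence interval for the mean MPG at a given load, and the critical value 3.182 (Student's t, 3 degrees of freedom) is hard-coded because the standard library has no t quantile function. `expected_range` and `classify` are hypothetical names.

```python
import math

# t(0.975, df=3): 5 trips per vehicle minus 2 fitted parameters.
# Hard-coded assumption; it must change if the number of trips changes.
T_CRIT_DF3 = 3.182

# (load_lbs, mpg) pairs from the question
TUNDRA = [(1500, 7), (1300, 8), (1400, 8), (1200, 9), (1000, 10)]
F150 = [(1500, 6), (1300, 6), (1400, 7), (1300, 9), (1100, 10)]


def expected_range(rows, load, t_crit=T_CRIT_DF3):
    """Return (lower, fit, upper): confidence interval for mean MPG at `load`."""
    n = len(rows)
    x_bar = sum(x for x, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    s = math.sqrt(sse / (n - 2))  # residual standard error
    fit = intercept + slope * load
    half = t_crit * s * math.sqrt(1 / n + (load - x_bar) ** 2 / sxx)
    return fit - half, fit, fit + half


def classify(mpg, lower, upper):
    """Say whether an observed MPG is below, within, or above the range."""
    if mpg < lower:
        return "below"
    if mpg > upper:
        return "above"
    return "within"


for name, rows in (("2016 Tundra", TUNDRA), ("2017 F150", F150)):
    for load, mpg in rows:
        lower, fit, upper = expected_range(rows, load)
        print(name, load, mpg, classify(mpg, lower, upper))
```

With only five trips per vehicle the intervals are wide, especially for the noisier F150 data, so few drivers, if any, fall outside the range. More data would narrow the band.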

EDIT: illustration (code in R):

    library(ggplot2)

    # df contains the data above.
    # Fit a linear model (note that 'Vehicle' is categorical, not numerical).
    md <- lm(Miles.per.Gal ~ Load + Vehicle, data = df)

    # Predict over the observed load range; for this illustration, only for the Tundra.
    newx <- seq(min(df$Load), max(df$Load), length.out = 100)
    preds_df <- as.data.frame(predict(md,
                                      newdata = data.frame(Load = newx, Vehicle = "2016 Tundra"),
                                      interval = "confidence"))
    preds_df$x <- newx

    # Plot the fit and its confidence band.
    plt <- ggplot(data = preds_df) +
      geom_line(aes(x = x, y = fit)) +
      geom_ribbon(aes(x = x, ymin = lwr, ymax = upr), alpha = 0.3)

    # Points for illustration.
    plt +
      annotate("point", x = 1100, y = 7.8, colour = "red", size = 4) +
      annotate("point", x = 1300, y = 4, colour = "blue", size = 4) +
      annotate("point", x = 1400, y = 9, colour = "green", size = 4)

[Figure: fitted MPG-versus-load line for the Tundra with its confidence band, plus the red, blue, and green example points]

Thus, based on these data, the red driver's MPG (7.8 MPG with a load of 1100) is significantly worse than expected, the blue driver (9 MPG with a load of 1300) is within the expected range, and the green driver (9 MPG with a load of 1400) has a better MPG than expected. Of course, depending on the amount of data you have and the goodness of fit, you can use more complex models, but the idea remains the same.

EDIT 2: fixed the mix-up between green and red (since a higher MPG is better, not worse).

In addition, regarding the question in the comments about "scoring" drivers: a reasonable scheme would be either to use the ratio to the predicted point, or, perhaps even better, to normalize the deviation by the standard deviation (i.e., the distance from the expected value in stdev units). So, in the example above, a driver 10% above the line with a load of 1250 would get a better score than a driver 10% above the line with a load of 1500, because there is more uncertainty at 1500 (so the same 10% is closer to the "expected" range).
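A rough Python sketch of scoring in standard-error units, using only the Tundra trips from the question. It assumes a simple linear fit of MPG on load and uses the standard error of the fitted mean as the "stdev"; the two drivers who are 10% above the line are hypothetical, and `fit`, `expected`, and `std_score` are illustrative names.

```python
import math

# (load_lbs, mpg) for the 2016 Tundra trips in the question
TUNDRA = [(1500, 7), (1300, 8), (1400, 8), (1200, 9), (1000, 10)]


def fit(rows):
    """Least-squares line plus the quantities needed for standard errors."""
    n = len(rows)
    x_bar = sum(x for x, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "slope": slope,
            "intercept": intercept, "s": math.sqrt(sse / (n - 2))}


def expected(model, load):
    """Expected MPG on the fitted line at `load`."""
    return model["intercept"] + model["slope"] * load


def std_score(model, load, mpg):
    """Deviation from the line, in units of the fit's standard error at `load`."""
    se_fit = model["s"] * math.sqrt(
        1 / model["n"] + (load - model["x_bar"]) ** 2 / model["sxx"])
    return (mpg - expected(model, load)) / se_fit


model = fit(TUNDRA)
# Two hypothetical drivers, each 10% above the fitted line
z_1250 = std_score(model, 1250, 1.10 * expected(model, 1250))
z_1500 = std_score(model, 1500, 1.10 * expected(model, 1500))
print(round(z_1250, 2), round(z_1500, 2))
```

The load of 1250 is close to the mean Tundra load, where the fit is most certain, so the same 10% margin counts for more there than at 1500.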

+4

The term you are looking for is decorrelation. You are trying to decorrelate MPG and Load. One approach is to fit a linear model, as you did, and subtract the predictions of this model from the original MPG values, thereby removing the influence of the load (according to the linear model). The Wikipedia article on decorrelation lists this under "linear predictive coders." If you want to get fancy, you can try the same idea with more complex models, if you think that MPG and Load are not actually linearly related.
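A small Python sketch of this residual approach on the ten trips from the question. Two things here are assumptions rather than part of the answer: the fit uses Load only and ignores Vehicle, and the overall mean MPG is added back to each residual so the adjusted figure stays on an MPG-like scale. `pearson` is a hypothetical helper.

```python
import math

# Trips from the question: driver -> (load_lbs, mpg)
TRIPS = {
    "A": (1500, 7), "B": (1300, 8), "C": (1400, 8), "D": (1200, 9),
    "E": (1000, 10), "F": (1500, 6), "G": (1300, 6), "H": (1400, 7),
    "I": (1300, 9), "J": (1100, 10),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


loads = [load for load, _ in TRIPS.values()]
mpgs = [mpg for _, mpg in TRIPS.values()]

# Least-squares fit of MPG on Load
n = len(loads)
load_bar, mpg_bar = sum(loads) / n, sum(mpgs) / n
slope = (sum((x - load_bar) * (y - mpg_bar) for x, y in zip(loads, mpgs))
         / sum((x - load_bar) ** 2 for x in loads))
intercept = mpg_bar - slope * load_bar

# Adjusted MPG = residual (actual - predicted) + overall mean MPG
adjusted = {d: mpg - (intercept + slope * load) + mpg_bar
            for d, (load, mpg) in TRIPS.items()}

print("corr(load, mpg)      =", round(pearson(loads, mpgs), 3))
print("corr(load, adjusted) =", round(pearson(loads, list(adjusted.values())), 3))
for d in sorted(adjusted, key=adjusted.get, reverse=True):
    print(d, round(adjusted[d], 2))
```

By construction, the adjusted values have essentially zero linear correlation with load, so ranking drivers by them no longer rewards light loads.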

+1
