There are many ways to "normalize" the points, and the best one depends heavily on what exactly you are trying to achieve (which is not entirely clear from the question). That said, I'd like to suggest a simple, practical approach.
Start from the ideal case: suppose you had plenty of data and it were perfectly linear, i.e. it showed a clean linear relationship between load and MPG for each vehicle type. In that case you would have a strong prediction of the expected MPG for a given vehicle type and load. You could then compare the actual MPG to the expected value and score drivers based on that relationship, for example as the ratio actual MPG / expected MPG (e.g. 9 actual / 10 expected = 0.9).
In practice, however, the data is never that perfect. You can still build a model on the data you have and get a prediction, but instead of using the point estimate as the basis for scoring, you can use an interval around it (strictly, a prediction interval for an individual observation). For example: given some load, the model expects the MPG to be between 9 and 11 with 95% confidence. In some cases (more data, or a more linear relationship) the interval will be narrow; in others it will be wider.
You would then take action (e.g. "punish", as you put it) only if the MPG falls outside the expected range.
EDIT: illustration (code in R):

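A minimal sketch of this kind of illustration: the simulated fleet data below is purely an assumption made so the plot can be drawn; only the three highlighted drivers (loads 1100/1300/1400 with 7.8/9/9 MPG) come from the example discussed next.

```r
# Hypothetical fleet data: MPG assumed to decrease roughly linearly with load
set.seed(1)
load <- runif(100, 1000, 1500)
mpg  <- 22 - 0.01 * load + rnorm(100, sd = 0.3)
d    <- data.frame(load = load, mpg = mpg)

# Linear model of MPG as a function of load
fit <- lm(mpg ~ load, data = d)

# 95% prediction interval (band for an individual driver's MPG) over the load range
grid <- data.frame(load = seq(1000, 1500, length.out = 100))
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

plot(d$load, d$mpg, pch = 16, col = "grey", xlab = "Load", ylab = "MPG")
lines(grid$load, pred[, "fit"], lwd = 2)   # expected MPG
lines(grid$load, pred[, "lwr"], lty = 2)   # lower bound of the expected range
lines(grid$load, pred[, "upr"], lty = 2)   # upper bound of the expected range

# The three drivers mentioned in the text
points(1100, 7.8, col = "red",   pch = 17, cex = 1.5)  # below the band: worse than expected
points(1300, 9.0, col = "blue",  pch = 17, cex = 1.5)  # inside the band: as expected
points(1400, 9.0, col = "green", pch = 17, cex = 1.5)  # above the band: better than expected
```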
So, based on this data, the red driver's fuel economy (7.8 MPG at a load of 1100) is significantly worse than expected, the blue driver's (9 MPG at a load of 1300) is within the expected range, and the green driver's (9 MPG at a load of 1400) is better than expected. Of course, depending on how much data you have and how well it fits, you could use more complex models, but the idea stays the same.
EDIT 2: fixed the mix-up between the green and red drivers (since a higher MPG is better, not worse).
Also, to address the question in the comments about "scoring" drivers: a reasonable scheme might be either to use the ratio relative to the predicted point, or, perhaps better, to normalize the deviation by the standard deviation (i.e. express the difference from the expected value in units of standard deviations). In the example above, a driver 10% above the line at a load of 1250 would then get a better score than a driver 10% above the line at a load of 1500, because there is more uncertainty around 1500 (so being 10% above the line there is closer to the "expected" range).
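A small sketch of such a standardized score, reusing `fit` from the sketch above. The specific choice of predictive standard deviation (combining the standard error of the fitted mean with the residual scale) is my assumption; a simpler variant could use only one of the two.

```r
# Standardized score: deviation of actual MPG from predicted MPG,
# in units of the predictive standard deviation at that load.
# Positive = better than expected, negative = worse.
score <- function(fit, load_value, actual_mpg) {
  p <- predict(fit, newdata = data.frame(load = load_value), se.fit = TRUE)
  pred_sd <- sqrt(p$se.fit^2 + p$residual.scale^2)
  as.numeric((actual_mpg - p$fit) / pred_sd)
}

score(fit, 1100, 7.8)  # strongly negative: worse than expected
score(fit, 1300, 9.0)  # near zero: as expected
score(fit, 1400, 9.0)  # positive: better than expected
```

Because the predictive standard deviation grows where the data is sparse (e.g. at the edges of the load range), the same percentage deviation from the line automatically counts for less there, which is exactly the behavior described above.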