As you mentioned above, with the same dataset, RMSE decreases as the rank and the number of iterations increase. However, as the dataset grows, the RMSE increases.
One common practice to reduce RMSE and similar measures is to normalize the ratings. In my experience, this works very well when you know the minimum and maximum ratings in advance.
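As a minimal sketch of what I mean by normalization (the rating values and the 1-5 scale here are just illustrative assumptions): min-max scale the ratings to [0, 1] before training, then map the predictions back to the original scale afterwards.

```python
import numpy as np

# Hypothetical ratings on a scale known in advance (assumed 1-5 here)
ratings = np.array([1.0, 3.0, 4.5, 5.0])
r_min, r_max = 1.0, 5.0

# Min-max normalize to [0, 1] before fitting the model
normalized = (ratings - r_min) / (r_max - r_min)

# After predicting on the normalized scale, map back to the original range
denormalized = normalized * (r_max - r_min) + r_min
```

The inverse transform at the end is why knowing the minimum and maximum up front matters: without them you cannot recover predictions on the original scale.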
In addition, you should consider measures other than RMSE. When doing matrix factorization, what I found useful is to compute the Frobenius norm of the errors (ratings minus forecasts) and divide it by the Frobenius norm of the original ratings. This gives you a relative error of your forecasts with respect to the original ratings.
Here is the Spark code for this method:
from math import sqrt

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Frobenius norm of the errors (ratings minus forecasts)
abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())
# Frobenius norm of the original ratings
frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())
# Finally, the relative error
rel_error = abs_frobenius_error / frob_error_orig
print("Relative Error = " + str(rel_error))
In this error measurement, the closer the error is to zero, the better the model.
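To see the metric independently of Spark, here is the same computation on a tiny made-up example (the rating and prediction values are purely illustrative):

```python
import numpy as np

# Made-up observed ratings and model forecasts
ratings = np.array([4.0, 3.0, 5.0, 2.0])
predictions = np.array([3.8, 3.1, 4.7, 2.4])

# Frobenius norm of the errors divided by the Frobenius norm of the ratings
abs_frobenius_error = np.sqrt(np.sum((ratings - predictions) ** 2))
frob_error_orig = np.sqrt(np.sum(ratings ** 2))
rel_error = abs_frobenius_error / frob_error_orig
print("Relative Error =", rel_error)
```

A relative error near 0 means the forecasts are close to the observed ratings; a value near 1 means the errors are as large as the ratings themselves.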
Hope this helps.
jtitusj