How to make the RMSE (root mean squared error) small when using Spark ALS?

I need suggestions on building a good model for making recommendations with Spark's collaborative filtering. This is the example from the official website:

    from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

    # Load and parse the data
    data = sc.textFile("data/mllib/als/test.data")
    ratings = data.map(lambda l: l.split(','))\
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    RMSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()**0.5
    print("Root Mean Squared Error = " + str(RMSE))

A good model needs the RMSE to be as small as possible.
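
For reference, this is what the code above computes; a tiny plain-Python sketch with made-up numbers (not Spark):

    from math import sqrt

    actual = [3.0, 4.0, 5.0]       # observed ratings
    predicted = [2.5, 4.5, 4.0]    # model predictions
    # RMSE = square root of the mean of the squared differences
    rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
    print(rmse)  # ~0.707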

Is it because I am not setting the right parameters for the ALS.train method, such as rank, numIterations, etc.?

Or is it because my dataset is too small, which makes the RMSE large?

So, can someone help me figure out why the RMSE is large and how to fix it?

Addendum:

As @eliasah said, I need to add some details to narrow down the answer set. Consider this specific situation:

Suppose I want to build a recommendation system to recommend music to my clients. I have their play history for tracks, albums, artists, and genres. Obviously, these four classes form a hierarchical structure: tracks belong directly to albums, albums belong to artists, and artists can belong to several different genres. Finally, I want to use all of this data to select some tracks to recommend to customers.

So, what is the best practice for building a good model in this situation, and for keeping the RMSE of the predictions as small as possible?

+7
collaborative-filtering apache-spark pyspark apache-spark-mllib
3 answers

As you mentioned above, with the same dataset the RMSE decreases as the rank and the number of iterations increase. However, as the dataset grows, the RMSE increases.

Now, one common practice used to reduce the RMSE and other similar measures is to normalize the ratings. In my experience, this works really well when you know the minimum and maximum ratings in advance.
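
As an illustration, here is a minimal sketch of that idea, assuming the ratings RDD from the question; the 1-5 rating scale is an assumption, adjust it to your data:

    # Assumed rating scale; adjust to your data
    min_rating, max_rating = 1.0, 5.0

    # Min-max normalize ratings to [0, 1] before training
    normalized = ratings.map(
        lambda r: Rating(r[0], r[1], (r[2] - min_rating) / (max_rating - min_rating)))
    model = ALS.train(normalized, rank, numIterations)

    # Predictions come back on the [0, 1] scale; rescale them before comparing
    # against the original ratings: pred * (max_rating - min_rating) + min_rating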

In addition, you should also consider using measures other than RMSE. When doing matrix factorization, what I have found useful is to compute the Frobenius norm of (ratings - predictions) and then divide it by the Frobenius norm of the ratings. This gives you the relative error of your predictions with respect to the original ratings.

Here is the Spark code for this method:

    from math import sqrt

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)

    # Frobenius norm of (ratings - predictions): square root of the sum of squared errors
    abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())

    # Frobenius norm of the original ratings
    frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())

    # Finally, the relative error
    rel_error = abs_frobenius_error / frob_error_orig
    print("Relative Error = " + str(rel_error))

With this error measure, the closer the error is to zero, the better the model.

Hope this helps.

+3

I have done a little research on this; here is my conclusion:

When the rank and the number of iterations increase, the RMSE decreases. However, as the size of the dataset grows, the RMSE increases. Following from the above, the rank significantly changes the value of the RMSE.
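
To illustrate, here is a hedged sketch of picking the rank and number of iterations on a held-out validation set, assuming the ratings RDD from the question; the parameter grids, the 80/20 split, and the 0.1 regularization value are arbitrary illustrative choices:

    from math import sqrt
    from operator import add

    # Hold out part of the data so the error is not measured on the training set
    training, validation = ratings.randomSplit([0.8, 0.2], seed=0)
    num_validation = validation.count()

    def rmse_of(model, data, n):
        # RMSE of the model's predictions against the held-out ratings
        preds = model.predictAll(data.map(lambda x: (x[0], x[1]))) \
                     .map(lambda r: ((r[0], r[1]), r[2]))
        joined = data.map(lambda r: ((r[0], r[1]), r[2])).join(preds).values()
        return sqrt(joined.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

    best = None
    for rank in [8, 12, 20]:              # illustrative grid
        for num_iterations in [10, 20]:   # illustrative grid
            model = ALS.train(training, rank, num_iterations, 0.1)
            err = rmse_of(model, validation, num_validation)
            if best is None or err < best[0]:
                best = (err, rank, num_iterations)

    print("Best validation RMSE = %f (rank=%d, iterations=%d)" % best)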

I know that this is not enough to get a good model. I would welcome more ideas!

+1

In PySpark, use the following to compute the root mean squared error (RMSE):

    from pyspark.mllib.recommendation import ALS
    from math import sqrt
    from operator import add

    # rank is the number of latent factors in the model.
    # iterations is the number of iterations to run.
    # lambda specifies the regularization parameter in ALS.
    rank = 8
    num_iterations = 8
    lmbda = 0.1

    # Train the model with the training data and the configured rank and iterations
    model = ALS.train(training, rank, num_iterations, lmbda)

    def compute_rmse(model, data, n):
        """
        Compute RMSE (Root Mean Squared Error), i.e. the square root of the
        average value of (actual rating - predicted rating)^2.
        """
        predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
        predictions_ratings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
            .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
            .values()
        return sqrt(predictions_ratings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

    print("The model was trained with rank = %d, lambda = %.1f, and %d iterations.\n" %
          (rank, lmbda, num_iterations))

    # Print the RMSE of the model on the validation set
    validation_rmse = compute_rmse(model, validation, num_validation)
    print("Its RMSE on the validation set is %f.\n" % validation_rmse)
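
The training, validation, and num_validation variables above are assumed to come from a split of the full ratings RDD; one hedged way to produce them:

    # Assumed: ratings is the full RDD of Rating objects
    training, validation = ratings.randomSplit([0.8, 0.2], seed=42)
    num_validation = validation.count()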
0
