Spark LinearRegressionWithSGD is very sensitive to feature scaling

I have a problem with LinearRegressionWithSGD in Spark MLlib. I followed their example from https://spark.apache.org/docs/latest/mllib-linear-methods.html (using the Python interface).

In their example, all features are already roughly standardized, with a mean of about 0 and a standard deviation of about 1. Now, if I un-scale one of them by a factor of 10, the regression breaks down (it produces NaNs or very large coefficients):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10

    return LabeledPoint(values[0], values[1:])

data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients: " + str(model))

So, I think I need to do feature scaling. If I pre-scale, it works (since I am back to scaled features). However, now I do not know how to get the coefficients back in the original space.

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.feature import StandardScalerModel
from pyspark.mllib.linalg import Vectors

# Load and parse the data
def parseToDenseVector(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    # UN-SCALE one of the features by a factor of 10
    values[3] *= 10
    return Vectors.dense(values[0:])

# Load and parse the data
def parseToLabel(values):
    return LabeledPoint(values[0], values[1:])

data = sc.textFile(spark_home+"data/mllib/ridge-data/lpsa.data")

parsedData = data.map(parseToDenseVector)
scaler = StandardScaler(True, True)
scaler_model = scaler.fit(parsedData)
parsedData_scaled = scaler_model.transform(parsedData)

parsedData_scaled_transformed = parsedData_scaled.map(parseToLabel)

# Build the model
model = LinearRegressionWithSGD.train(parsedData_scaled_transformed)

# Evaluate the model on training data
valuesAndPreds = parsedData_scaled_transformed.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
print("Model coefficients: " + str(model))

So, how do I get the model coefficients and the intercept back in the original (unscaled) space? The scaler_model is a StandardScalerModel, but I don't see a way to get the mean and std out of it from Python.

+4
3 answers

The first thing to understand is that gradient descent diverges when the step size is too large for the curvature of the objective: even minimizing a one-dimensional quadratic like f(x) = x², a fixed step that is too large makes every iterate overshoot the minimum by more than the last, and the sequence blows up.

Your case is a known consequence of this; see SPARK-1859. Paraphrasing the explanation there:

Suppose the gradient of the objective is Lipschitz-continuous with constant L. Then GD converges with stepSize = 1/(2L). Spark uses the (1/n)-scaled least-squares objective, so L is driven by the squared magnitude of the features.

For example, with n = 5 examples and a feature of magnitude around 1500, L ≈ 1500 * 1500 / 5, so the step would need to be around stepSize = 1/(2L) ≈ 10 / (1500 ^ 2).

The practical point: un-scaling one feature by a factor of 10 multiplies L by 100, the default stepSize = 1.0 becomes far too large, and the weights diverge to NaN. Standardize the features, as you did, or pass a correspondingly tiny step (and many more iterations) to LinearRegressionWithSGD.train.
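To make the step-size discussion above concrete, here is a small NumPy sketch (not Spark; the toy data, the factor-1500 feature, and the `sgd_like` helper are all made up for illustration) of fixed-step batch gradient descent on the (1/n)-scaled squared loss:

```python
import numpy as np

# Toy data: 5 examples, 2 features, the second feature ~1500x larger.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0.0, 1.0, 5), rng.normal(0.0, 1500.0, 5)])
y = X @ np.array([2.0, 0.003]) + rng.normal(0.0, 0.1, 5)
n = len(y)

def sgd_like(X, y, step, iters=100):
    """Fixed-step batch gradient descent on (1/2n) * ||Xw - y||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n   # gradient of the (1/n)-scaled loss
        w = w - step * grad
    return w

with np.errstate(over="ignore", invalid="ignore"):
    w_big = sgd_like(X, y, step=1.0)                   # default-size step: diverges
w_small = sgd_like(X, y, step=1.0 / (2 * 1500.0**2))   # step ~ 1/(2L): stays finite

print(np.isfinite(w_big).all())    # False: the weights blew up to inf/nan
print(np.isfinite(w_small).all())  # True
```

With the large feature, the curvature constant L is on the order of 1500², so a step of 1.0 multiplies the error by roughly that factor every iteration, while a step near 1/(2L) keeps the iterates finite.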

+2

Suppose you want the intercept I and coefficients C_1 and C_2 of the model in the original space: Y = I + C_1 * x_1 + C_2 * x_2 (where x_1 and x_2 are the unscaled features).

Let i be the intercept returned by mllib, and c_1 and c_2 the coefficients (weights) it returns for the scaled features.

Let m_1 be the mean of x_1 and m_2 the mean of x_2.

Let s_1 be the standard deviation of x_1 and s_2 the standard deviation of x_2.

Then:

C_1 = (c_1 / s_1), C_2 = (c_2 / s_2)

and

I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2

(Substitute the scaled features into the fitted model, Y = i + c_1 * (x_1 - m_1) / s_1 + c_2 * (x_2 - m_2) / s_2, and collect terms.)

With 3 features it extends the same way:

C_3 = (c_3 / s_3) and I = i - c_1 * m_1 / s_1 - c_2 * m_2 / s_2 - c_3 * m_3 / s_3
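Here is a quick NumPy check of these back-transformation formulas (the data is made up, and `np.linalg.lstsq` stands in for the mllib fit on the scaled features):

```python
import numpy as np

# Made-up data: 20 points, 3 features on very different scales,
# with a known intercept (4.0) and known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 0.5, 30.0], size=(20, 3))
y = 4.0 + X @ np.array([1.5, -3.0, 0.02])

# Standardize the features, as StandardScaler(True, True) would.
m = X.mean(axis=0)
s = X.std(axis=0, ddof=1)
Xs = (X - m) / s

# Fit intercept + weights in the scaled space (stand-in for the SGD fit).
A = np.column_stack([np.ones(len(y)), Xs])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
i, c = coef[0], coef[1:]

# Back-transform to the original space with the formulas above.
C = c / s
I = i - np.sum(c * m / s)

print(np.allclose(I + X @ C, i + Xs @ c))                      # True: same predictions
print(np.allclose(C, [1.5, -3.0, 0.02]), np.isclose(I, 4.0))   # True True
```

Because the toy target is exactly linear, the back-transformed intercept and coefficients recover the original ones.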

0

Note that in pyspark the StandardScalerModel does not expose its std and mean (see https://issues.apache.org/jira/browse/SPARK-6523), but you can compute them yourself:

import numpy as np
from pyspark.mllib.stat import Statistics

# features: the RDD of feature vectors that was fed to the scaler
summary = Statistics.colStats(features)
mean = summary.mean()
std = np.sqrt(summary.variance())

These are the same mean and std your scaler used. You can verify it by reaching into the underlying Java model:

print(scaler_model.__dict__.get('_java_model').std())
print(scaler_model.__dict__.get('_java_model').mean())
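One caveat if you reproduce this check in plain NumPy (the array below is just an illustration): MLlib's colStats and StandardScaler report the unbiased variance with an (n − 1) denominator, while `np.var` defaults to the n denominator, so pass `ddof=1`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

pop_var = np.var(x)             # n denominator: 1.25
sample_var = np.var(x, ddof=1)  # (n - 1) denominator, matching MLlib: ~1.6667

print(pop_var, sample_var)
```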