I plan to use linear regression in Spark. To start, I tried an example from the official documentation (which you can find here).
I also found https://stackoverflow.com/a/126907/ ... which is essentially the same question as mine. The answer suggests adjusting the step size, which I also tried; however, the results are still just as random as they were before changing the step size. The code I use is as follows:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
The results are as follows:
(Expected Label, Predicted Label)
(-0.4307829, -0.7824231588143065)
(-0.1625189, -0.6234287565006766)
(-0.1625189, -0.41979307020176226)
(-0.1625189, -0.6517649080382241)
(0.3715636, -0.38543073492870156)
(0.7654678, -0.7329426818746223)
(0.8544153, -0.33273378445315)
(1.2669476, -0.36663240056848917)
(1.2669476, -0.47541427992967517)
(1.2669476, -0.1887811811672498)
(1.3480731, -0.28646712964591936)
(1.446919, -0.3425075015127807)
(1.4701758, -0.14055275401870437)
(1.4929041, -0.06819303631450688)
(1.5581446, -0.772558163357755)
(1.5993876, -0.19251656391040356)
(1.6389967, -0.38105697301968594)
(1.6956156, -0.5409707504639943)
(1.7137979, 0.14914490255841997)
(1.8000583, -0.0008818203337740971)
(1.8484548, 0.06478505759587616)
(1.8946169, -0.0685096804502884)
(1.9242487, -0.14607596025743624)
(2.008214, -0.24904211817187422)
(2.0476928, -0.4686214015035236)
(2.1575593, 0.14845590638215034)
(2.1916535, -0.5140996125798528)
(2.2137539, 0.6278134417345228)
(2.2772673, -0.35049969044209983)
(2.2975726, -0.06036824276546304)
(2.3272777, -0.18585219083806218)
(2.5217206, -0.03167349168036536)
(2.5533438, -0.1611040092884861)
(2.5687881, 1.1032200139582564)
(2.6567569, 0.04975777739217784)
(2.677591, -0.01426285133724671)
(2.7180005, 0.07853368755223371)
(2.7942279, -0.4071930969456503)
(2.8063861, 0.000492545291049501)
(2.8124102, -0.019947344959659177)
(2.8419982, 0.03023139779978133)
(2.8535925, 0.5421291261646886)
(2.9204698, 0.3923068894170366)
(2.9626924, 0.21639267973240908)
(2.9626924, -0.22540434628281075)
(2.9729753, 0.2363938458250126)
(3.0130809, 0.35136961387278565)
(3.0373539, 0.013876918415846595)
(3.2752562, 0.49970959078043126)
(3.3375474, 0.5436323480304672)
(3.3928291, 0.48746004196839055)
(3.4355988, 0.3350764608584778)
(3.4578927, 0.6127634045652381)
(3.5160131, -0.03781697409079157)
(3.5307626, 0.2129806543371961)
(3.5652984, 0.5528805608876549)
(3.5876769, 0.06299042506665305)
(3.6309855, 0.5648082098866389)
(3.6800909, -0.1588172848952902)
(3.7123518, 0.1635062564072022)
(3.9843437, 0.7827244309795267)
(3.993603, 0.6049246406551748)
(4.029806, 0.06372113813964088)
(4.1295508, 0.24281029469705093)
(4.3851468, 0.5906868686740623)
(4.6844434, 0.4055055537895428)
(5.477509, 0.7335244827296759)
Mean Squared Error = 6.83550847274
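For reference, here is a minimal sketch of how a Mean Squared Error like the one above can be computed from (expected, predicted) pairs. It is plain Python rather than Spark, the function name `mean_squared_error` is only illustrative, and the sample uses just the first three pairs from the output, so it will not reproduce the full figure.

```python
def mean_squared_error(pairs):
    """Return the mean of (label - prediction)^2 over (label, prediction) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    return sum((label - pred) ** 2 for label, pred in pairs) / len(pairs)


# First three (expected, predicted) pairs copied from the output above;
# this is an illustrative subset, not the full result set.
sample = [
    (-0.4307829, -0.7824231588143065),
    (-0.1625189, -0.6234287565006766),
    (-0.1625189, -0.41979307020176226),
]

print("MSE on sample =", mean_squared_error(sample))
```

Running the same function over all pairs gives a quick way to check the reported error independently of Spark.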
So what am I missing? Since the data is taken from the official documentation, I would expect it to be suitable for linear regression (and to give at least a reasonably good prediction).