Spark: loading Vectors from a text file

I am just getting started with Spark and I am having problems with import org.apache.spark.mllib.linalg.{Vector, Vectors}.

The input to my program is a text file containing an RDD[Vector] that was saved as text, dataset.txt:

    [-0.5069793074881704,-2.368342680619545,-3.401324690974588]
    [-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
    [-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
    [0.510736518683609,-2.7304268743276174,-2.418865539558031]

So this is what I am doing:

    val rdd = sc.textFile("/workingdirectory/dataset")
    val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

I get an error because it tries to read "[0.510736518683609" as a number. Is there any way to load a vector stored in a text file directly, without the second line? How can I remove the "[" in the map stage? I'm really new to Spark, sorry if this is a very obvious question.

+6
2 answers

Given this input, the simplest thing you can do is use Vectors.parse:

    scala> import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.Vectors

    scala> Vectors.parse("[-0.50,-2.36,-3.40]")
    res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]

It also works with the sparse representation (size, then indices, then values):

    scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
    res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])

Combined with your data, all you need is:

 rdd.map(Vectors.parse) 
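Putting the pieces together, a minimal end-to-end sketch (assuming the spark-shell sc context and the path from the question; the RDD type annotation is only for illustration):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Read the saved text output and parse each line back into a Vector.
    val rdd = sc.textFile("/workingdirectory/dataset")
    val data: RDD[Vector] = rdd.map(Vectors.parse)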

If you expect malformed or empty lines, you can wrap the parse in a Try:

    import scala.util.Try

    rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
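An equivalent one-pass variant (a sketch under the same assumptions): Try(...).toOption turns a failed parse into None, and flatMap drops the Nones, so you avoid the separate filter and the unchecked .get:

    import scala.util.Try

    // Failed parses become None and are silently dropped by flatMap.
    val clean = rdd.flatMap(line => Try(Vectors.parse(line)).toOption)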
+4

Here is one way to do this:

    val rdd = sc.textFile("/workingdirectory/dataset")
    val data = rdd.map { s =>
      val vect = s.replaceAll("\\[", "").replaceAll("\\]", "").split(',').map(_.toDouble)
      Vectors.dense(vect)
    }

I just broke the map over multiple lines for readability.

Note: this is just simple string processing applied to each line.
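If the brackets only ever appear at the very ends of each line, stripPrefix/stripSuffix is a slightly more targeted alternative to replaceAll (a sketch under that assumption):

    val data = rdd.map { s =>
      // Drop one leading "[" and one trailing "]" if present, then parse.
      val trimmed = s.trim.stripPrefix("[").stripSuffix("]")
      Vectors.dense(trimmed.split(',').map(_.toDouble))
    }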

+1
