Convert RDD to LabeledPoint

I have an RDD with approximately 500 columns and 200 million rows. RDD.columns.indexOf("target", 0) returns Int = 77, which tells me that my target (dependent) variable is in column 77. However, I don't know how to select only some of the columns as features (for example, I need columns 23 to 59, 111 to 357, and 399 to 489). Can I do something like this:

    val data = rdd.map(col => new LabeledPoint(
      col(77).toDouble,
      Vectors.dense(??.map(x => x.toDouble).toArray)))

Any suggestions or recommendations would be highly appreciated.

Maybe I have confused RDD with DataFrame. I can convert the RDD to a DataFrame using .toDF(); or is it easier to achieve this with a DataFrame than with an RDD?

1 answer

I assume your data looks more or less like this:

    import scala.util.Random.{setSeed, nextDouble}

    setSeed(1)

    case class Record(
      foo: Double, target: Double, x1: Double, x2: Double, x3: Double)

    val rows = sc.parallelize(
      (1 to 10).map(_ => Record(
        nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
      ))
    )

    val df = sqlContext.createDataFrame(rows)
    df.registerTempTable("df")

    // Note: the last column selects x2 under the alias x3, so the x2 and x3
    // columns are identical in the displayed output.
    sqlContext.sql("""
      SELECT ROUND(foo, 2) foo,
             ROUND(target, 2) target,
             ROUND(x1, 2) x1,
             ROUND(x2, 2) x2,
             ROUND(x2, 2) x3
      FROM df""").show

So, we have the data as shown below:

    +----+------+----+----+----+
    | foo|target|  x1|  x2|  x3|
    +----+------+----+----+----+
    |0.73|  0.41|0.21|0.33|0.33|
    |0.01|  0.96|0.94|0.95|0.95|
    | 0.4|  0.35|0.29|0.51|0.51|
    |0.77|  0.66|0.16|0.38|0.38|
    |0.69|  0.81|0.01|0.52|0.52|
    |0.14|  0.48|0.54|0.58|0.58|
    |0.62|  0.18|0.01|0.16|0.16|
    |0.54|  0.97|0.25|0.39|0.39|
    |0.43|  0.23|0.89|0.04|0.04|
    |0.66|  0.12|0.65|0.98|0.98|
    +----+------+----+----+----+

and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Map feature names to indices
    val featInd = List("x1", "x3").map(df.columns.indexOf(_))

    // Or, if you want to exclude columns instead, use this definition:
    // val ignored = List("foo", "target", "x2")
    // val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

    // Get index of target
    val targetInd = df.columns.indexOf("target")

    df.rdd.map(r => LabeledPoint(
      r.getDouble(targetInd), // Get target value
      // Map feature indices to values
      Vectors.dense(featInd.map(r.getDouble(_)).toArray)
    ))
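
For the position-based selection in the question (columns 23 to 59, 111 to 357, and 399 to 489), the index list can be built from ranges instead of names. The following is only a sketch in plain Scala, without Spark: it assumes the ranges are inclusive and 0-based, like the value returned by indexOf, and the names featureRanges and selectFeatures are made up for illustration.

```scala
// Column ranges from the question, assumed to be inclusive, 0-based positions.
val featureRanges: Seq[Range] = Seq(23 to 59, 111 to 357, 399 to 489)

// Position of the target column, as reported in the question.
val targetInd = 77

// Flatten the ranges into a single index array; the filter guards against
// the target accidentally falling inside one of the feature ranges.
val featInd: Array[Int] = featureRanges.flatten.filter(_ != targetInd).toArray

// Hypothetical helper: pick the selected positions out of one row of doubles.
def selectFeatures(row: Array[Double], indices: Array[Int]): Array[Double] =
  indices.map(i => row(i))

// Stand-in for a single 500-column row, where each value equals its position.
val exampleRow: Array[Double] = Array.tabulate(500)(_.toDouble)
val features: Array[Double] = selectFeatures(exampleRow, featInd)
```

With a DataFrame, a featInd built this way should be usable in the same place as the name-based one, for example Vectors.dense(featInd.map(r.getDouble(_))), provided all selected columns are of type Double.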