How to add a new column in Spark RDD?

I have an RDD with MANY columns (e.g. hundreds). How can I add another column at the end of this RDD?

For example, if my RDD looks like this:

    123, 523, 534, ..., 893
    536, 98, 1623, ..., 98472
    537, 89, 83640, ..., 9265
    7297, 98364, 9, ..., 735
    ......
    29, 94, 956, ..., 758

How can I add a column to it whose value is the sum of the second and third columns?

Many thanks.

2 answers

You do not need to use Tuple* objects at all to add a new column to an RDD.

This can be done by mapping over each row and building a new row from its original contents plus the elements you want to add, for example:

    import org.apache.spark.sql.Row

    val rdd = ...
    val withAppendedColumnsRdd = rdd.map(row => {
      // Copy the existing columns of the row into a list
      val originalColumns = row.toSeq.toList
      val secondColValue = originalColumns(1).asInstanceOf[Int]
      val thirdColValue = originalColumns(2).asInstanceOf[Int]
      val newColumnValue = secondColValue + thirdColValue
      // Build a new row: the original columns followed by the new one
      Row.fromSeq(originalColumns :+ newColumnValue)
      // or add several new columns:
      // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...))
    })
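The question does not say what type the records have. If they are plain arrays rather than Row objects, the same map-and-append idea works without Row. Here is a minimal sketch, assuming each record is an Array[Int] and Spark runs in local mode; the names AppendColumnExample and appendSum are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AppendColumnExample {
  // Row-level function: returns a copy of the row with
  // (second column + third column) appended as the last element.
  def appendSum(row: Array[Int]): Array[Int] = row :+ (row(1) + row(2))

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption made so the sketch is self-contained.
    val conf = new SparkConf().setAppName("append-column").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Small stand-in for the wide RDD in the question.
      val rdd = sc.parallelize(Seq(
        Array(123, 523, 534, 893),
        Array(536, 98, 1623, 98472)
      ))
      val withSum = rdd.map(appendSum)
      withSum.collect().foreach(r => println(r.mkString(", ")))
    } finally {
      sc.stop()
    }
  }
}
```

Because the per-row logic does not depend on the number of columns, it should work the same way for rows with hundreds of elements.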

If you have an RDD of Tuple4, apply map and convert each element to a Tuple5:

    val rddTuple4RDD = ...........
    // r is one Tuple4 element; the fifth field is the sum of the second and third
    val rddTuple5RDD = rddTuple4RDD.map(r => (r._1, r._2, r._3, r._4, r._2 + r._3))
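For reference, here is a self-contained sketch of the tuple approach, assuming an RDD of four-element Int tuples and local mode; the names TupleColumnExample, Row4, Row5 and addSum are hypothetical. Keep in mind that in Scala 2 tuples stop at Tuple22, so this only suits narrow RDDs, not the hundreds of columns mentioned in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TupleColumnExample {
  type Row4 = (Int, Int, Int, Int)
  type Row5 = (Int, Int, Int, Int, Int)

  // Copies the four original fields and appends (second + third) as a fifth.
  def addSum(r: Row4): Row5 = (r._1, r._2, r._3, r._4, r._2 + r._3)

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption made so the sketch is self-contained.
    val conf = new SparkConf().setAppName("tuple-column").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rddTuple4RDD = sc.parallelize(Seq(
        (123, 523, 534, 893),
        (536, 98, 1623, 98472)
      ))
      val rddTuple5RDD = rddTuple4RDD.map(addSum)
      rddTuple5RDD.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```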