Scala: add a new column to a DataFrame by expression

I want to add a new column to a DataFrame using an expression. For example, I have this DataFrame:

    +-----+---+---+---+
    |   C1| C2| C3| C4|
    +-----+---+---+---+
    |steak|  1|  1|150|
    |steak|  2|  2|180|
    | fish|  3|  3|100|
    +-----+---+---+---+

and I want to create a new column C5 with the expression "C2 / C3 + C4". Assume that several new columns need to be added, and that the expressions may differ and come from a database.

Is there a good way to do this?

I know that if I have an expression like "2 + 3 * 4", I can use scala.tools.reflect.ToolBox to evaluate it.
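For example (a minimal sketch; it needs scala-reflect and scala-compiler on the classpath):

    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    // Build a toolbox from the runtime mirror, parse the string, and evaluate it.
    val toolbox = currentMirror.mkToolBox()
    val result = toolbox.eval(toolbox.parse("2 + 3 * 4")) // Any = 14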

And usually I use df.withColumn to add a new column.

It seems I need to create a UDF, but how can I pass the values of the columns as parameters to the UDF? Especially since different expressions may require different columns.
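For a fixed set of columns I know I could write something like this sketch (assuming the example DataFrame above is bound to df), but it does not generalize to arbitrary expressions:

    import org.apache.spark.sql.functions.{col, udf}

    // A UDF receives the column values as arguments, one per parameter;
    // the columns are supplied when the UDF is applied.
    val c5Udf = udf((c2: Int, c3: Int, c4: Int) => c2.toDouble / c3 + c4)
    val withC5 = df.withColumn("C5", c5Udf(col("C2"), col("C3"), col("C4")))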

Tags: scala, dataframe, apache-spark
3 answers

This can be done using expr to create a Column from an expression:

    val df = Seq((1, 2)).toDF("x", "y")
    val myExpression = "x+y"

    import org.apache.spark.sql.functions.expr
    df.withColumn("z", expr(myExpression)).show()

    +---+---+---+
    |  x|  y|  z|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+
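Since the expressions in the question come from a database, here is a hedged sketch of applying a whole list of (name, expression) pairs by folding withColumn over it; the pairs below are made up for illustration:

    // Hypothetical (columnName, expression) pairs as they might be loaded
    // from a database.
    val exprsFromDb = Seq("z" -> "x+y", "w" -> "x*y")

    // Add one column per pair.
    val enriched = exprsFromDb.foldLeft(df) { case (acc, (name, e)) =>
      acc.withColumn(name, expr(e))
    }
    enriched.show()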

Two approaches:

    import spark.implicits._ // so that you can use .toDF

    val df = Seq(
      ("steak", 1, 1, 150),
      ("steak", 2, 2, 180),
      ("fish", 3, 3, 100)
    ).toDF("C1", "C2", "C3", "C4")

    import org.apache.spark.sql.functions._

    // 1st approach, using expr
    df.withColumn("C5", expr("C2/(C3 + C4)")).show()

    // 2nd approach, using selectExpr
    df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()

    +-----+---+---+---+--------------------+
    |   C1| C2| C3| C4|                  C5|
    +-----+---+---+---+--------------------+
    |steak|  1|  1|150|0.006622516556291391|
    |steak|  2|  2|180| 0.01098901098901099|
    | fish|  3|  3|100| 0.02912621359223301|
    +-----+---+---+---+--------------------+
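For completeness, the same SQL expression string can also be run through spark.sql against a temporary view (the view name t below is illustrative):

    df.createOrReplaceTempView("t")
    spark.sql("SELECT *, C2/(C3 + C4) AS C5 FROM t").show()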

In Spark 2.x, you can create the new column C5 with the expression "C2 / C3 + C4" using withColumn() and org.apache.spark.sql.functions._:

    import spark.implicits._
    import org.apache.spark.sql.functions.col

    val currentDf = Seq(
      ("steak", 1, 1, 150),
      ("steak", 2, 2, 180),
      ("fish", 3, 3, 100)
    ).toDF("C1", "C2", "C3", "C4")

    val requiredDf = currentDf
      .withColumn("C5", col("C2") / col("C3") + col("C4"))
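Note that division binds tighter than addition, so this computes (C2/C3) + C4, unlike the parenthesized C2/(C3 + C4) in the previous answer. With the sample data, requiredDf.show() should print roughly:

    +-----+---+---+---+-----+
    |   C1| C2| C3| C4|   C5|
    +-----+---+---+---+-----+
    |steak|  1|  1|150|151.0|
    |steak|  2|  2|180|181.0|
    | fish|  3|  3|100|101.0|
    +-----+---+---+---+-----+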

Alternatively, you can do the same with org.apache.spark.sql.Column. (But the space complexity of this approach is higher than using org.apache.spark.sql.functions._, due to the extra Column objects created.)

    import org.apache.spark.sql.Column

    val requiredDf = currentDf
      .withColumn("C5", new Column("C2") / new Column("C3") + new Column("C4"))

This worked great for me. I am using Spark 2.0.2.

