I am using Scala and want to write my own DataFrame function. For example, I want to treat a column like an array, iterate over each element and do a calculation.
To get started, I'm trying to implement my own getMax method. So the column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here's what it looks like in Scala:

def getMax(inputArray: Array[Int]): Int = {
  var maxValue = inputArray(0)
  for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
    maxValue = inputArray(i)
  }
  maxValue
}
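Called on a plain Scala array, this does what I expect:

getMax(Array(3, 8, 2, 5, 9))  // returns 9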
This is what I have so far, but I get this error:

"value length is not a member of org.apache.spark.sql.Column"

and I don't know how else to iterate over the column.
import org.apache.spark.sql.Column

def getMax(col: Column): Column = {
  var maxValue = col(0)
  // this is where it fails to compile: a Column has no length and can't be looped over like an array
  for (i <- 1 until col.length if col(i) > maxValue) {
    maxValue = col(i)
  }
  maxValue
}
Once I can implement my own method, I will create a column function:

val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")
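From what I can tell, the closest thing in the DataFrame API would be wrapping the logic in a udf. This is only a sketch of what I mean, and I am assuming here that an array column arrives inside the UDF as a Seq[Int]:

import org.apache.spark.sql.functions.udf

// same loop as above, but wrapped so it can be applied to a Column
val getMaxUdf = udf { (values: Seq[Int]) =>
  var maxValue = values(0)
  for (i <- 1 until values.length if values(i) > maxValue) {
    maxValue = values(i)
  }
  maxValue
}

val value_max = getMaxUdf(df.col("value")).as("value_max")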
And then I hope I can use this in a SQL statement, for example:
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected result will be 9, given the input column [3,8,2,5,9]
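I assume I would first have to register the function under that name. Something like the following is what I have in mind, again assuming the array column is handed to the function as a Seq[Int]:

// register the max logic under the name used in the SQL statement
sqlContext.udf.register("value_max", (values: Seq[Int]) => values.max)

// then the query above should resolve value_max
val sample = sqlContext.sql("SELECT value_max(x) FROM table")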
I am following the answer from another Spark Scala thread on how to iterate over the rows of a DataFrame and add calculated values as new columns, where they create a private method for the standard deviation. The calculations I need to do will be more complicated than that (for example, I will compare each element in the column), so am I going in the right direction, or should I be looking more at user-defined functions?