When I work with DataFrames in Spark, I sometimes have to edit only the values of a specific column in the DataFrame. E.g. if I have a count column in my DataFrame, and I would like to add 1 to each count value, then I could either write a custom udf to do the job via the DataFrame's withColumn function, or I could map over the DataFrame's underlying RDD and then build a new DataFrame from the resulting RDD.
I would like to know how a udf works under the hood. How do map and udf compare in this case, and what is the difference in performance?
Thanks!