To expand on and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be irrelevant or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And you somehow get the maximum, or whatever else you want to memoize about the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, certain numeric types (Long and Double), other Metadata structures, and arrays of these, so to store our Int we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
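As an aside, MetadataBuilder also has put* methods for the other supported types, so you can stash several values in one Metadata object. A minimal sketch, where the extra keys and values are made up for illustration:

val richerMetadata = new sql.types.MetadataBuilder()
  .putLong("columnMax", randIntMax)                    // the value we actually computed above
  .putDouble("columnMean", 0.0)                        // hypothetical extra statistic
  .putString("generatedBy", "scala.util.Random")       // hypothetical provenance note
  .putBoolean("isSynthetic", true)                     // hypothetical flag
  .build()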
DataFrame.withColumn() actually has an overload that lets you supply a metadata argument at the end, but it is inexplicably marked [private], so we just do what it does and use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (in its randInt_withMax column) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and safely (sort of; for example, Metadata.getLong() and the other getters do not return Option and may throw a "key not found" exception):
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 2094414111
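If you would rather not risk that exception, you can guard with Metadata.contains and wrap the lookup in an Option yourself. A small helper sketch (getColumnMax is a made-up name, not a Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns the memoized value only if the metadata key is present.
def getColumnMax(df: DataFrame, column: String, key: String = "columnMax"): Option[Long] = {
  val md = df.schema(column).metadata
  if (md.contains(key)) Some(md.getLong(key)) else None
}

getColumnMax(dfWithMax, "randInt_withMax")  // Some(2094414111)
getColumnMax(dfWithMax, "randInt")          // None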
Attaching the max to a column makes sense in your case, but in the general case of binding metadata to a DataFrame rather than to a particular column, you would have to take the wrapper route described by the other answers, e.g. something like the sketch below.
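For completeness, a minimal, hypothetical sketch of such a wrapper (the class and field names are made up; this is not part of any Spark API):

import org.apache.spark.sql.DataFrame

// Hypothetical wrapper pairing a DataFrame with arbitrary DataFrame-level metadata.
case class DataFrameWithMeta(df: DataFrame, meta: Map[String, Any])

val wrapped = DataFrameWithMeta(df, Map("globalMax" -> randIntMax))
wrapped.meta("globalMax")  // the memoized maximum, now tied to the DataFrame as a whole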