If we have a pandas DataFrame with a category column and a column of values, we can subtract each category's mean from its values by doing the following:
df["DemeanedValues"] = df.groupby("Category")["Values"].transform(lambda g: g - numpy.mean(g))
As far as I understand, Spark DataFrames do not directly offer this group-by / transform operation (I am using PySpark on Spark 1.5.0). So what is the best way to implement this calculation?
I tried using a group-by / join as follows:
df2 = df.groupBy("Category").mean("Values")
df3 = df2.join(df, "Category")
But this is very slow, since, as I understand it, each category requires a full scan of the DataFrame.
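For reference, here is a fuller sketch of that join-based variant (the alias MeanValue and the exact column handling are illustrative assumptions, not part of the original code); it renames the aggregate column and then subtracts it:

from pyspark.sql import functions as F

# Compute per-category means; alias the aggregate so the joined column has a clean name
means = df.groupBy("Category").agg(F.avg("Values").alias("MeanValue"))

# Join the means back onto the original rows and subtract
df3 = df.join(means, "Category")
df3 = df3.withColumn("DemeanedValue", F.col("Values") - F.col("MeanValue"))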
I think (but have not verified) that I could speed this up significantly by collecting the result of the group-by / mean into a dictionary and then using that dictionary in a UDF, as follows:
nameToMean = {...}
f = lambda category, value: value - nameToMean[category]
categoryDemeaned = pyspark.sql.functions.udf(f, pyspark.sql.types.DoubleType())
df = df.withColumn("DemeanedValue", categoryDemeaned(df.Category, df.Values))
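A minimal end-to-end sketch of what that could look like, assuming the dictionary is built by collecting the (small) per-category aggregates to the driver; names such as MeanValue are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Build the category-to-mean dictionary on the driver (assumes the number of
# categories is small enough to collect)
rows = df.groupBy("Category").agg(F.avg("Values").alias("MeanValue")).collect()
nameToMean = {row["Category"]: row["MeanValue"] for row in rows}

# The dictionary is captured in the UDF's closure and shipped to the executors;
# broadcasting it explicitly is another option for larger lookup tables
demean = F.udf(lambda category, value: value - nameToMean[category], DoubleType())
df = df.withColumn("DemeanedValue", demean(df["Category"], df["Values"]))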
Is there an idiomatic way to express this type of operation without sacrificing performance?
python pandas apache-spark pyspark apache-spark-sql pyspark-sql