I have a Spark DataFrame with multiple columns. I would like to group the rows by one column and then find the mode of a second column within each group. With a pandas DataFrame, I would do something like this:
import numpy as np
import pandas as pd

max_value, num_values = 100, 50  # example sizes; not shown in the snippet above

rand_values = np.random.randint(max_value,
                                size=num_values).reshape((num_values // 2, 2))
rand_values = pd.DataFrame(rand_values, columns=['x', 'y'])
rand_values['x'] = rand_values['x'] > max_value / 2
rand_values['x'] = rand_values['x'].astype('int32')
print(rand_values)
import scipy.stats

def mode(series):
    # Mode of the 'y' column within one group.
    return scipy.stats.mode(series['y'])[0][0]

rand_values.groupby('x').apply(mode)
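(For reference, I believe the same per-group mode can also be expressed with pandas' built-in Series.mode instead of scipy, e.g.:

# Equivalent using Series.mode; .iloc[0] takes the first mode on ties.
rand_values.groupby('x')['y'].agg(lambda s: s.mode().iloc[0])

)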
In PySpark, I can find the mode of a single column like this:
import pyspark.sql.functions as F

df = sql_context.createDataFrame(rand_values)

def mode_spark(df, column):
    # Count each distinct value, then keep a row whose count equals the maximum.
    counts = df.groupBy(column).count()
    mode = counts.join(
        counts.agg(F.max('count').alias('count')),
        on='count'
    ).limit(1).select(column)
    return mode.first()[column]

mode_spark(df, 'x')
mode_spark(df, 'y')
I do not understand how to apply this function to grouped data. If it is not possible to apply this logic to grouped data, is there some other way to achieve the same effect?
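For concreteness, here is an untested sketch of one direction I can imagine, using a window function to rank the (x, y) pair counts within each group; I do not know whether this is correct or efficient:

from pyspark.sql import Window

# Untested sketch: count (x, y) pairs, then keep the most frequent y per x.
pair_counts = df.groupBy('x', 'y').count()
w = Window.partitionBy('x').orderBy(F.desc('count'))
modes_per_group = (pair_counts
                   .withColumn('rn', F.row_number().over(w))
                   .filter(F.col('rn') == 1)
                   .select('x', 'y'))
modes_per_group.show()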
Thank you in advance!