I have two DataFrames:

student_rdf = (studentid, name, ...)
student_result_rdf = (studentid, gpa, ...)
We need to join these two DataFrames, and we currently do it like this:
student_rdf.join(student_result_rdf, student_result_rdf["studentid"] == student_rdf["studentid"])
This is simple, but it causes a lot of data shuffling between worker nodes. Since both sides are joined on the same key, if both DataFrames could be partitioned by that key (studentid), the join should need no shuffle at all, because matching rows would already live on the same node. Is that possible?
Is there a way to partition the data by a column when reading a DataFrame? And if so, how can Spark recognize that the two DataFrames are partitioned by the same key?