I have two DataFrames:

student_rdf = (studentid, name, ...)
student_result_rdf = (studentid, gpa, ...)
We need to join these two DataFrames, and we currently do it like this:
student_rdf.join(student_result_rdf, student_result_rdf["studentid"] == student_rdf["studentid"])
This is simple, but it causes a lot of data shuffling between worker nodes. Since both sides are joined on the same key, if both DataFrames could be partitioned by that key (studentid), the join should need no shuffle at all, because matching rows would already live on the same node. Is that possible?
Is there a way to partition the data by a column when reading a DataFrame? And if so, how can Spark recognize that the two DataFrames are partitioned by the same key?