may or may not be the same, depending on the number of partitions and the data distribution. We can solve this with the RDD API itself.
For example:
I saved the sample data below in a file and uploaded it to HDFS:
1,type1,300
2,type1,100
3,type2,400
4,type2,500
5,type1,400
6,type3,560
7,type2,200
8,type3,800
and ran the following command:
sc.textFile("/spark_test/test.txt").map(x=>x.split(",")).filter(x=>x.length==3).groupBy(_(1)).mapValues(x=>x.toList.sortBy(_(2).toInt).map(_(0)).mkString("~")).collect()
output:
Array[(String, String)] = Array((type3,6~8), (type1,2~1~5), (type2,7~3~4))
That is, we grouped the records by type, sorted each group by price (converted to Int so the sort is numeric rather than lexicographic), and joined the identifiers with "~" as a separator. The one-liner above can be broken down into steps as shown below:
val validData=sc.textFile("/spark_test/test.txt").map(x=>x.split(",")).filter(x=>x.length==3)
val groupedData=validData.groupBy(_(1))
val sortedJoinedData=groupedData.mapValues(x=>x.toList.sortBy(_(2).toInt).map(_(0)).mkString("~"))
We can pull out a specific group with:
sortedJoinedData.filter(_._1=="type1").collect()
output:
Array[(String, String)] = Array((type1,2~1~5))
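For quick experimentation, the same pipeline can also be run without HDFS. The sketch below is a minimal self-contained variant, assuming only a running spark-shell: it builds the sample records with sc.parallelize instead of reading /spark_test/test.txt (that substitution is mine, for illustration).

// Build the sample records in memory (illustration only; the original reads them from HDFS).
val sample = Seq(
  "1,type1,300", "2,type1,100", "3,type2,400", "4,type2,500",
  "5,type1,400", "6,type3,560", "7,type2,200", "8,type3,800")
val validData = sc.parallelize(sample).map(_.split(",")).filter(_.length == 3)
// Group by type, sort each group numerically by price, join the ids with "~".
val sortedJoinedData = validData.groupBy(_(1)).mapValues(_.toList.sortBy(_(2).toInt).map(_(0)).mkString("~"))
sortedJoinedData.collect()
// e.g. Array((type3,6~8), (type1,2~1~5), (type2,7~3~4)); the order of the tuples may vary.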
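One design note: groupBy shuffles the full parsed arrays and materializes each group on a single executor, which can hurt if a group is very large. As a hedged alternative (not part of the original answer), a sketch with aggregateByKey ships only an (id, price) pair per record:

// Keep only (id, price) per record, so the shuffle carries less data than groupBy.
val joinedByType = validData
  .map(a => (a(1), (a(0), a(2).toInt)))          // (type, (id, price))
  .aggregateByKey(List.empty[(String, Int)])(
    (acc, v) => v :: acc,                        // fold a record into the partition-local group
    (a, b) => a ::: b)                           // merge groups across partitions
  .mapValues(_.sortBy(_._2).map(_._1).mkString("~"))
joinedByType.collect()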