Would it be right to partition the RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?
No, it would not. If you look at the Java Object.hashCode documentation, you will find the following statement in the general contract of hashCode:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
So unless a notion of equality based purely on the kind of device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but that is not required here.
Since, excluding specialized scenarios in SQL and GraphX, partitionBy is available only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:
deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))
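For completeness, here is a minimal sketch of the same idea as a runnable local job. The shape of DeviceData is an assumption (only kind appears in the question; id and reading are made up), as are the sample records, the local[2] master and n = 4.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical record type: only `kind` comes from the question,
// the other fields are placeholders for illustration.
case class DeviceData(kind: String, id: Long, reading: Double)

object PartitionByKind {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-by-kind").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data.
      val deviceDataRdd = sc.parallelize(Seq(
        DeviceData("thermostat", 1L, 21.5),
        DeviceData("camera", 2L, 0.0),
        DeviceData("thermostat", 3L, 19.0)
      ))

      val n = 4 // number of partitions, chosen arbitrarily here

      // Key each record by kind, then let HashPartitioner place the keys.
      val byKind = deviceDataRdd
        .map(dev => (dev.kind, dev))
        .partitionBy(new HashPartitioner(n))

      // Print (partition index, kind) pairs; records sharing a kind
      // show up under the same partition index.
      byKind
        .mapPartitionsWithIndex((idx, it) => it.map { case (kind, _) => (idx, kind) })
        .collect()
        .foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

DeviceData keeps its default equals and hashCode here; only the key of the pair decides the partition.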
Just keep in mind that if kind has low cardinality or a highly skewed distribution, using it for partitioning may not be the optimal solution.
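To see why low cardinality hurts, here is a rough sketch in plain Scala, with no Spark required. It mimics how HashPartitioner picks a partition, namely a non-negative modulo of the key's hashCode. The helper name partitionFor, the three kinds and the partition count of 16 are all assumptions for illustration.

```scala
// Mirrors the non-negative modulo that HashPartitioner applies to key.hashCode.
// `partitionFor` is a hypothetical helper, not part of the Spark API.
def partitionFor(key: String, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}

val kinds = Seq("thermostat", "camera", "sensor") // hypothetical device kinds
val numPartitions = 16

// Each distinct key maps to exactly one partition, so the number of
// non-empty partitions can never exceed the number of distinct keys.
val usedPartitions = kinds.map(partitionFor(_, numPartitions)).toSet
println(s"${usedPartitions.size} of $numPartitions partitions would receive data")
```

With three distinct kinds, at most three of the sixteen partitions can be non-empty, however many records there are. If one kind dominates the data, its partition holds most of the records.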