How to partition an RDD by key in Spark?

Given that the HashPartitioner docs say:

[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.

Let's say I want to partition DeviceData by its kind field.

 case class DeviceData(kind: String, time: Long, data: String) 

Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind ?

But considering that HashPartitioner takes the number of partitions as a parameter, I am confused: do I need to know the number of kinds in advance, and what happens if there are more kinds than partitions?

Is it correct that if I write partitioned data to disk, it will remain partitioned when it is read back?

My goal is to call

 deviceDataRdd.foreachPartition { d: Iterator[DeviceData] => ... } 

and have only DeviceData with the same kind value in the iterator.

2 answers

How about doing a groupByKey using kind ? Or another PairRDDFunctions method.

It seems to me that you don't really care about the partitioning as such, just that you get all of one specific kind in a single processing flow?

The pair RDD functions allow this:

 rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS)).foreachPartition(...) 

However, you can probably be a little safer with something like:

 rdd.keyBy(_.kind).reduceByKey(....) 

or mapValues, or one of the several other pair functions that guarantee you receive each group as a whole.
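A minimal sketch of the grouping semantics the answer suggests, using plain Scala collections as a stand-in for rdd.keyBy(_.kind).groupByKey() so it runs without a SparkContext; the record values are made up for illustration:

```scala
case class DeviceData(kind: String, time: Long, data: String)

val records = Seq(
  DeviceData("thermo", 1L, "21C"),
  DeviceData("gps",    2L, "52.5,13.4"),
  DeviceData("thermo", 3L, "22C")
)

// Stand-in for rdd.keyBy(_.kind).groupByKey():
// every record of one kind ends up in a single group.
val grouped: Map[String, Seq[DeviceData]] = records.groupBy(_.kind)

assert(grouped.keySet == Set("thermo", "gps"))
assert(grouped("thermo").map(_.data) == Seq("21C", "22C"))
```

With groupByKey in Spark, the same guarantee holds per key rather than per partition, which is usually what you actually need.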


Would it be correct to partition the RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?

It would not. If you take a look at the Java Object.hashCode documentation, you will find the following about the general contract of hashCode:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

So unless a notion of equality based solely on the kind of a device is suitable for your use case, and I seriously doubt it is, tinkering with hashCode to get a specific partitioning is a bad idea. In the general case you should implement your own Partitioner, but here it is not required.
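A hypothetical sketch of what the proposed override would actually do (the field values are invented): equals generated for the case class still compares all fields, yet every record of one kind now hashes identically, so hashing no longer reflects the class's own notion of equality:

```scala
// Hypothetical: DeviceData with hashCode overridden to depend on `kind` alone,
// as the question proposes.
case class DeviceData(kind: String, time: Long, data: String) {
  override def hashCode: Int = kind.hashCode
}

val a = DeviceData("sensor", 1L, "x")
val b = DeviceData("sensor", 2L, "y")

// Case-class equals compares all fields, so a != b ...
assert(a != b)
// ... yet both now produce the same hash code.
assert(a.hashCode == b.hashCode)
```

This does not technically violate the contract quoted above, but it silently couples every hash-based use of the class (HashMap, HashSet, shuffles) to the partitioning scheme, which is why keying by kind explicitly is the cleaner route.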

Since, excluding specialized scenarios in SQL and GraphX, partitionBy is available only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use the plain HashPartitioner:

 deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n)) 

Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
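A self-contained sketch of how HashPartitioner assigns partitions (it computes key.hashCode modulo numPartitions, adjusted to be non-negative); the helper name and the kind values are assumptions made for illustration. It also answers the question about kind counts versus partition counts: with more partitions than kinds, some partitions stay empty, and distinct kinds can still collide into one partition:

```scala
// Stand-alone reimplementation of HashPartitioner's assignment rule:
// partition = key.hashCode mod numPartitions, shifted into [0, numPartitions).
def partitionOf(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val kinds = Seq("thermo", "gps", "accel")

// With 8 partitions but only 3 kinds, at most 3 partitions are non-empty.
val used = kinds.map(partitionOf(_, 8)).toSet
assert(used.size <= kinds.size)

// Every assignment is a valid partition index.
assert(kinds.forall { k => val p = partitionOf(k, 8); p >= 0 && p < 8 })
```

Conversely, if there are more kinds than partitions, the pigeonhole principle forces several kinds into the same partition, so a foreachPartition iterator may mix kinds even after partitionBy; grouping by key is the only way to get a strict one-kind-per-iterator guarantee.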

