Skewed Apache Spark Data

I have two tables that I would like to join. One of them is badly skewed: most rows share a single key, so most of the work lands on one partition and my Spark job does not run in parallel.

I have heard and read about salting my keys to improve the distribution, and tried to implement it. https://www.youtube.com/watch?v=WyfHUNnMutg at 12:45 shows exactly what I would like to do.

Any help or advice would be appreciated. Thanks!


Yes: salt the keys of the large table (append a random suffix to each key), then replicate each row of the smaller table once per possible suffix so it can join against the new salted keys.
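To make the idea concrete, here is a minimal sketch of the salting technique using plain Python lists rather than Spark, so the key manipulation is easy to see. All names (salt_large, replicate_small, salted_join) and the replication factor R are illustrative, not part of any library:

```python
import random

R = 3  # replication factor (illustrative; tune to the observed skew)

def salt_large(rows):
    # Large side: append a random suffix 0..R-1 to each key, spreading
    # a single hot key across R partitions.
    return [((k, random.randrange(R)), v) for k, v in rows]

def replicate_small(rows):
    # Small side: emit one copy per suffix so every salted key can match.
    return [((k, i), v) for k, v in rows for i in range(R)]

def salted_join(large, small):
    # Group the salted large side, then join the replicated small side
    # against it on the (key, suffix) pair.
    grouped = {}
    for sk, v in salt_large(large):
        grouped.setdefault(sk, []).append(v)
    return [(k, a, b)
            for (k, i), b in replicate_small(small)
            for a in grouped.get((k, i), [])]

print(sorted(salted_join([("a", 1), ("a", 2), ("b", 3)],
                         [("a", "x"), ("b", "y")])))
# -> [('a', 1, 'x'), ('a', 2, 'x'), ('b', 3, 'y')]
```

Each large-side row carries exactly one suffix and the small side covers every suffix, so the result is identical to an ordinary join; the cost is replicating the small table R times.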

Here are some suggestions:

Tresata skew join RDD https://github.com/tresata/spark-skewjoin

A Python port of the same approach: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

With the tresata library, the join looks like this:

import com.tresata.spark.skewjoin.Dsl._  // brings the skewJoin() method in via implicits

rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1))
  .sortByKey(true)
  .collect
  .toList
