Skewed Apache Spark Data

I have two tables that I would like to join. One of them is badly skewed: most rows share a single key, so most of the work lands on one partition and my Spark job does not run in parallel.

I have heard and read about salting my keys to improve the distribution, and tried to implement it. https://www.youtube.com/watch?v=WyfHUNnMutg at 12:45 shows exactly what I would like to do.

Any help or advice would be appreciated. Thanks!


Yes: salt the keys of the large table (append a random suffix to each key), then replicate each row of the smaller table once per possible suffix so it can join against the new salted keys.
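To make the idea concrete, here is a minimal sketch of the salting technique using plain Python lists rather than Spark, so the key manipulation is easy to see. All names (salt_large, replicate_small, salted_join) and the replication factor R are illustrative, not part of any library:

```python
import random

R = 3  # replication factor (illustrative; tune to the observed skew)

def salt_large(rows):
    # Large side: append a random suffix 0..R-1 to each key, spreading
    # a single hot key across R partitions.
    return [((k, random.randrange(R)), v) for k, v in rows]

def replicate_small(rows):
    # Small side: emit one copy per suffix so every salted key can match.
    return [((k, i), v) for k, v in rows for i in range(R)]

def salted_join(large, small):
    # Group the salted large side, then join the replicated small side
    # against it on the (key, suffix) pair.
    grouped = {}
    for sk, v in salt_large(large):
        grouped.setdefault(sk, []).append(v)
    return [(k, a, b)
            for (k, i), b in replicate_small(small)
            for a in grouped.get((k, i), [])]

print(sorted(salted_join([("a", 1), ("a", 2), ("b", 3)],
                         [("a", "x"), ("b", "y")])))
# -> [('a', 1, 'x'), ('a', 2, 'x'), ('b', 3, 'y')]
```

Each large-side row carries exactly one suffix and the small side covers every suffix, so the result is identical to an ordinary join; the cost is replicating the small table R times.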

Here are some suggestions:

Tresata skew join RDD https://github.com/tresata/spark-skewjoin

A Python port of the same approach: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

With the tresata library, the join looks like this:

import com.tresata.spark.skewjoin.Dsl._  // brings the skewJoin() method in via implicits

rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1))
  .sortByKey(true)
  .collect
  .toList
