How to evenly distribute data across partitions in Spark?

To test how .repartition() works, I executed the following code:

 rdd = sc.parallelize(range(100))
 rdd.getNumPartitions()

rdd.getNumPartitions() returned 4. Then I ran:

 rdd = rdd.repartition(10)
 rdd.getNumPartitions()

This time rdd.getNumPartitions() returned 10, so there were now 10 partitions.

However, when I inspected the contents of the partitions:

 rdd.glom().collect() 

the result contained 4 non-empty lists and 6 empty lists. Why weren't any items allocated to the other 6 partitions?

1 answer

The repartition() algorithm contains logic that tries to redistribute data across partitions in the most efficient way. In this case your dataset is very small, and Spark does not consider it worthwhile to actually spread it over all 10 partitions. If you used a much larger range, such as range(100000), you would find that the data really does get redistributed.

If you want to force an exact number of partitions, you can specify it when loading the data in the first place. Spark will then try to distribute the data evenly across the partitions, even if that is not necessarily optimal. parallelize() accepts the number of partitions as a second argument:

  rdd = sc.parallelize(range(100), 10) 
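Under the hood, parallelize() cuts the sequence into numSlices contiguous, near-equal chunks, which is why this gives you 10 evenly filled partitions. A minimal pure-Python sketch of that slicing (slice_evenly is a hypothetical helper mimicking the behavior, not a Spark API):

```python
def slice_evenly(data, num_slices):
    # Contiguous chunks whose sizes differ by at most one element,
    # analogous to how Spark slices a sequence for parallelize().
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

parts = slice_evenly(list(range(100)), 10)
print([len(p) for p in parts])  # → [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```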

The same approach works when reading from a text file. Note that textFile()'s second argument is called minPartitions and is a lower bound rather than an exact count:

  rdd = sc.textFile('path/to/file/', minPartitions)
