When does a shuffle happen in Apache Spark?

I am optimizing the parameters in Spark and would like to know exactly how Spark shuffles the data.

Specifically, I have a simple word count program, and I would like to know how spark.shuffle.file.buffer.kb affects runtime. Right now, I only see a slowdown when I set this parameter very high (I assume this prevents every task's buffer from fitting in memory at the same time).

Can someone explain how Spark performs reductions? For example, data is read and partitioned into an RDD, and when an action is called, Spark sends tasks to the worker nodes. If the action is a reduce, how does Spark handle it, and how are the shuffle files or buffers related to this process?

1 answer

Question: When is a shuffle triggered in Spark?

Answer: Any join, cogroup, or *ByKey operation involves holding objects in hashmaps or in-memory buffers in order to group or sort them. join, cogroup, and groupByKey use these data structures in the tasks of the stages on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use these data structures in the tasks of the stages on both sides of the shuffles they trigger.
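To make the difference concrete for the word count case, here is a minimal plain-Python sketch. It does not use the Spark API; the function names and the toy `partition_of` partitioner are made up for illustration. It mimics a groupByKey-style count (grouping only on the fetching side) and a reduceByKey-style count (a hashmap on both sides), and counts how many records would cross the shuffle in each case.

```python
from collections import defaultdict

def partition_of(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (hypothetical, not Spark's).
    return sum(key.encode("utf-8")) % num_reducers

def group_by_key_style(map_partitions, num_reducers):
    """Ship every (word, 1) record; group and sum only on the fetching side."""
    shuffled = 0
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for records in map_partitions:
        for word, one in records:
            reducers[partition_of(word, num_reducers)][word].append(one)
            shuffled += 1
    counts = {w: sum(vs) for r in reducers for w, vs in r.items()}
    return counts, shuffled

def reduce_by_key_style(map_partitions, num_reducers):
    """Combine inside each map task first, then merge partial sums after the shuffle."""
    shuffled = 0
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for records in map_partitions:
        partial = defaultdict(int)  # map-side hashmap
        for word, one in records:
            partial[word] += one
        for word, subtotal in partial.items():
            reducers[partition_of(word, num_reducers)][word] += subtotal
            shuffled += 1
    counts = {w: c for r in reducers for w, c in r.items()}
    return counts, shuffled

# Two input partitions of text lines, turned into (word, 1) pairs per map task.
lines = [["a b a", "b a"], ["a a c"]]
map_partitions = [[(w, 1) for line in part for w in line.split()] for part in lines]

grouped_counts, grouped_shuffled = group_by_key_style(map_partitions, 2)
reduced_counts, reduced_shuffled = reduce_by_key_style(map_partitions, 2)
```

Both variants produce the same counts, but the reduceByKey-style version sends one record per distinct word per map task instead of one record per word occurrence.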

Explanation: How does shuffle work in Spark?

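The shuffle is implemented differently in Spark than in Hadoop. I don't know whether you are familiar with how it works in Hadoop, so let's focus on Spark.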

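On the map side, each map task in Spark writes a shuffle file (an OS disk buffer) for every reducer, and each file corresponds to a logical block in Spark. These files are not intermediate, in the sense that Spark does not merge them into larger partitioned files. Because scheduling overhead in Spark is lower, the number of mappers (M) and reducers (R) is much higher than in Hadoop. Shipping M*R files to the respective reducers can therefore incur significant overhead.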
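The M*R count can be sketched in a few lines of plain Python. This is a simplified model rather than Spark's implementation: `write_shuffle_files`, `fetch_for_reducer`, and `partition_of` are hypothetical names, and each "file" is just an in-memory list keyed by (map task, reducer).

```python
def partition_of(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (hypothetical, not Spark's).
    return sum(key.encode("utf-8")) % num_reducers

def write_shuffle_files(map_partitions, num_reducers):
    """Return {(map_id, reduce_id): records}, one entry per shuffle file."""
    files = {}
    for map_id, records in enumerate(map_partitions):
        # Each map task opens one bucket per reducer, even if it stays empty.
        for reduce_id in range(num_reducers):
            files[(map_id, reduce_id)] = []
        for key, value in records:
            files[(map_id, partition_of(key, num_reducers))].append((key, value))
    return files

def fetch_for_reducer(files, reduce_id):
    """Collect the bucket for one reducer from every map task."""
    fetched = []
    for (map_id, rid), records in sorted(files.items()):
        if rid == reduce_id:
            fetched.extend(records)
    return fetched

# M = 3 map tasks, R = 4 reducers.
map_partitions = [[("a", 1), ("b", 1)], [("a", 1)], [("c", 1), ("a", 1)]]
shuffle_files = write_shuffle_files(map_partitions, num_reducers=4)
file_count = len(shuffle_files)
records_for_a = fetch_for_reducer(shuffle_files, partition_of("a", 4))
```

With 3 map tasks and 4 reducers the model holds 12 files, and the reducer responsible for "a" has to read one file from each of the 3 map tasks.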

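As in Hadoop, Spark provides a parameter, spark.shuffle.compress, to compress map outputs. The compression codec can be Snappy (the default) or LZF. Snappy uses only about 33 KB of buffer per open file, which significantly reduces the risk of out-of-memory errors.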

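On the reduce side, Spark requires all shuffled data to fit in the memory of the corresponding reducer task, unlike Hadoop, which can spill it to disk. This only matters when the reducer task needs all of the shuffled data, for example for a groupByKey or reduceByKey operation. In that case Spark throws an out-of-memory exception, which has been a real challenge for developers.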

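Also, Spark has no overlapping copy phase, unlike Hadoop, where mappers push data to the reducers even before the map phase has finished. The shuffle is therefore a pull operation in Spark and a push operation in Hadoop. Each reducer also maintains a network buffer for fetching map outputs. The size of this buffer is set by spark.reducer.maxMbInFlight (48 MB by default).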
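As a rough illustration of what a cap such as spark.reducer.maxMbInFlight does, the plain-Python sketch below groups the map output blocks a reducer must pull into successive batches that each stay within the cap. It is a deliberately simplified model with a hypothetical `plan_fetches` helper and made-up block sizes; Spark's actual fetch logic is more elaborate.

```python
MB = 1024 * 1024

def plan_fetches(block_sizes, max_bytes_in_flight=48 * MB):
    """Group block sizes into successive batches, each within the in-flight cap.

    A single block larger than the cap gets a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for size in block_sizes:
        # Start a new batch when adding this block would exceed the cap.
        if current and current_bytes + size > max_bytes_in_flight:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Hypothetical sizes of the map output blocks destined for one reducer.
blocks = [20 * MB, 20 * MB, 20 * MB, 10 * MB, 60 * MB]
batches = plan_fetches(blocks)
batches_in_mb = [[size // MB for size in batch] for batch in batches]
```

In this model a larger cap means fewer round trips but more memory held per reducer task, which is the trade-off to keep in mind when tuning the buffer parameters.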
