Understanding Spark Shuffle Spill

If I understand correctly, when a reduce task goes about gathering its shuffle input blocks (from the outputs of various map tasks), it first keeps them in memory (Q1). When the executor's shuffle-reserved memory (Q2) is exhausted (this is before the change in memory management), the in-memory data is "spilled" to disk. If spark.shuffle.spill.compress is true, that in-memory data is written to disk in compressed form.
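For reference, a minimal sketch of how these settings would be passed in, assuming a plain SparkConf built by hand (the application name is only a placeholder; both properties default to true):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: the shuffle compression settings mentioned above (both default to true).
    val conf = new SparkConf()
      .setAppName("shuffle-spill-example")           // placeholder name
      .set("spark.shuffle.spill.compress", "true")   // compress data spilled to disk during shuffles
      .set("spark.shuffle.compress", "true")         // compress the map-side shuffle output files

    val sc = new SparkContext(conf)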

My questions:

Q0: Do I understand correctly?

Q1: Is the data gathered inside the reduce task always kept uncompressed in memory?

Q2: How can I estimate the amount of executor memory available for gathering shuffle blocks?

Q3: I have seen the claim that "shuffle spill happens when your dataset cannot fit in memory", but to my understanding, as long as the executor's shuffle-reserved memory is big enough to hold all the (uncompressed) shuffle input blocks of all its ACTIVE tasks, no spill should occur. Is that true?

If that is the case, then in order to avoid spills one needs to make sure that the (uncompressed) data ending up in all parallel reduce-side tasks is smaller than the memory fraction reserved for shuffle?

1 answer

This changes somewhat with 1.6: as of 1.6 Spark no longer reserves a fixed fraction of the heap just for shuffle, so the numbers below apply to the legacy (pre-1.6) memory model that the question refers to.
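If you are on 1.6 or later and want to reproduce the older behaviour for comparison, a minimal sketch, assuming the configuration is built by hand (the fractions shown are the legacy defaults):

    import org.apache.spark.SparkConf

    // Sketch: 1.6+ keeps the pre-1.6 memory model behind a flag.
    val conf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")    // fall back to the pre-1.6 model
      .set("spark.shuffle.memoryFraction", "0.2")   // legacy shuffle fraction (default)
      .set("spark.storage.memoryFraction", "0.6")   // legacy storage fraction (default)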

That said, there are a few things to keep in mind about how Apache Spark behaves here:

  • Your overall picture (Q0) is about right: a reduce task fetches its shuffle blocks and keeps them in memory while it aggregates or sorts them,
  • and once the task can no longer obtain memory from the shuffle pool, the in-memory data is spilled to disk.

  • What a task holds in memory is deserialized, uncompressed objects (Q1); settings such as spark.shuffle.compress and spark.shuffle.spill.compress only affect the bytes written to disk or sent over the network. The deserialized objects are usually considerably larger than the serialized shuffle blocks reported in the UI.

  • Under the pre-1.6 model, the shuffle pool (Q2) is roughly executor heap × spark.shuffle.memoryFraction (0.2 by default) × spark.shuffle.safetyFraction (0.8 by default), i.e. about 16% of the heap, and it is shared by all tasks running concurrently in that executor; see the sketch after this list.

  • In practice it also pays to watch garbage collection, for example with jstat -gcutil <pid> <period>, since heavy GC is often the first sign of memory pressure. If tasks keep spilling, increase the number of partitions (for example via spark.default.parallelism, or an explicit partition count on the wide operation) so that each reduce task handles less data.

  • As for Q3, the spill decision is made per task against that task's share of the pool, not against the size of the whole dataset; a single skewed reduce task can therefore spill even though everything would fit in memory overall. Spilling trades memory for extra disk I/O and serialization rather than failing the job, so a moderate amount of spill is not necessarily a problem.
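A back-of-the-envelope sketch of the estimate mentioned in the list above, under the legacy model; the heap size and task count are made-up example values, and dividing the pool evenly between concurrently running tasks is only an approximation of how Spark actually hands out shares:

    // Rough per-task shuffle memory estimate under the legacy (pre-1.6) model.
    // All numbers below are illustrative, not measured values.
    val executorHeapBytes     = 8L * 1024 * 1024 * 1024   // e.g. --executor-memory 8g
    val shuffleMemoryFraction = 0.2                       // spark.shuffle.memoryFraction default
    val shuffleSafetyFraction = 0.8                       // spark.shuffle.safetyFraction default
    val concurrentTasks       = 4                         // e.g. --executor-cores 4

    val shufflePoolBytes  = (executorHeapBytes * shuffleMemoryFraction * shuffleSafetyFraction).toLong
    val perTaskShareBytes = shufflePoolBytes / concurrentTasks

    println(f"shuffle pool ~ ${shufflePoolBytes / 1e9}%.2f GB, per task ~ ${perTaskShareBytes / 1e9}%.2f GB")

    // If a task's (deserialized) shuffle data exceeds its share, spills become likely.
    // A common remedy is more, smaller partitions, e.g. a higher spark.default.parallelism
    // or an explicit partition count on the wide operation:
    //   rdd.reduceByKey(_ + _, 400)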

