What is the difference between Shuffle Read and Shuffle Write?

I need to run a Spark program on a huge amount of data. I am trying to optimize it by working through the Spark UI and reducing the shuffle as much as possible.

Among the metrics shown there are Shuffle Read and Shuffle Write. I can roughly guess the difference from the names, but I would like to understand their exact meaning, and which of the two (shuffle read or shuffle write) hurts performance?

I searched the Internet but could not find a detailed explanation of them, so I wanted to see if anyone could explain them here.

+7
apache-spark apache-spark-sql
2 answers

From the Spark UI tooltips:

Shuffle Read

Total shuffle bytes and records read (includes both data read locally and data read from remote executors).

Shuffle Write

Bytes and records written to disk in order to be read by a shuffle in a future stage.
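
To see both metrics, it is enough to run a job with a wide transformation. Here is a minimal sketch in Scala (file paths are made up): the map stage reports Shuffle Write, and the stage created by reduceByKey reports Shuffle Read.

    import org.apache.spark.sql.SparkSession

    object ShuffleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
        val sc = spark.sparkContext

        // Stage 1: read the input and emit (word, 1) pairs. Its Shuffle Write
        // is the bytes/records written to local disk for the next stage to fetch.
        val pairs = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // reduceByKey forces a shuffle, so a new stage starts here.
        // Stage 2: its Shuffle Read is the bytes/records fetched, both locally
        // and from remote executors, that stage 1 wrote.
        val counts = pairs.reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///tmp/word-counts")    // hypothetical path
        spark.stop()
      }
    }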

+5

I recently started working with Spark and was looking for answers to the same question.

When data is shuffled from one stage to the next across the network, the executor(s) processing the next stage pull the data from the first stage's processes over TCP. I noticed that the Spark UI displays shuffle read and shuffle write figures for each stage of a particular job. A stage may also have an "Input" size (for example, input from HDFS or a Hive table scan).
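
For example, a DataFrame aggregation like the sketch below (table path and column names are invented) produces two stages in the UI: the scan stage shows an Input size plus a Shuffle Write, and the aggregation stage after the exchange shows a Shuffle Read.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object InputVsShuffle {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("input-vs-shuffle").getOrCreate()

        // Scan stage: "Input" = bytes read from HDFS; it also reports a
        // "Shuffle Write" for the rows it repartitions by customer_id.
        val sales = spark.read.parquet("hdfs:///warehouse/sales")    // hypothetical path

        // groupBy introduces an exchange (shuffle); the stage after it reports
        // a "Shuffle Read" for the rows fetched from the scan stage's output.
        val totals = sales.groupBy("customer_id").agg(sum("amount").as("total"))

        totals.write.mode("overwrite").parquet("hdfs:///tmp/totals") // hypothetical path
        spark.stop()
      }
    }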

I also noticed that the shuffle write size reported by one stage did not exactly match the shuffle read size reported by the stage that consumed it. If I remember correctly, there are reducer-like operations (map-side combining) that can be applied to the shuffle data before it is transferred to the next stage/executor as an optimization. Perhaps that contributes to the difference in size, and hence to the point of reporting both values.
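
To illustrate the map-side combining mentioned above, here is a rough sketch (input path is invented): for the same input, reduceByKey merges records per key before the shuffle write, while groupByKey ships every record, so the two jobs report very different Shuffle Write sizes.

    import org.apache.spark.SparkContext

    // Assumes an existing SparkContext `sc`.
    def compareShuffleWrites(sc: SparkContext): Unit = {
      val pairs = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))

      // No map-side combining: every (word, 1) record is written to the shuffle,
      // so Shuffle Write grows with the total number of records.
      val viaGroup = pairs.groupByKey().mapValues(_.sum)

      // Map-side combining: records are merged per key inside each task before
      // being written, so Shuffle Write grows only with the number of distinct keys.
      val viaReduce = pairs.reduceByKey(_ + _)

      // Trigger both jobs so their stages (and shuffle metrics) appear in the UI.
      viaGroup.count()
      viaReduce.count()
    }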

+2
