I recently started working with Spark. I was looking for answers to the same questions.
When data is shuffled from one stage to the next over the network, the executor(s) processing the next stage pull the data from the first stage's processes over TCP. I noticed that the Spark UI displays shuffle read and shuffle write metrics for each stage of a particular job. A stage can also have an "Input" size (for example, input from HDFS or a Hive table scan).
I noticed that the shuffle write size reported for one stage did not exactly match the shuffle read size reported for the stage that consumed it. If I remember correctly, there are operations, such as a combiner/reducer, that can be applied to the shuffle data before it is transferred to the next stage/executor, as an optimization. That may account for the difference in size, and hence for why both values are worth showing.
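To make that concrete, here is a minimal sketch (in Scala, with hypothetical HDFS paths) of a job whose first stage would show both an Input size and a Shuffle Write size in the UI, while the second stage would show a Shuffle Read size; reduceByKey's map-side combine is the kind of pre-shuffle aggregation I mean above.

    import org.apache.spark.sql.SparkSession

    object ShuffleMetricsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-metrics-sketch")
          .getOrCreate()
        val sc = spark.sparkContext

        // Stage 1: reading the file shows up as "Input" for that stage.
        val lines = sc.textFile("hdfs:///tmp/example/input") // hypothetical path

        // reduceByKey combines values on the map side before the shuffle,
        // so this stage's "Shuffle Write" can be much smaller than its "Input".
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        // Stage 2: its "Shuffle Read" is the data it pulls (over TCP, or
        // locally) from stage 1's shuffle files.
        counts.saveAsTextFile("hdfs:///tmp/example/output") // hypothetical path

        spark.stop()
      }
    }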
Dranyar