What does "Stage" mean in the Spark logs?

When I start a job with Spark, I get logs like the following:

[Stage 0:> (0 + 32) / 32]

Here, 32 corresponds to the number of RDD partitions I requested.
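
For reference, here is a minimal sketch (names are illustrative, not my exact code) of the kind of job that produces that progress bar by asking for 32 partitions:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-demo")  # illustrative app name

    # Ask for 32 partitions; Spark runs one task per partition, so the
    # console progress bar reports (finished + running) / total tasks.
    rdd = sc.parallelize(range(1000000), numSlices=32)

    rdd.count()  # triggers a job; the console shows [Stage 0:> (0 + 32) / 32]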

However, I do not understand why there are several stages and what exactly happens at each stage.

Each stage seems to take a lot of time. Is it possible to avoid having the work split into several stages?

mapreduce apache-spark pyspark apache-spark-sql
1 answer

A stage in Spark is a segment of the computation DAG whose tasks can run without moving data between executors. Stages are split at operations that require shuffling data, so in the Spark UI you will see each stage labeled with the operation that ends it. If you are using Spark 1.4+, you can even visualize this in the UI's DAG visualization section:

[Spark UI screenshot: DAG visualization showing the job split into two stages at reduceByKey]

Note that the split occurs at reduceByKey, which requires a shuffle to complete the full computation.
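
As a concrete illustration (a minimal word-count sketch, not necessarily the job from the question), reduceByKey forces a shuffle, so the job below runs as two stages; toDebugString() prints the lineage where the boundary falls:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-boundary-demo")  # illustrative app name

    words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=4)

    # map is a narrow transformation: it is pipelined together with
    # parallelize inside the first stage, with no data movement.
    pairs = words.map(lambda w: (w, 1))

    # reduceByKey must gather all values for each key onto one partition,
    # so Spark inserts a shuffle here and starts a second stage.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # The lineage shows a ShuffledRDD at the stage boundary.
    print(counts.toDebugString())
    print(counts.collect())  # e.g. [('a', 3), ('b', 2), ('c', 1)]

Anything chained after the shuffle (another map, a filter, and so on) is pipelined into that second stage; only a new shuffle starts a third one.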
