Resources / Documentation on how the fault tolerance process works for the Spark Driver (and its YARN container) in yarn-cluster mode

I'm trying to figure out whether the Spark Driver is a single point of failure when deploying in cluster mode on YARN. To that end, I would like to better understand the internals of the failover process for the YARN container of the Spark Driver in this context.

I know that the Spark Driver runs inside the Spark Application Master, which itself runs inside a YARN container. The Spark Application Master requests resources from the YARN Resource Manager when needed. But I could not find a document with sufficiently detailed information about failover in the event that the YARN container of the Spark Application Master (and hence the Spark Driver) fails.
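For reference, this is roughly how I submit the application so that the driver runs inside the ApplicationMaster container (the class and jar names are placeholders; older Spark releases spelled this `--master yarn-cluster`):

```shell
# yarn-cluster mode: the driver runs inside the YARN ApplicationMaster
# container on the cluster, not on the machine running spark-submit.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```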

I am trying to find detailed resources that would let me answer some questions about the following scenario: suppose the host machine of the YARN container that runs the Spark Application Master / Spark Driver loses network connectivity for 1 hour:

  • Does the YARN Resource Manager create a new YARN container with another Spark Application Master / Spark Driver?

  • In that case (a new YARN container is created), does the Spark Driver start from scratch if at least 1 stage in 1 of the executors had completed and had been reported as such to the original driver before it failed? Does the storage level passed to persist() matter here? Will the new Spark Driver know that the executor had completed stage 1? Would Tachyon help in this scenario?

  • What happens to the original YARN container with the Spark Application Master when the network comes back? Is it shut down at the YARN level, or does Spark itself terminate it?

Any pointers to articles/resources that cover this kind of "deep dive" would be much appreciated.


Yes, YARN will start a new container with a new Application Master (and thus a new Spark Driver) when the old one fails. The number of restart attempts is configurable, and limited by default. (Executor failures, by contrast, are handled by Spark itself: the driver simply requests replacement containers.)
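If I recall correctly, the relevant knobs are the following two settings (shown here in shorthand; the first actually lives in yarn-site.xml, the second in spark-defaults.conf, and the values are illustrative):

```properties
# yarn-site.xml: cluster-wide ceiling on ApplicationMaster restart attempts
yarn.resourcemanager.am.max-attempts = 2

# spark-defaults.conf: per-application cap; must not exceed the YARN ceiling
spark.yarn.maxAppAttempts = 2
```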

No, that state is not preserved: the new driver (in a new Application Master) starts from scratch. When the original driver dies, its executors are shut down as well, so anything they had computed or cached dies with them. The SparkContext is gone, and with it the RDDs it was tracking, including any blocks persisted in executor memory or on executor local disk.

The only results that survive are those written to external storage. Spark does not by itself checkpoint job progress across application attempts. An RDD that was saved to a reliable store (for example with saveAsTextFile to HDFS) is still there for the new attempt to read. An RDD that was merely persisted on the executors, however many of its stages had completed, has to be recomputed from scratch.
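A toy, pure-Python sketch of that distinction (this uses no Spark API at all; every name is invented for illustration, with a dict standing in for HDFS):

```python
# Pure-Python sketch (not Spark internals) of why a driver restart loses
# persist()'d data but keeps output already written to external storage.

external_storage = {}  # stands in for HDFS: survives application restarts

class Application:
    """One application attempt: a driver plus its executors' caches."""
    def __init__(self):
        self.executor_cache = {}  # persist() puts blocks here; dies with the app
        self.completed = set()    # stage results the driver knows about

    def run_stage(self, stage, compute):
        result = compute()
        self.executor_cache[stage] = result  # persist() to executor memory
        self.completed.add(stage)
        return result

    def save(self, stage, result):
        external_storage[stage] = result     # e.g. saveAsTextFile to HDFS

# Attempt 1: saves stage0's output durably, only persists stage1's.
attempt1 = Application()
attempt1.save("stage0", attempt1.run_stage("stage0", lambda: [1, 2, 3]))
attempt1.run_stage("stage1", lambda: [4, 5, 6])  # cached only, never saved

# The container is lost; YARN starts a fresh attempt. Nothing from the old
# driver or its executors carries over.
attempt2 = Application()
assert attempt2.completed == set()               # new driver starts from scratch
assert attempt2.executor_cache == {}             # persist()'d blocks are gone
assert external_storage["stage0"] == [1, 2, 3]   # durable output survived
```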

Tachyon could help in one respect: it stores data off-heap, outside the executor JVMs, so cached blocks can outlive the executors that wrote them. But the new driver still starts with no knowledge of the old application's progress, so your code would have to read those blocks back explicitly, just as it would with any other external store.
