How to get the map task id in Spark?

Is there a way to get the map task ID in Spark? For example, if each map task calls a user-defined function, can I obtain the identifier of the current map task from within that function?


I'm not sure what you mean by "map task id", but you can access task information using TaskContext:

    import org.apache.spark.TaskContext

    sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
      // TaskContext.get returns the context of the task running this partition
      val ctx = TaskContext.get
      val stageId = ctx.stageId
      val partId = ctx.partitionId
      val hostname = java.net.InetAddress.getLocalHost().getHostName()
      Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")
    }).collect.foreach(println)
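If by "map task id" you mean a unique identifier for the task itself, the closest match is TaskContext.taskAttemptId, a Long that is unique among all task attempts in the same SparkContext. Here is a minimal sketch of the scenario from the question; whoAmI is a hypothetical user-defined function, not part of any Spark API:

    import org.apache.spark.TaskContext

    // Hypothetical user-defined function: it does not receive the task id as
    // an argument but reads the ambient TaskContext of the task invoking it.
    def whoAmI(record: Int): String = {
      val ctx = TaskContext.get
      s"Record $record processed by task ${ctx.taskAttemptId} " +
        s"(partition ${ctx.partitionId}, attempt ${ctx.attemptNumber})"
    }

    sc.parallelize(1 to 8, 4).map(whoAmI).collect.foreach(println)

Note that taskAttemptId changes if a task is retried, while partitionId stays stable; attemptNumber tells the two apart.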

Similar functionality was added to PySpark in Spark 2.2.0 (SPARK-18576):

    from pyspark import TaskContext
    import socket

    def task_info(*_):
        # Inside a task, TaskContext() resolves to the running task's context
        ctx = TaskContext()
        return ["Stage: {0}, Partition: {1}, Host: {2}".format(
            ctx.stageId(), ctx.partitionId(), socket.gethostname())]

    for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
        print(x)
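The PySpark TaskContext exposes taskAttemptId() and attemptNumber() as well, so the Scala sketch above translates directly to Python.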
