I first run:

import os

lines = sc.textFile(os.path.join(folder_name), 100)  # 100 minimum partitions

and then:

from pyspark import StorageLevel

parsed_lines = (lines.map(lambda line: parse_line(line, ["udid"]))
                     .persist(StorageLevel.MEMORY_AND_DISK)
                     .groupByKey(1000)
                     .take(10))
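For context, parse_line is my own helper. A minimal sketch of its shape, assuming each input line is a JSON record and the listed fields form the grouping key (the real implementation differs in details):

import json

# Hypothetical sketch only, not the real helper: each line is assumed to be
# a JSON record, and the fields named in key_fields become the key that
# groupByKey shuffles on.
def parse_line(line, key_fields):
    record = json.loads(line)
    key = tuple(record[f] for f in key_fields)
    return (key, record)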
I get the following error:
...
ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 21
...
WARN TaskSetManager: Lost task 0.1 in stage 11.7 (TID 1151, <machine name>): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=896, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
I tried changing the following parameters, as well as the number of partitions passed to groupByKey and to textFile:
conf.set("spark.cores.max", "128")
conf.set("spark.akka.frameSize", "1024")
conf.set("spark.executor.memory", "6G")
conf.set("spark.shuffle.file.buffer.kb", "100")
I am not sure how to determine these parameters from the resources available on each worker, the size of the input, and the transformations I will apply.
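The closest thing I have found is the rule of thumb of sizing partitions to roughly 128 MB of input each, which I have been estimating like this (the 128 MB target is an assumption I have not verified, and the size scan only works for a local directory, not HDFS):

import os

# Heuristic sketch, not a verified recipe: aim for ~128 MB of input per
# partition. Assumes folder_name is a local directory; for HDFS the total
# size would have to come from the cluster instead.
target_partition_bytes = 128 * 1024 * 1024
total_input_bytes = sum(
    os.path.getsize(os.path.join(folder_name, f))
    for f in os.listdir(folder_name))
num_partitions = max(1, total_input_bytes // target_partition_bytes)
lines = sc.textFile(folder_name, int(num_partitions))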