From Davis Liu (DataBricks):
"At present, PySpark cannot serialize a class object defined in the current script (__main__). As a workaround, the class implementation can be moved into a separate module, which is then shipped with "bin/spark-submit --py-files xxx.py" at deployment.
in xxx.py:
class test(object):
    def __init__(self, a, b):
        self.total = a + b
in job.py:
from xxx import test
a = sc.parallelize([(True, False), (False, False)])
a.map(lambda (x, y): test(x, y))
run it:
bin/spark-submit --py-files xxx.py job.py
"
I just want to point out that you can pass the same --py-files argument to the PySpark shell too.
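The workaround above can be sketched end-to-end without a cluster. The snippet below (a minimal illustration, not Spark itself) writes the class into a real module file, puts its directory on sys.path the way --py-files does for worker processes, and then shows that an instance round-trips through pickle, which is exactly what fails when the class lives only in the driver's __main__:

```python
import os
import pickle
import sys
import tempfile

# Write the class into a standalone module, as the workaround recommends.
moddir = tempfile.mkdtemp()
with open(os.path.join(moddir, "xxx.py"), "w") as f:
    f.write(
        "class test(object):\n"
        "    def __init__(self, a, b):\n"
        "        self.total = a + b\n"
    )

# --py-files effectively makes the module importable on every worker;
# here we simulate that by extending sys.path locally.
sys.path.insert(0, moddir)
from xxx import test

obj = test(True, False)
print(obj.total)  # True + False == 1

# Pickle stores only a reference (module name, class name); because "xxx"
# is now an importable module rather than __main__, deserialization works.
restored = pickle.loads(pickle.dumps(obj))
print(restored.total)
```

This mirrors what Spark does internally: tasks are pickled on the driver and unpickled on workers, so every class they reference must be importable there.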