I am trying to run pyspark from my Mac to compute on an EC2 Spark cluster.
If I log into the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then I run a simple task:
>>> data = sc.parallelize(range(1000), 10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But if I try to do the same from my local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it cannot connect to the cluster:
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
  File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
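The warning says to check the cluster UI. For anyone else debugging this, the two obvious checks from the Mac are the master's web UI (port 8080 by default for a standalone master) and raw TCP reachability of the master port 7077, using the hostname from above:

$ open http://ec2-54-234-204-13.compute-1.amazonaws.com:8080   # master web UI; lists registered workers
$ nc -zv ec2-54-234-204-13.compute-1.amazonaws.com 7077        # raw TCP check of the master port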
I thought the problem was with EC2 security, but it doesn't help even after adding rules to both the master and slave security groups to accept all incoming ports.
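Roughly what I added, expressed with the AWS CLI for reproducibility (the group names below are the <cluster>-master and <cluster>-slaves groups that spark-ec2 creates for this cluster; opening everything to 0.0.0.0/0 is for debugging only):

$ aws ec2 authorize-security-group-ingress --group-name test-cluster2-master \
      --protocol tcp --port 0-65535 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-name test-cluster2-slaves \
      --protocol tcp --port 0-65535 --cidr 0.0.0.0/0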
Any help would be greatly appreciated!
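For reference, my understanding is that setting the MASTER environment variable is equivalent to constructing the context explicitly, so a standalone script against the same master would be this minimal sketch (the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

# Same master URL as the MASTER variable above; the app name is arbitrary.
conf = (SparkConf()
        .setMaster("spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077")
        .setAppName("remote-count-test"))
sc = SparkContext(conf=conf)

# The same job that succeeds when run on the cluster itself.
data = sc.parallelize(range(1000), 10)
print(data.count())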
Others have asked the same question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
Anthony