Initialize PySpark to predefine the SparkContext 'sc' variable

When using PySpark, I would like the SparkContext to be initialized automatically (in yarn-client mode) whenever a new notebook is created.

The following tutorials describe how to do this for earlier versions of IPython/Jupyter (< 4):

https://www.dataquest.io/blog/pyspark-installation-guide/

https://npatta01.imtqy.com/2015/07/22/setting_up_pyspark/

I'm not quite sure how to achieve this with Jupyter >= 4, given what is stated in http://jupyter.readthedocs.io/en/latest/migrating.html#since-jupyter-does-not-have-profiles-how-do-i-customize-it

I can manually create and configure the SparkContext myself, but I don't want our analysts to have to worry about this.
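For context, this is roughly the boilerplate that would otherwise have to be repeated at the top of every notebook (a minimal sketch; the app name is just an illustrative assumption):

 from pyspark import SparkConf, SparkContext

 # Per-notebook setup I would like analysts not to have to write themselves
 conf = SparkConf().setMaster("yarn-client").setAppName("analyst-notebook")
 sc = SparkContext(conf=conf)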

Does anyone have any ideas?

1 answer

Well, the missing profile functionality in Jupyter puzzled me in the past as well, although for a different reason - I wanted to be able to switch between different deep learning frameworks (Theano and TensorFlow) on demand; I eventually found a solution (described in my blog post here).

The point is that, although Jupyter has no profiles, the startup files for the IPython kernel still exist, and since PySpark uses that particular kernel, they can be used in your case.

So, assuming your PySpark is already set up to work with Jupyter, all you need to do is create a script init_spark.py along these lines:

from pyspark import SparkConf, SparkContext

# Run Spark in YARN client mode and expose the context as 'sc'
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)

and place it in the ~/.ipython/profile_default/startup/ directory.

You can then verify that sc is already defined when Jupyter starts:

 In [1]: sc
 Out[1]: <pyspark.context.SparkContext at 0x7fcceb7c5fd0>

 In [2]: sc.version
 Out[2]: u'2.0.0'
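As a quick sanity check (my own illustrative example, not part of the original setup), any notebook should now be able to use sc directly without further configuration:

 In [3]: sc.parallelize(range(100)).sum()
 Out[3]: 4950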

Another option might be Apache Toree, although I have not tried it myself.

