Jupyter & PySpark: how to run multiple notebooks

I use Spark 1.6.0 on three virtual machines: one master (standalone mode) and two workers, each with 8 GB RAM and 2 processors.

I am using the kernel configuration below:

 {
   "display_name": "PySpark",
   "language": "python3",
   "argv": [
     "/usr/bin/python3",
     "-m", "IPython.kernel",
     "-f", "{connection_file}"
   ],
   "env": {
     "SPARK_HOME": "<mypath>/spark-1.6.0",
     "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
     "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
   }
 }

This currently works: I can use the sc and sqlContext Spark contexts without importing anything, just as in the pyspark shell.
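If it helps, the kernel spec can also be written out programmatically instead of editing the JSON by hand. This is only a sketch: the `<mypath>`/`<mymaster>` placeholders are the ones from the question, and the target directory is just the usual per-user Jupyter kernels location, not something the question specifies.

```python
import json

# The kernel spec above as a Python dict. <mypath> and <mymaster> are
# placeholders from the question, not real values.
kernel_spec = {
    "display_name": "PySpark",
    "language": "python3",
    "argv": ["/usr/bin/python3", "-m", "IPython.kernel", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": "<mypath>/spark-1.6.0",
        "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
        # spark-submit expects the application name (pyspark-shell) last,
        # after all --conf / --driver-class-path options.
        "PYSPARK_SUBMIT_ARGS": (
            "--master spark://<mymaster>:7077 "
            "--conf spark.executor.memory=2G "
            "--driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar "
            "pyspark-shell"
        ),
    },
}

# Kernel specs live in a kernel.json file inside a kernels directory,
# e.g. ~/.local/share/jupyter/kernels/pyspark/kernel.json.
print(json.dumps(kernel_spec, indent=2))
```

Note that pyspark-shell goes at the end of PYSPARK_SUBMIT_ARGS; options placed after it are not parsed by spark-submit.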

The problem appears when I use several notebooks: on my Spark master UI, I see two pyspark-shell applications, which makes sense, but only one of them can be RUNNING at a time. And here "running" does not mean actually computing anything: even if I execute nothing in a notebook, its application still shows as RUNNING. Because of this, I cannot share my resources between notebooks, which is a pity (currently I have to kill the first shell (i.e. the kernel of the first notebook) to start the second one).

If you have any ideas on how to do this, let me know! Also, I'm not sure my way of setting up kernels is best practice; I already had trouble getting Spark and Jupyter to work together.

Thanks all!

1 answer

The problem is the database Spark uses to store its metastore: Derby. Derby is a lightweight database system that can only serve one Spark instance at a time. The solution is to configure a database system that supports multiple concurrent instances (PostgreSQL, MySQL, ...).
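To see why an embedded single-instance database blocks the second notebook, here is a small conceptual sketch in Python. This is not Derby's actual code, just an analogy: an embedded Derby metastore holds an exclusive lock on its database directory, so a second driver pointing at the same metastore cannot boot it.

```python
import os
import fcntl
import tempfile

# Stand-in for Derby's db.lck: an exclusive, non-blocking file lock.
lock_path = os.path.join(tempfile.gettempdir(), "metastore.lock")

first = open(lock_path, "w")
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)  # first "driver" boots fine

second = open(lock_path, "w")
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)  # second "driver"
    second_got_lock = True
except BlockingIOError:
    second_got_lock = False

print("second driver could boot the metastore:", second_got_lock)
```

A server-based database (PostgreSQL, MySQL) has no such single-owner lock, which is why switching the metastore backend lets several pyspark-shell applications coexist.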

For example, you can use postgres DB.

  • Add the PostgreSQL JDBC jar to Spark's jars directory
  • Add a configuration file (hive-site.xml) to Spark's conf directory
  • Install PostgreSQL on your machine
  • Create the user, password, and database that Spark/Hive will use in PostgreSQL (matching the values in your hive-site.xml)

Example linux shell:

 # download the postgres JDBC jar
 wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar

 # install the postgres server (Debian/Ubuntu; adapt to your distro --
 # note that "pip install postgres" only installs a Python client library,
 # not the database server itself)
 sudo apt-get install postgresql

 # add user, password and database to postgres
 psql -d postgres -c "create user hive"
 psql -d postgres -c "alter user hive with password 'pass'"
 psql -d postgres -c "create database hive_metastore"
 psql -d postgres -c "grant all privileges on database hive_metastore to hive"

hive-site.xml:

 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.postgresql.Driver</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>pass</value>
   </property>
 </configuration>
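If you prefer to generate hive-site.xml rather than write it by hand, a short sketch (same four property values as above; the output destination is your Spark conf directory):

```python
import xml.etree.ElementTree as ET

# The four metastore properties from hive-site.xml above.
props = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/hive_metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "pass",
}

configuration = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

xml_text = ET.tostring(configuration, encoding="unicode")
print(xml_text)  # write this to $SPARK_HOME/conf/hive-site.xml
```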