Jupyter & PySpark: how to run multiple notebooks

I use Spark 1.6.0 on three virtual machines: one master (standalone mode) and two workers, each with 8 GB RAM and 2 processors.

I am using the kernel configuration below:

 {
   "display_name": "PySpark",
   "language": "python3",
   "argv": [
     "/usr/bin/python3",
     "-m", "IPython.kernel",
     "-f", "{connection_file}"
   ],
   "env": {
     "SPARK_HOME": "<mypath>/spark-1.6.0",
     "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
     "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
   }
 }

This currently works: I can use the sc and sqlContext Spark contexts without importing anything, just as in the pyspark shell.
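If it helps, the kernel spec can also be written out programmatically instead of editing the JSON by hand. This is only a sketch: the `<mypath>`/`<mymaster>` placeholders are the ones from the question, and the target directory is just the usual per-user Jupyter kernels location, not something the question specifies.

```python
import json

# The kernel spec above as a Python dict. <mypath> and <mymaster> are
# placeholders from the question, not real values.
kernel_spec = {
    "display_name": "PySpark",
    "language": "python3",
    "argv": ["/usr/bin/python3", "-m", "IPython.kernel", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": "<mypath>/spark-1.6.0",
        "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
        # spark-submit expects the application name (pyspark-shell) last,
        # after all --conf / --driver-class-path options.
        "PYSPARK_SUBMIT_ARGS": (
            "--master spark://<mymaster>:7077 "
            "--conf spark.executor.memory=2G "
            "--driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar "
            "pyspark-shell"
        ),
    },
}

# Kernel specs live in a kernel.json file inside a kernels directory,
# e.g. ~/.local/share/jupyter/kernels/pyspark/kernel.json.
print(json.dumps(kernel_spec, indent=2))
```

Note that pyspark-shell goes at the end of PYSPARK_SUBMIT_ARGS; options placed after it are not parsed by spark-submit.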

The problem appears when I use several notebooks: on my Spark master UI, I see two pyspark-shell applications, which makes sense, but only one of them can be RUNNING at a time. And here "running" does not mean actually computing anything: even if I execute nothing in a notebook, its application still shows as RUNNING. Because of this, I cannot share my resources between notebooks, which is a pity (currently I have to kill the first shell (i.e. the kernel of the first notebook) to start the second one).

If you have any ideas on how to do this, let me know! Also, I'm not sure my way of setting up kernels is best practice; I already had trouble getting Spark and Jupyter to work together.

Thanks all!

1 answer

The problem is the database Spark uses to store its metastore: Derby. Derby is a lightweight database system that can only serve one Spark instance at a time. The solution is to configure a database system that supports multiple concurrent instances (PostgreSQL, MySQL, ...).
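To see why an embedded single-instance database blocks the second notebook, here is a small conceptual sketch in Python. This is not Derby's actual code, just an analogy: an embedded Derby metastore holds an exclusive lock on its database directory, so a second driver pointing at the same metastore cannot boot it.

```python
import os
import fcntl
import tempfile

# Stand-in for Derby's db.lck: an exclusive, non-blocking file lock.
lock_path = os.path.join(tempfile.gettempdir(), "metastore.lock")

first = open(lock_path, "w")
fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)  # first "driver" boots fine

second = open(lock_path, "w")
try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)  # second "driver"
    second_got_lock = True
except BlockingIOError:
    second_got_lock = False

print("second driver could boot the metastore:", second_got_lock)
```

A server-based database (PostgreSQL, MySQL) has no such single-owner lock, which is why switching the metastore backend lets several pyspark-shell applications coexist.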

For example, you can use postgres DB.

  • Add the PostgreSQL JDBC jar to Spark's jars directory
  • Add a configuration file (hive-site.xml) to Spark's conf directory
  • Install PostgreSQL on your machine
  • Create the user, password, and database that Spark/Hive will use in PostgreSQL (matching the values in your hive-site.xml)

Example linux shell:

 # download the postgres JDBC jar
 wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar

 # install the postgres server (Debian/Ubuntu; adapt to your distro --
 # note that "pip install postgres" only installs a Python client library,
 # not the database server itself)
 sudo apt-get install postgresql

 # add user, password and database to postgres
 psql -d postgres -c "create user hive"
 psql -d postgres -c "alter user hive with password 'pass'"
 psql -d postgres -c "create database hive_metastore"
 psql -d postgres -c "grant all privileges on database hive_metastore to hive"

hive-site.xml:

 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.postgresql.Driver</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>pass</value>
   </property>
 </configuration>
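If you prefer to generate hive-site.xml rather than write it by hand, a short sketch (same four property values as above; the output destination is your Spark conf directory):

```python
import xml.etree.ElementTree as ET

# The four metastore properties from hive-site.xml above.
props = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/hive_metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "pass",
}

configuration = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

xml_text = ET.tostring(configuration, encoding="unicode")
print(xml_text)  # write this to $SPARK_HOME/conf/hive-site.xml
```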