Connection from Spark / pyspark to PostgreSQL

I installed Spark on a Windows machine and want to use it through Spyder. After some troubleshooting, it seems to work:

    import os
    os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    # Run Spark locally with 8 worker threads
    spark_config = SparkConf().setMaster("local[8]")
    sc = SparkContext(conf=spark_config)
    sqlContext = SQLContext(sc)

    # Sanity check: count lines in the bundled README
    textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
    textFile.count()
    textFile.filter(lambda line: "Spark" in line).count()

    sc.stop()

This works as expected. Now I want to connect to the Postgres 9.3 database running on the same server. I downloaded the JDBC driver here and put it in the D:\Analytics\Spark\spark_jars folder. Then I created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:

 spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar' 

I then ran the following code to test the connection:

    import os
    os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    spark_config = SparkConf().setMaster("local[8]")
    sc = SparkContext(conf=spark_config)
    sqlContext = SQLContext(sc)

    # Load the "pubs" table over JDBC
    df = (sqlContext
          .load(source="jdbc",
                url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
                dbtable="pubs")
          )

    sc.stop()

But I get the following error:

    Py4JJavaError: An error occurred while calling o22.load.
    : java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
        at java.sql.DriverManager.getConnection(Unknown Source)
        at java.sql.DriverManager.getConnection(Unknown Source)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
        at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)

How can I check whether I downloaded the correct .jar file, and where else could the error be coming from?
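
One way to check whether the driver jar is actually visible to the driver JVM is to ask it for the driver class directly through py4j. This is a diagnostic sketch, not from the original post; org.postgresql.Driver is the class name shipped in the PostgreSQL JDBC jar:

    # After creating sc as above: raises a Py4JJavaError wrapping a
    # ClassNotFoundException if the jar is not on the driver classpath
    sc._jvm.java.lang.Class.forName("org.postgresql.Driver")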

2 answers

Remove spark-defaults.conf and instead set SPARK_CLASSPATH in the environment from Python, like this:

 os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar' 
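
For context, a minimal sketch of how this fits into the script from the question (the paths, jar version, and local[8] master are assumptions carried over from the question). SPARK_CLASSPATH must be set before the SparkContext is created, or it has no effect:

    import os
    # Set before the SparkContext starts
    os.environ["SPARK_CLASSPATH"] = "D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar"
    os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    sc = SparkContext(conf=SparkConf().setMaster("local[8]"))
    sqlContext = SQLContext(sc)

    # The JDBC load should now find the driver
    df = sqlContext.load(source="jdbc",
                         url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
                         dbtable="pubs")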

I tried the SPARK_CLASSPATH environment variable, but it does not work with Spark 1.6.

Answers to other posts, like the two below, suggest passing the driver jar through pyspark command-line arguments instead, and that works:

Unable to connect to postgres using jdbc in pyspark shell

Apache Spark: JDBC connection not working

 pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL> 
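
For example, with the setup from the question (the jar path and local master are assumptions based on the question, not from the original answer), the shell might be launched like this:

    pyspark --conf spark.executor.extraClassPath=D:\Analytics\Spark\spark_jars\postgresql-9.3-1103.jdbc41.jar --driver-class-path D:\Analytics\Spark\spark_jars\postgresql-9.3-1103.jdbc41.jar --jars D:\Analytics\Spark\spark_jars\postgresql-9.3-1103.jdbc41.jar --master local[8]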
