Failed to connect to postgres using jdbc in pyspark shell

I use a standalone cluster on my local Windows machine and try to download data from one of our servers using the following code -

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host/dbname",
                     dbtable="schema.tablename")

I set SPARK_CLASSPATH as follows -

os.environ['SPARK_CLASSPATH'] = r"C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

When executing sqlContext.load, it fails with the error "No suitable driver found for jdbc:postgresql". I tried searching on the internet but could not find a solution.
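For completeness, here is the exact sequence I run inside the pyspark shell; host, dbname and schema.tablename are placeholders for our real server and table:

import os

# Point Spark at the Postgres JDBC driver jar (raw string so the backslashes are kept literally)
os.environ['SPARK_CLASSPATH'] = r"C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar"

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext created by the pyspark shell

# This is the call that fails with the "No suitable driver" error
df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://host/dbname",
                     dbtable="schema.tablename")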

postgresql jdbc apache-spark pyspark apache-spark-sql
2 answers

Maybe this will be helpful.

In my environment, SPARK_CLASSPATH contains the path to the postgresql connector:

from pyspark import SparkContext, SparkConf
from pyspark.sql import DataFrameReader, SQLContext
import os

sparkClassPath = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')

# Populate configuration
conf = SparkConf()
conf.setAppName('application')
conf.set('spark.jars', 'file:%s' % sparkClassPath)
conf.set('spark.executor.extraClassPath', sparkClassPath)
conf.set('spark.driver.extraClassPath', sparkClassPath)
# Uncomment the line below and modify the IP address if you need to use a cluster on a different IP address
#conf.set('spark.master', 'spark://127.0.0.1:7077')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = 'postgresql://127.0.0.1:5432/postgresql'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='tablename', properties=properties)

df.printSchema()
df.show()

This piece of code allows you to use pyspark where you need it. For example, I used it in a Django project.
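To illustrate the Django remark: I keep the context creation in a small helper module and import it from my views. This is only a rough sketch under my own naming; spark_jdbc.py, get_sql_context and load_table are placeholders, not part of any Spark or Django API:

# spark_jdbc.py -- hypothetical helper module; names are placeholders
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import DataFrameReader, SQLContext

_sc = None
_sqlContext = None

def get_sql_context():
    """Create the SparkContext/SQLContext once and reuse it across calls."""
    global _sc, _sqlContext
    if _sqlContext is None:
        jar = os.getenv('SPARK_CLASSPATH', '/path/to/connector/postgresql-42.1.4.jar')
        conf = (SparkConf()
                .setAppName('django-report')
                .set('spark.jars', 'file:%s' % jar)
                .set('spark.driver.extraClassPath', jar)
                .set('spark.executor.extraClassPath', jar))
        _sc = SparkContext(conf=conf)
        _sqlContext = SQLContext(_sc)
    return _sqlContext

def load_table(table, url='jdbc:postgresql://127.0.0.1:5432/postgresql',
               user='username', password='password'):
    """Return the given Postgres table as a Spark DataFrame."""
    properties = {'user': user, 'password': password}
    return DataFrameReader(get_sql_context()).jdbc(url=url, table=table, properties=properties)

A view can then call load_table('schema.tablename') and, for small results, hand the rows to a template via collect() or toPandas().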


I had the same issue with MySQL and could never get it to work with the SPARK_CLASSPATH approach. However, I got it working with additional command-line arguments; see the answer to this question.

To spell it out, here is what you need to do to make it work:

 pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL> 
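Once the shell starts with those flags, the read itself is the standard JDBC path. A minimal sketch, assuming Spark 1.4+ (where sqlContext.read is available) and placeholder connection details:

# Inside the pyspark shell launched with --jars / --driver-class-path as above
url = 'jdbc:postgresql://127.0.0.1:5432/dbname'   # placeholder host and database
properties = {'user': 'username', 'password': 'password'}

df = sqlContext.read.jdbc(url=url, table='schema.tablename', properties=properties)
df.printSchema()
df.show(5)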
