Apache Spark: JDBC connection not working

I already asked this question earlier but did not receive an answer (I could not connect to Postgres using JDBC in the pyspark shell).

I successfully installed Spark 1.3.0 on my local Windows machine and ran some test programs in the pyspark shell.

Now I want to run Correlations from MLlib on data stored in PostgreSQL, but I cannot connect to PostgreSQL.

I added the required jar (and tested this jar) to the classpath by running

pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar" 

I can see in the Environment UI that the jar was added successfully.

When I ran the following in the pyspark shell -

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.load(source="jdbc", url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")

I get this ERROR -

    >>> df = sqlContext.load(source="jdbc", url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
        df = self._ssql_ctx.load(source, joptions)
      File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
      File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
    : java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
        at java.sql.DriverManager.getConnection(DriverManager.java:602)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
        at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:619)
+5
postgresql jdbc apache-spark apache-spark-sql
6 answers

I had this exact problem with MySQL/MariaDB and got the big clue from this question.

So your pyspark command should be:

 pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL> 

Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found in ..." — these probably mean that you specified the path incorrectly.
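With the jar path from the original question substituted in, the launch command would look roughly like this (a sketch for a single local machine, so --master is omitted; adjust the path to your setup):

    pyspark --conf spark.executor.extraClassPath=C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar --driver-class-path C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar --jars C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar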

+10

A slightly more elegant solution:

    import java.util.Properties

    val props = new Properties()
    props.put("driver", "org.postgresql.Driver")
    val df = sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", "[schema.table]", props)
+3

As jake256 suggested, the

"driver", "org.postgresql.Driver"

key-value pair was missing. In my case, I started pyspark as:

 pyspark --jars /path/to/postgresql-9.4.1210.jar 

and ran the following code:

    from pyspark.sql import DataFrameReader

    url = 'postgresql://192.168.2.4:5432/postgres'
    properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
    df = DataFrameReader(sqlContext).jdbc(
        url='jdbc:%s' % url, table='weather', properties=properties
    )
    df.show()

    +-------------+-------+-------+-----------+----------+
    |         city|temp_lo|temp_hi|       prcp|      date|
    +-------------+-------+-------+-----------+----------+
    |San Francisco|     46|     50|       0.25|1994-11-27|
    |San Francisco|     43|     57|        0.0|1994-11-29|
    |      Hayward|     54|     37|0.239999995|1994-11-29|
    +-------------+-------+-------+-----------+----------+
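For reference, the same read can also be written through sqlContext.read, which returns the same DataFrameReader (a minimal sketch reusing the illustrative host, credentials, and table from above):

    # Equivalent, more compact form via the reader exposed by the SQLContext.
    # Host, credentials, and table name are the same illustrative values as above.
    df = sqlContext.read.jdbc(
        url='jdbc:postgresql://192.168.2.4:5432/postgres',
        table='weather',
        properties={'user': 'myUser', 'password': 'myPasswd',
                    'driver': 'org.postgresql.Driver'}
    )
    df.show()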

Tested:

  • Ubuntu 16.04

  • PostgreSQL server version 9.5.

  • PostgreSQL driver used: postgresql-9.4.1210.jar

  • Spark version: spark-2.0.0-bin-hadoop2.6

  • it should also work with spark-2.0.0-bin-hadoop2.7

  • Java JDK 1.8, 64-bit

Other JDBC drivers can be found at: https://www.petefreitag.com/articles/jdbc_urls/

The tutorial I followed is here: https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/

A similar solution was also suggested here: pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

+2

This error seems to occur when using the wrong version of the JDBC driver. Check out https://jdbc.postgresql.org/download.html to make sure you have the correct one.

Please note in particular:

JDK 1.1 - JDBC 1. Note that with the 8.0 release, JDBC 1 support has been removed, so look to update your JDK when you update your server.

JDK 1.2, 1.3 - JDBC 2. JDK 1.3 + J2EE - JDBC 2 EE. This contains additional support for the javax.sql classes.

JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE, as this has been added to the J2SE release. JDK 1.6 - JDBC 4. Support for JDBC 4 methods is not complete, but the majority of methods are implemented.

JDK 1.7, 1.8 - JDBC 41. Support for JDBC 4 methods is not complete, but the majority of methods are implemented.
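Whichever driver version you end up with, a quick way to check from the pyspark shell that the driver class is visible to the driver-side JVM at all is to ask the JVM to load it (a diagnostic sketch, not from the original answer; it uses the internal sc._jvm py4j gateway):

    # Diagnostic sketch: ask the driver-side JVM to load the PostgreSQL driver class.
    # If the jar is not on the driver classpath, this raises a Py4J error wrapping
    # java.lang.ClassNotFoundException. `sc` is the SparkContext from the pyspark shell.
    sc._jvm.java.lang.Class.forName("org.postgresql.Driver")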

0

See this post: just put your script after all the parameters.
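In other words, with spark-submit all option flags come first and the application script comes last (an illustrative example; the jar path and script name are placeholders):

    spark-submit --jars /path/to/postgresql-9.4.1210.jar my_script.py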

0

It is pretty simple. To connect to an external database and retrieve data into Spark dataframes, an additional jar file is required. For example, MySQL requires the JDBC driver. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar to a path that is accessible from every node in the cluster. Preferably this is a path on a shared file system. For example, with a Pouta virtual cluster such a path would be under /shared_data; here I use /shared_data/thirdparty_jars/.

When submitting a Spark job directly from the terminal, you can specify the --driver-class-path argument to point to additional jars that should be shipped to the workers with the job. However, that does not work with this approach, so we must configure these paths for the front-end and worker nodes in the spark-defaults.conf file, usually found in the /opt/spark/conf directory.

    spark.driver.extraClassPath   /your-path/mysql-connector-java-5.1.35-bin.jar
    spark.executor.extraClassPath /your-path/mysql-connector-java-5.1.35-bin.jar
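With those entries in spark-defaults.conf, a JDBC read from the pyspark shell might look like the following (a minimal sketch; host, port, database, table, and credentials are hypothetical):

    # Minimal sketch, assuming the MySQL connector jar is on the classpath as
    # configured above; host, database, table, and credentials are hypothetical.
    df = sqlContext.read.jdbc(
        url='jdbc:mysql://db-host:3306/mydb',
        table='my_table',
        properties={'user': 'myUser', 'password': 'myPasswd',
                    'driver': 'com.mysql.jdbc.Driver'}
    )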

-1
