I have a simple Java application that can connect to and query my cluster using Hive or Impala, with code like this:
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    ...
    Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
    Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("select * from foobar");
But now I want to make the same query using Spark SQL. However, it's hard for me to understand how to use the Spark SQL API, specifically how to configure the connection. I see examples of how to configure a Spark session, but it is not clear what values I need to provide. For example:
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();
How do I tell Spark SQL which host and port to use, which schema to use, and which authentication technique to use? For example, I use Kerberos for authentication.
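To make the question concrete, here is roughly what I imagine the configuration might look like. The property names below (hive.metastore.uris, spark.yarn.principal, spark.yarn.keytab) and the metastore port are guesses from the documentation, not values I have confirmed work:

    // A sketch of what I imagine, not working code: the metastore URI and the
    // Kerberos principal/keytab properties are my guesses, and myHostIP,
    // MYREALM, and the keytab path are placeholders.
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL Kerberos example")
        .config("hive.metastore.uris", "thrift://myHostIP:9083")  // Hive metastore instead of the JDBC URL?
        .config("spark.yarn.principal", "myUser@MYREALM")         // Kerberos identity?
        .config("spark.yarn.keytab", "/path/to/myUser.keytab")
        .enableHiveSupport()
        .getOrCreate();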
The above Spark SQL code is located at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java
UPDATE:
I was able to make some progress, and I think I've figured out how to tell Spark SQL which host and port to use.
    ...
    SparkSession spark = SparkSession
        .builder()
        .master("spark://myHostIP:10000")
        .appName("Java Spark Hive Example")
        .enableHiveSupport()
        .getOrCreate();
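Once the session is created, I expect to be able to run the same query as before along these lines (I have not been able to test this yet, since the connection does not complete):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    ...
    // The query I hope to run once the session connects.
    Dataset<Row> result = spark.sql("select * from foobar");
    result.show();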
And I added the following dependency to my pom.xml file:
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>
With this update, I can see that the connection gets further, but it now appears to fail because I am not authenticated. I need to figure out how to authenticate using Kerberos. Here is the relevant log data:
    2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils : Successfully started service 'SparkUI' on port 4040.
    2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
    2017-12-19 11:17:56.065  INFO 11912 --- [er-threadpool-0] o.a.s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
    2017-12-19 11:17:56.260  INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
    2017-12-19 11:17:56.354  WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler : Exception in connection from myHostIP:10000
    java.io.IOException: An existing connection was forcibly closed by the remote host
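From what I have read so far, I suspect I need a programmatic keytab login through Hadoop's UserGroupInformation before creating the Spark session. This is only a sketch of what I plan to try; the principal and keytab path are placeholders, and I have not confirmed that this is the right approach:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;
    ...
    // Untested sketch: log in from a keytab before building the SparkSession.
    // "myUser@MYREALM" and the keytab path are placeholders, not real values.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(hadoopConf);
    UserGroupInformation.loginUserFromKeytab("myUser@MYREALM", "/path/to/myUser.keytab");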