I have a simple Java application that can connect to and query my cluster using Hive or Impala, with code like this:
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    ...
    Class.forName("com.cloudera.hive.jdbc41.HS2Driver");
    Connection con = DriverManager.getConnection("jdbc:hive2://myHostIP:10000/mySchemaName;hive.execution.engine=spark;AuthMech=1;KrbRealm=myHostIP;KrbHostFQDN=myHostIP;KrbServiceName=hive");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("select * from foobar");
But now I want to make the same query using Spark SQL. However, it's hard for me to understand how to use the Spark SQL API, specifically how to configure the connection. I see examples of how to configure a Spark session, but it is not clear what values I need to provide. For example:
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();
How do I tell Spark SQL which host and port to use, which schema to use, and which authentication technique to use? For example, I use Kerberos for authentication.
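To make the question concrete, here is roughly what I imagine the configuration might look like. The property names below (hive.metastore.uris, spark.yarn.principal, spark.yarn.keytab) and the metastore port are guesses from the documentation, not values I have confirmed work:

    // A sketch of what I imagine, not working code: the metastore URI and the
    // Kerberos principal/keytab properties are my guesses, and myHostIP,
    // MYREALM, and the keytab path are placeholders.
    SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL Kerberos example")
        .config("hive.metastore.uris", "thrift://myHostIP:9083")  // Hive metastore instead of the JDBC URL?
        .config("spark.yarn.principal", "myUser@MYREALM")         // Kerberos identity?
        .config("spark.yarn.keytab", "/path/to/myUser.keytab")
        .enableHiveSupport()
        .getOrCreate();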
The above Spark SQL code is located at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java
UPDATE:
I was able to make some progress, and I think I've figured out how to tell Spark SQL which host and port to use.
    ...
    SparkSession spark = SparkSession
        .builder()
        .master("spark://myHostIP:10000")
        .appName("Java Spark Hive Example")
        .enableHiveSupport()
        .getOrCreate();
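Once the session is created, I expect to be able to run the same query as before along these lines (I have not been able to test this yet, since the connection does not complete):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    ...
    // The query I hope to run once the session connects.
    Dataset<Row> result = spark.sql("select * from foobar");
    result.show();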
And I added the following dependency to my pom.xml file:
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>
With this update, I can see that the connection gets further, but it now appears to fail because I am not authenticated. I need to figure out how to authenticate using Kerberos. Here is the relevant log data:
    2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.util.Utils : Successfully started service 'SparkUI' on port 4040.
    2017-12-19 11:17:55.717  INFO 11912 --- [o-auto-1-exec-1] org.apache.spark.ui.SparkUI : Bound SparkUI to 0.0.0.0, and started at http://myHostIP:4040
    2017-12-19 11:17:56.065  INFO 11912 --- [er-threadpool-0] o.a.s.d.c.StandaloneAppClient$ClientEndpoint : Connecting to master spark://myHostIP:10000...
    2017-12-19 11:17:56.260  INFO 11912 --- [pc-connection-0] o.a.s.n.client.TransportClientFactory : Successfully created connection to myHostIP:10000 after 113 ms (0 ms spent in bootstraps)
    2017-12-19 11:17:56.354  WARN 11912 --- [huffle-client-0] o.a.s.n.server.TransportChannelHandler : Exception in connection from myHostIP:10000
    java.io.IOException: An existing connection was forcibly closed by the remote host
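From what I have read so far, I suspect I need a programmatic keytab login through Hadoop's UserGroupInformation before creating the Spark session. This is only a sketch of what I plan to try; the principal and keytab path are placeholders, and I have not confirmed that this is the right approach:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;
    ...
    // Untested sketch: log in from a keytab before building the SparkSession.
    // "myUser@MYREALM" and the keytab path are placeholders, not real values.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(hadoopConf);
    UserGroupInformation.loginUserFromKeytab("myUser@MYREALM", "/path/to/myUser.keytab");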