How to access RDD tables through Spark SQL as a JDBC distributed query mechanism?

Several stackoverflow posts have answers with partial information on how to access RDD tables through Spark SQL as a JDBC Distributed Query Engine. Therefore, I would like to ask the following questions for complete information on how to do this:

  • In a Spark SQL application, do we need to use a HiveContext to register tables, or can we use just a SQLContext?

  • Where and how do we use HiveThriftServer2.startWithContext?

  • When we run start-thriftserver.sh, as in

/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://Master:7077 --hiveconf hive.server2.thrift.bind.host Master --hiveconf hive.server2.thrift.port 10001

Besides specifying the jar and main class of the Spark SQL application, do I need to specify any other parameters?

  • Are there any other things we need to do?

Thanks.

2 answers

To expose temporary DataFrame tables through HiveThriftServer2.startWithContext(), you may need to write and run a simple application; you may not need to run start-thriftserver.sh at all.

To your questions:

  • HiveContext is required; in the spark-shell, sqlContext is already created as a HiveContext

  • write a simple application, for example:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver._

    val hiveContext = new HiveContext(sparkContext)
    hiveContext.parquetFile(path).registerTempTable("my_table1")

    HiveThriftServer2.startWithContext(hiveContext)

  • You do not need to run start-thriftserver.sh; instead, run your own application, for example:

    spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  • Nothing else needs to be done on the server side; it should start on the default port 10000. You can verify by connecting to the server with beeline, or with a plain JDBC client as sketched below.
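A minimal JDBC check (not part of the original answer), assuming the Thrift server from the example above is listening on localhost:10000, the temp table my_table1 has been registered, and the hive-jdbc driver is on the classpath:

    // Hypothetical standalone check: list the tables exposed by the Thrift server over JDBC.
    // Host, port, and user are assumptions; adjust them to match your setup.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000", "anyUsername", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("show tables")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // should include my_table1
                }
            }
        }
    }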


In Java, I was able to register DataFrames as temporary tables and read the table contents through beeline (just as if they were regular Hive tables).

I have not posted the entire program here (assuming you already know how to create DataFrames):

 import org.apache.spark.sql.hive.thriftserver.*;

 HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
 DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);

orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains all the class values (rows read from HBase).

OrgMaster is a serializable Java bean class.
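The answer does not show the OrgMaster class itself; as a purely hypothetical illustration (the field names below are invented), a bean that createDataFrame can map is simply a Serializable class with getters and setters for each column:

 // Hypothetical OrgMaster bean; the real class and its fields are not shown in the answer.
 // Spark maps each getter/setter pair to a column of the resulting DataFrame.
 import java.io.Serializable;

 public class OrgMaster implements Serializable {
     private String orgId;    // invented field
     private String orgName;  // invented field

     public String getOrgId() { return orgId; }
     public void setOrgId(String orgId) { this.orgId = orgId; }

     public String getOrgName() { return orgName; }
     public void setOrgName(String orgName) { this.orgName = orgName; }
 }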

 orgDf.registerTempTable("spark_org_master_table");
 HiveThriftServer2.startWithContext(sqlContext);

I submitted the program in local mode (since the Hive Thrift server is not already running on port 10000 on this machine):

 hadoop_classpath=$(hadoop classpath)
 HBASE_CLASSPATH=$(hbase classpath)

 spark-1.5.2/bin/spark-submit --name tempSparkTable --class packageName.SparkCreateOrgMasterTableFile \
   --master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
   --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}" \
   --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
   --conf "spark.executor.extraClassPath=${hadoop_classpath}" \
   --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
   /path/programName-SNAPSHOT.jar

In another terminal, start beeline and point it to the Thrift service started by this Spark program:

 /opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername 

The show tables command will display the table registered from Spark.

You can also describe the table; in this example:

 describe spark_org_master_table; 

Then you can run regular queries in beeline against this table (as long as the Spark program is still running).

