How to access RDD tables through Spark SQL as a JDBC distributed query mechanism?

Several stackoverflow posts have answers with partial information on how to access RDD tables through Spark SQL as a JDBC Distributed Query Engine. Therefore, I would like to ask the following questions for complete information on how to do this:

  • In a Spark SQL application, do we need to use a HiveContext to register tables, or can we use just a SQLContext?

  • Where and how do we use HiveThriftServer2.startWithContext?

  • When we run start-thriftserver.sh, as in

/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://Master:7077 --hiveconf hive.server2.thrift.bind.host Master --hiveconf hive.server2.thrift.port 10001

Besides specifying the jar and main class of the Spark SQL application, do I need to specify any other parameters?

  • Are there any other things we need to do?

Thanks.

2 answers

To expose temporary DataFrame tables through HiveThriftServer2.startWithContext(), you may need to write and run a simple application; you may not need to run start-thriftserver.sh at all.

To your questions:

  • HiveContext is required; in the spark-shell, sqlContext is already created as a HiveContext

  • write a simple application, for example:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver._

    val hiveContext = new HiveContext(sparkContext)
    hiveContext.parquetFile(path).registerTempTable("my_table1")

    HiveThriftServer2.startWithContext(hiveContext)

  • You do not need to run start-thriftserver.sh; instead, run your own application, for example:

    spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar

  • Nothing else needs to be done on the server side; it should start on the default port 10000. You can verify by connecting to the server with beeline, or with a plain JDBC client as sketched below.
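A minimal JDBC check (not part of the original answer), assuming the Thrift server from the example above is listening on localhost:10000, the temp table my_table1 has been registered, and the hive-jdbc driver is on the classpath:

    // Hypothetical standalone check: list the tables exposed by the Thrift server over JDBC.
    // Host, port, and user are assumptions; adjust them to match your setup.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000", "anyUsername", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("show tables")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // should include my_table1
                }
            }
        }
    }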


In Java, I was able to register DataFrames as temporary tables and read the table contents through beeline (just as if they were regular Hive tables).

I have not posted the entire program here (assuming you already know how to create DataFrames):

 import org.apache.spark.sql.hive.thriftserver.*;

 HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
 DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);

orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains all the class values (rows read from HBase).

OrgMaster is a serializable Java bean class.
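The answer does not show the OrgMaster class itself; as a purely hypothetical illustration (the field names below are invented), a bean that createDataFrame can map is simply a Serializable class with getters and setters for each column:

 // Hypothetical OrgMaster bean; the real class and its fields are not shown in the answer.
 // Spark maps each getter/setter pair to a column of the resulting DataFrame.
 import java.io.Serializable;

 public class OrgMaster implements Serializable {
     private String orgId;    // invented field
     private String orgName;  // invented field

     public String getOrgId() { return orgId; }
     public void setOrgId(String orgId) { this.orgId = orgId; }

     public String getOrgName() { return orgName; }
     public void setOrgName(String orgName) { this.orgName = orgName; }
 }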

 orgDf.registerTempTable("spark_org_master_table");
 HiveThriftServer2.startWithContext(sqlContext);

I submitted the program in local mode (since the Hive Thrift server is not already running on port 10000 on this machine):

 hadoop_classpath=$(hadoop classpath)
 HBASE_CLASSPATH=$(hbase classpath)

 spark-1.5.2/bin/spark-submit --name tempSparkTable --class packageName.SparkCreateOrgMasterTableFile \
   --master local[4] --num-executors 4 --executor-cores 4 --executor-memory 8G \
   --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}" \
   --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
   --conf "spark.executor.extraClassPath=${hadoop_classpath}" \
   --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
   /path/programName-SNAPSHOT.jar

In another terminal, start beeline and point it to the Thrift service started by this Spark program:

 /opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername 

The show tables command will display the table registered from Spark.

You can also describe the table; in this example:

 describe spark_org_master_table; 

Then you can run regular queries in beeline against this table (as long as the Spark program is still running).

