Extract data from a Hive table into Spark RDDs and perform a join

I have two tables in Hive/Impala. I want to fetch their data into Spark as RDDs and perform a join operation.

I do not want to pass the join query directly to my HiveContext. This is just an example; I have more use cases that are not possible with standard HiveQL, so I need to fetch all the rows, access the columns, and perform transformations.

Suppose I have two RDDs:

val table1 = hiveContext.hql("select * from tem1")
val table2 = hiveContext.hql("select * from tem2")

I want to join these RDDs on a column named account_id.

Ideally, I want to do something like this on the RDDs, through a Spark wrapper:

select * from tem1 join tem2 on tem1.account_id=tem2.account_id; 

You can register table1 and table2 as temporary tables and then run the join through the HiveContext:

table1.registerTempTable("t1")
table2.registerTempTable("t2")
val table3 = hiveContext.sql("select * from t1 join t2 on t1.account_id = t2.account_id")
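
For completeness, here is a minimal end-to-end sketch of this approach (an assumption-laden example: Spark 1.x, where a HiveContext is built from a SparkContext; the table names tem1/tem2 come from the question and the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-join"))
val hiveContext = new HiveContext(sc)

// Pull both Hive tables into Spark.
val table1 = hiveContext.sql("select * from tem1")
val table2 = hiveContext.sql("select * from tem2")

// Register them so they are visible to SQL, then join.
table1.registerTempTable("t1")
table2.registerTempTable("t2")
val table3 = hiveContext.sql(
  "select * from t1 join t2 on t1.account_id = t2.account_id")
table3.show()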

You can also use the DataFrame API (introduced in Spark 1.3), which supports relational operations such as join directly:

val table1 = hiveContext.sql("select * from tem1")
val table2 = hiveContext.sql("select * from tem2")
val common_attributes = Seq("account_id")
val joined = table1.join(table2, common_attributes)
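
If you need the join condition spelled out (for example, when the key columns have different names on each side), the expression form works too. A sketch, assuming Spark 1.3+; joinedExplicit is just an illustrative name:

// Explicit join expression. Note this keeps both account_id columns in the
// result, whereas the Seq("account_id") form above deduplicates the key column.
val joinedExplicit = table1.join(table2, table1("account_id") === table2("account_id"))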

The DataFrame API is documented at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame


table1 and table2 here are DataFrames. If you need the underlying RDDs, you can convert them:

lazy val table1_rdd = table1.rdd
lazy val table2_rdd = table2.rdd

You can then work with them as ordinary RDDs, joining one RDD with the other.
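
For example, one way to join at the RDD level is to key each RDD of Rows by account_id and use the pair-RDD join. This sketch builds on table1_rdd/table2_rdd above and assumes account_id is a Long; adjust the type, or use the positional getAs[T](index), to match your schema:

// Key each RDD[Row] by the account_id field.
val keyed1 = table1_rdd.keyBy(row => row.getAs[Long]("account_id"))
val keyed2 = table2_rdd.keyBy(row => row.getAs[Long]("account_id"))

// RDD[(Long, (Row, Row))]: one entry per matching pair of rows.
val joined_rdd = keyed1.join(keyed2)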

See also: https://issues.apache.org/jira/browse/SPARK-6608 and https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame


You can also select just the join column directly:

val table1 = hiveContext.sql("select account_id from tem1")
val table2 = hiveContext.sql("select account_id from tem2")
val joinedTable = table1.join(table2, "account_id")

Note that table1.join(table2) with no join column would produce a Cartesian product, so the key must be passed explicitly.
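
As a usage note, with only the key column selected on each side, the joined result is a single account_id column holding the keys present in both tables. A quick check (output depends on your data):

joinedTable.show(10)          // preview the matching account_ids
println(joinedTable.count())  // number of matching rows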