Extract data from a Hive table into Spark RDDs and perform a join

I have two tables in Hive/Impala. I want to fetch their data into Spark as RDDs and perform a join operation.

I do not want to pass the join query directly to my HiveContext. This is just an example; I have more use cases that are not possible with standard HiveQL, so I need to fetch all the rows, access the columns, and perform transformations.

Suppose I have two RDDs:

val table1 = hiveContext.hql("select * from tem1")
val table2 = hiveContext.hql("select * from tem2")

I want to join these RDDs on a column named account_id.

Ideally, I want to do something like this on the RDDs, through a Spark wrapper:

select * from tem1 join tem2 on tem1.account_id=tem2.account_id; 

You can register table1 and table2 as temporary tables and then run the join through the HiveContext:

table1.registerTempTable("t1")
table2.registerTempTable("t2")
val table3 = hiveContext.sql("select * from t1 join t2 on t1.account_id = t2.account_id")
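
For completeness, here is a minimal end-to-end sketch of this approach (an assumption-laden example: Spark 1.x, where a HiveContext is built from a SparkContext; the table names tem1/tem2 come from the question and the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-join"))
val hiveContext = new HiveContext(sc)

// Pull both Hive tables into Spark.
val table1 = hiveContext.sql("select * from tem1")
val table2 = hiveContext.sql("select * from tem2")

// Register them so they are visible to SQL, then join.
table1.registerTempTable("t1")
table2.registerTempTable("t2")
val table3 = hiveContext.sql(
  "select * from t1 join t2 on t1.account_id = t2.account_id")
table3.show()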

You can also use the DataFrame API (introduced in Spark 1.3), which supports relational operations such as join directly:

val table1 = hiveContext.sql("select * from tem1")
val table2 = hiveContext.sql("select * from tem2")
val common_attributes = Seq("account_id")
val joined = table1.join(table2, common_attributes)
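
If you need the join condition spelled out (for example, when the key columns have different names on each side), the expression form works too. A sketch, assuming Spark 1.3+; joinedExplicit is just an illustrative name:

// Explicit join expression. Note this keeps both account_id columns in the
// result, whereas the Seq("account_id") form above deduplicates the key column.
val joinedExplicit = table1.join(table2, table1("account_id") === table2("account_id"))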

The DataFrame API is documented at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame


table1 and table2 here are DataFrames. If you need the underlying RDDs, you can convert them:

lazy val table1_rdd = table1.rdd
lazy val table2_rdd = table2.rdd

You can then work with them as ordinary RDDs, joining one RDD with the other.
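
For example, one way to join at the RDD level is to key each RDD of Rows by account_id and use the pair-RDD join. This sketch builds on table1_rdd/table2_rdd above and assumes account_id is a Long; adjust the type, or use the positional getAs[T](index), to match your schema:

// Key each RDD[Row] by the account_id field.
val keyed1 = table1_rdd.keyBy(row => row.getAs[Long]("account_id"))
val keyed2 = table2_rdd.keyBy(row => row.getAs[Long]("account_id"))

// RDD[(Long, (Row, Row))]: one entry per matching pair of rows.
val joined_rdd = keyed1.join(keyed2)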

See also: https://issues.apache.org/jira/browse/SPARK-6608 and https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame


You can also select just the join column directly:

val table1 = hiveContext.sql("select account_id from tem1")
val table2 = hiveContext.sql("select account_id from tem2")
val joinedTable = table1.join(table2, "account_id")

Note that table1.join(table2) with no join column would produce a Cartesian product, so the key must be passed explicitly.
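
As a usage note, with only the key column selected on each side, the joined result is a single account_id column holding the keys present in both tables. A quick check (output depends on your data):

joinedTable.show(10)          // preview the matching account_ids
println(joinedTable.count())  // number of matching rows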