Spark HiveContext does not retrieve newly inserted records from Hive table

Question


I am using Spark 1.4 and connect to Hive through a HiveContext. I did the following:

import org.apache.spark.sql.hive.HiveContext

val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show

// this works: the result is shown as expected

Then I inserted a few rows into tab from the beeline console:

hx.refreshTable("tab")
hx.sql("select * from tab").show

// still shows only the old records, not the newly inserted ones

My question is: why did HiveContext not retrieve newly inserted records?


apache-spark-sql

david2028 Jul 21 '15 at 15:03


3 answers

vijay kumar · Answer 1 · 2015-07-24T11:56:57+0000

hiveContext.refreshTable(tableName: String) only refreshes the table metadata, not the actual data.

From the official documentation (credits: https://spark.apache.org ):

refreshTable(tableName: String): Unit

Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.

To retrieve newly inserted records, first drop the cached table and then cache it again, using uncacheTable(tableName: String) and cacheTable(tableName: String).
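A minimal sketch of that uncache/cache sequence, assuming an already created HiveContext. The object name TableReload and the helper reloadTable are made up for illustration; the calls themselves (isCached, uncacheTable, refreshTable, cacheTable) are the regular context methods. Whether this is enough to make the new rows visible may depend on your setup.

```scala
import org.apache.spark.sql.hive.HiveContext

object TableReload {
  // Hypothetical helper: drop any cached copy, refresh the metadata, then cache again.
  def reloadTable(hx: HiveContext, tableName: String): Unit = {
    // Guard with isCached so that an uncached table is simply skipped.
    if (hx.isCached(tableName)) hx.uncacheTable(tableName)
    hx.refreshTable(tableName)
    hx.cacheTable(tableName)
  }
}
```

With the context from the question this would be called as TableReload.reloadTable(hx, "tab"), followed by running the select again.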

Mohan · Answer 2 · 2016-04-07T12:34:54+0000

If the target table is partitioned, you need to include the PARTITION clause in the insert statement. If you leave the clause out, the data will not be visible.

 INSERT OVERWRITE TABLE tablename1
 PARTITION (partcol1=val1, partcol2=val2...)
 SELECT col1,col2,.... FROM tablename2
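If the insert is issued from code, one way to avoid forgetting the clause is to build the statement with a small helper. This is only a sketch: InsertStatements, partitionedInsert and the table and column names in the usage note are hypothetical, and partition values are expected to be passed already quoted.

```scala
object InsertStatements {
  // Hypothetical helper: builds an INSERT OVERWRITE statement that always carries a PARTITION clause.
  def partitionedInsert(
      target: String,
      partitionSpec: Seq[(String, String)],
      columns: Seq[String],
      source: String): String = {
    // Refuse to build a statement without any partition column.
    require(partitionSpec.nonEmpty, "a partitioned table needs at least one partition column")
    val spec = partitionSpec.map { case (col, value) => s"$col=$value" }.mkString(", ")
    s"INSERT OVERWRITE TABLE $target PARTITION ($spec) SELECT ${columns.mkString(", ")} FROM $source"
  }
}
```

For example, InsertStatements.partitionedInsert("sales", Seq("dt" -> "'2015-07-21'"), Seq("id", "amount"), "staging_sales") produces a statement that can be run from beeline or passed to hx.sql.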

reim · Answer 3 · 2015-10-13T13:24:32+0000

In a different case, I had an RDD coming from a Spark SQL statement through a HiveContext. The solution that worked for me after some experimentation was to actually recreate the RDD itself.

It doesn't matter whether you use Spark SQL DDL or send SQL statements directly through hiveContext.sql.

I have seen people use the "count trick" to force recomputation of the dataset, but at least in my attempts the new data did not show up that way.

In any case, cache, refresh and friends did not work for me. If someone has the correct pattern, please share.
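A rough sketch of what recreating the RDD can look like, assuming an existing HiveContext. FreshRead and rowsOf are hypothetical names; the point is only that the query is run again and a new RDD is taken from the new DataFrame, instead of reusing a reference obtained before the insert.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

object FreshRead {
  // Hypothetical helper: never hand back an old RDD; rebuild it from a new query each time.
  def rowsOf(hx: HiveContext, tableName: String): RDD[Row] = {
    // Refresh the metadata first, then run the select again to get a new DataFrame.
    hx.refreshTable(tableName)
    hx.sql(s"select * from $tableName").rdd
  }
}
```

Every call to FreshRead.rowsOf(hx, "tab") returns a new RDD, so any reference kept from before the insert should be dropped in favour of the new one.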
