I have an external table in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data roughly once a week. The merge is a conditional upsert: update rows that already exist and insert the new ones.
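For context, the weekly upsert I have in mind looks roughly like the sketch below. The table and column names (target, updates, id, val, updated_at) are placeholders, not my real schema:

    -- Hypothetical conditional upsert: update existing rows, insert new ones
    MERGE INTO target t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET val = u.val, updated_at = u.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.val, u.updated_at);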
The table's location is in S3, and the data is always there (the table was created once; we just need to keep updating it with new data).
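The external table is defined roughly like this (ORC storage as per the tags; the names and the S3 path are placeholders):

    CREATE EXTERNAL TABLE target (
      id         BIGINT,
      val        STRING,
      updated_at TIMESTAMP
    )
    STORED AS ORC
    LOCATION 's3://my-bucket/path/to/table/';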
I read this blog post about merging data in Hive using ACID on transactional tables ( https://dzone.com/articles/update-hive-tables-the-easy-way-part-2-hortonworks ), but as far as I can see, the only option is to copy my external table into a temporary internal Hive table that is clustered and transactional; only on that table can I run the merge, and I would then overwrite the original data with the merged result.
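As I understand it, that workaround would look something like the following (again with placeholder names), and the copy in step 2 is exactly what I would like to avoid:

    -- 1. Temporary internal table, bucketed and transactional, as ACID MERGE requires
    --    (ACID also needs hive.support.concurrency=true and
    --     hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager)
    CREATE TABLE target_acid (
      id         BIGINT,
      val        STRING,
      updated_at TIMESTAMP
    )
    CLUSTERED BY (id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- 2. Copy the ~10 GB of existing data from the external table
    INSERT INTO TABLE target_acid SELECT * FROM target;

    -- 3. Run the MERGE shown above, but against target_acid instead of target

    -- 4. Write the merged result back over the original external table
    INSERT OVERWRITE TABLE target SELECT * FROM target_acid;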
This table is quite large (about 10 GB of data), so I would like to avoid copying it before each merge operation.
Is there a way to create an internal table that maps onto the existing data? Or is there another way, besides the MERGE statement, to update external Hive tables?
Many thanks!
hadoop emr hive acid orc
Meori lehr