What is a metastore in Spark?

I am using SparkSQL in Python. I created a partitioned table (several hundred partitions) and stored it as an internal Hive table using hiveContext. The Hive warehouse is located in S3.

When I just do df = hiveContext.table("mytable"), it takes a minute to walk through all of the partitions the first time. I thought the metastore already saved all of this metadata. Why does Spark need to go through each partition? Can this step be avoided so that my startup is faster?
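For concreteness, a minimal sketch of the setup described above (the toy DataFrame, column names, and the "dt" partition column are hypothetical):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hiveContext = HiveContext(sc)

    # Toy DataFrame standing in for the real data; "dt" is the partition column.
    df = hiveContext.createDataFrame(
        [(1, "a", "2016-01-01"), (2, "b", "2016-01-02")],
        ["id", "value", "dt"])

    # Save as a partitioned internal (managed) Hive table; the warehouse
    # directory is configured to live in S3.
    df.write.partitionBy("dt").saveAsTable("mytable")

    # First access after startup: this is the step that takes about a minute,
    # because Spark walks every partition directory.
    df = hiveContext.table("mytable")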

1 answer

The key point here is that it only takes this long to load the file metadata on the first query. The reason is that SparkSQL does not store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information is stored in the metastore, so how the table behaves depends on how it was created. From the information provided, it sounds like you created a SparkSQL table.
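To make the distinction concrete, here is a hedged sketch of the two creation paths, continuing from the setup in the question (table names are illustrative); only the first keeps per-partition entries in the metastore:

    # A table created through Hive DDL: partitions are registered in the
    # metastore as they are added (e.g. via ALTER TABLE ... ADD PARTITION),
    # so listing them does not require scanning S3.
    hiveContext.sql("""
        CREATE TABLE hive_style (id INT, value STRING)
        PARTITIONED BY (dt STRING)
        STORED AS PARQUET
    """)

    # A table written through the DataFrame API: SparkSQL records the schema
    # and the root directory, but not the individual partitions, so they are
    # discovered dynamically at query time.
    df.write.partitionBy("dt").saveAsTable("spark_style")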

SparkSQL stores the table schema (which includes the partition information) and the root directory of your table, but it still discovers each partition directory in S3 dynamically when a query runs. My understanding is that this is a trade-off, so that you do not need to manually add new partitions whenever the table is updated.
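One way to sidestep the scan at startup, assuming you already know which partitions you need, is to bypass the table lookup and read the Parquet files by path (the bucket and path below are hypothetical); and since the discovery cost is only paid on the first query, keeping the same HiveContext alive lets later queries reuse the already-loaded metadata:

    # Read only the partition(s) you need directly from S3; Spark then lists
    # just the named directory instead of walking the whole table root.
    # (When reading a leaf directory directly, the "dt" value comes from the
    # path and will not appear as a column in the result.)
    df_one_day = hiveContext.read.parquet(
        "s3://my-bucket/warehouse/mytable/dt=2016-01-01")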

