How to get schema / column names from a Parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet

I tried running hdfs dfs -text dir/part-m-00000.gz.parquet, but the output is compressed, so I ran gunzip part-m-00000.gz.parquet. That does not decompress the file, though, because gunzip does not recognize the .parquet extension.

How can I get the schema / column names for this file?

hadoop hdfs apache-pig parquet
2 answers

You cannot "open" the file with hdfs dfs -text because it is not a text file; Parquet files are written to disk in a binary, columnar layout, quite unlike plain text. The .gz in the name refers to the gzip codec applied to the column chunks inside the file, not to the file as a whole, which is why gunzip refuses to touch it.
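A quick way to see this for yourself, sketched under the assumption that you first pull the file out of HDFS to the local machine:

 # Copy the file from HDFS to the local machine (path from the question).
 hdfs dfs -get dir/part-m-00000.gz.parquet .

 # Parquet files start (and end) with the 4-byte magic PAR1,
 # whereas a gzip stream starts with the bytes 1f 8b -- which
 # is why gunzip rejects this file outright.
 head -c 4 part-m-00000.gz.parquet    # prints: PAR1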

On exactly this point, the Parquet project provides parquet-tools to do the things you are trying to do: open a file and look at its schema, data, metadata, and so on.

Take a look at the parquet-tools project (which is, simply put, a jar file): parquet-tools

Cloudera, which supports Parquet and contributes to it significantly, also has a nice page with examples of parquet-tools usage. The example from that page for your use case is

 parquet-tools schema part-m-00000.parquet 

Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.
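If the file is still sitting in HDFS, the same jar can be run through hadoop jar so that HDFS paths resolve. A rough sketch, assuming a parquet-tools jar is available; the version number in the file name is illustrative only:

 # Print the schema of the file in place on HDFS.
 hadoop jar parquet-tools-1.6.0.jar schema dir/part-m-00000.gz.parquet

 # Two related subcommands from the same jar:
 hadoop jar parquet-tools-1.6.0.jar meta dir/part-m-00000.gz.parquet   # row groups, codecs, statistics
 hadoop jar parquet-tools-1.6.0.jar head dir/part-m-00000.gz.parquet   # first few records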


Since this is not a text file, you cannot run -text against it. You can, however, read it easily through Hive even without parquet-tools installed, provided you can load the file into a Hive table.
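To make the Hive route concrete, here is a minimal sketch run from the shell. The table name and the two columns are hypothetical (the declared columns must match the file's actual schema, as Hive will not infer them from the file), and STORED AS PARQUET assumes Hive 0.13 or later:

 hive -e "
 CREATE EXTERNAL TABLE parquet_probe (id BIGINT, name STRING)  -- hypothetical columns
 STORED AS PARQUET                                             -- Hive 0.13+ syntax
 LOCATION '/dir/';                                             -- HDFS dir holding part-m-00000.gz.parquet
 SELECT * FROM parquet_probe LIMIT 10;
 "

Since the table is EXTERNAL, dropping parquet_probe afterwards leaves the file in HDFS untouched.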

