How to get schema / column names from a Parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet

I tried running hdfs dfs -text dir/part-m-00000.gz.parquet, but the output is compressed, so I ran gunzip part-m-00000.gz.parquet. That does not decompress the file, though, because gunzip does not recognize the .parquet extension.

How can I get the schema / column names for this file?

hadoop hdfs apache-pig parquet
2 answers

You cannot "open" the file with hdfs dfs -text because it is not a text file; Parquet files are written to disk in a binary, columnar layout, quite unlike plain text. The .gz in the name refers to the gzip codec applied to the column chunks inside the file, not to the file as a whole, which is why gunzip refuses to touch it.
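A quick way to see this for yourself, sketched under the assumption that you first pull the file out of HDFS to the local machine:

 # Copy the file from HDFS to the local machine (path from the question).
 hdfs dfs -get dir/part-m-00000.gz.parquet .

 # Parquet files start (and end) with the 4-byte magic PAR1,
 # whereas a gzip stream starts with the bytes 1f 8b -- which
 # is why gunzip rejects this file outright.
 head -c 4 part-m-00000.gz.parquet    # prints: PAR1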

On exactly this point, the Parquet project provides parquet-tools to do the things you are trying to do: open a file and look at its schema, data, metadata, and so on.

Take a look at the parquet-tools project (which is, simply put, a jar file): parquet-tools

Cloudera, which supports Parquet and contributes to it significantly, also has a nice page with examples of parquet-tools usage. The example from that page for your use case is

 parquet-tools schema part-m-00000.parquet 

Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.
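If the file is still sitting in HDFS, the same jar can be run through hadoop jar so that HDFS paths resolve. A rough sketch, assuming a parquet-tools jar is available; the version number in the file name is illustrative only:

 # Print the schema of the file in place on HDFS.
 hadoop jar parquet-tools-1.6.0.jar schema dir/part-m-00000.gz.parquet

 # Two related subcommands from the same jar:
 hadoop jar parquet-tools-1.6.0.jar meta dir/part-m-00000.gz.parquet   # row groups, codecs, statistics
 hadoop jar parquet-tools-1.6.0.jar head dir/part-m-00000.gz.parquet   # first few records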


Since this is not a text file, you cannot run -text against it. You can, however, read it easily through Hive even without parquet-tools installed, provided you can load the file into a Hive table.
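To make the Hive route concrete, here is a minimal sketch run from the shell. The table name and the two columns are hypothetical (the declared columns must match the file's actual schema, as Hive will not infer them from the file), and STORED AS PARQUET assumes Hive 0.13 or later:

 hive -e "
 CREATE EXTERNAL TABLE parquet_probe (id BIGINT, name STRING)  -- hypothetical columns
 STORED AS PARQUET                                             -- Hive 0.13+ syntax
 LOCATION '/dir/';                                             -- HDFS dir holding part-m-00000.gz.parquet
 SELECT * FROM parquet_probe LIMIT 10;
 "

Since the table is EXTERNAL, dropping parquet_probe afterwards leaves the file in HDFS untouched.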

