Hadoop ORC File - How It Works - How to Get Metadata

I am new to the ORC file format. I went through many blogs, but still don't have a clear understanding. Please help and clarify the questions below.

  • Can I get the schema from an ORC file? I know that in Avro the schema can be fetched.

  • How does it actually support schema evolution? I know that you can add new columns, but how do you do it? The only way I know to create an ORC file is to load data into a Hive table that stores its data in ORC format.

  • How does the ORC file index work? What I know is that an index is maintained for each stripe. But since the file is not sorted, how does it help locate data across the stripes? How does it help skip stripes when searching for data?

  • Is an index maintained for each column? If so, is it going to consume more memory?

  • ORC is a columnar format, so the values of each column are stored together, whereas a Hive table is written record by record. How do the two fit together?

2 answers

1. and 2. Use Hive and/or HCatalog to create, read, and update the structure of an ORC table in the Hive metastore (HCatalog is just a side door that lets Pig/Sqoop/Spark/etc. access the metastore directly).
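For illustration, a minimal HiveQL sketch (the table and column names are made up) that creates an ORC-backed table and reads its structure back from the metastore:

```sql
-- Hypothetical table, just to show the DDL; the schema lives in the metastore.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
STORED AS ORC;

-- Read the structure back (columns, storage format, location, ...):
DESCRIBE FORMATTED orders;
```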

2. The ALTER TABLE command allows you to add/drop columns whatever the storage type, ORC included. But beware of a nasty bug that can crash vectorized reads afterwards (at least in V0.13 and V0.14).
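A hedged sketch, reusing the hypothetical `orders` table from above:

```sql
-- Adding a column is a metadata-only change; existing ORC files simply
-- return NULL for the new column until they are rewritten.
ALTER TABLE orders ADD COLUMNS (order_date STRING);
-- Dropping is typically done with REPLACE COLUMNS, but support for it
-- varied across Hive versions and storage formats.
```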

3. and 4. The term "index" is rather improper. Basically it is min/max information persisted in the stripe footer at write time, then used at read time to skip all stripes that clearly do not meet the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in column stores, e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing]).
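A small sketch of how that plays out in Hive: `hive.optimize.index.filter` is the setting that enables ORC predicate push-down (it was off by default in the Hive versions mentioned above), and the query itself is made up:

```sql
SET hive.optimize.index.filter=true;

-- Only stripes whose [min, max] range for `amount` overlaps the predicate
-- are read at all; every other stripe is skipped, saving the I/O.
SELECT order_id FROM orders WHERE amount > 1000;
```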

5. Hive works with "row store" formats (Text, SequenceFile, AVRO) and "column store" formats (ORC, Parquet). The optimizer just uses specific strategies and shortcuts in the initial Map phase (e.g. stripe elimination, vectorized operators), and of course the serialization/deserialization phases are a bit more elaborate with column stores.
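Vectorized execution (processing batches of rows at once instead of row by row) is one of those shortcuts, and it is tied to the columnar formats; a minimal illustration:

```sql
-- Works on ORC tables; on a Text or SequenceFile table Hive falls back
-- to the classic row-by-row operators.
SET hive.vectorized.execution.enabled=true;

SELECT customer_id, SUM(amount)
FROM   orders
GROUP  BY customer_id;
```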


Hey, I can't help you with all your questions, but I'll give it a try.

  • you can use the filedump utility to read the metadata of an ORC file, see here (a sample invocation is sketched after this list)

  • I am very unsure about schema evolution, but as far as I know, ORC does not support it.

  • The ORC index stores min, max, and sum, so if your data is completely unsorted you will probably still read a lot of data. But since the latest ORC version you can use an optional Bloom filter, which is more precise at eliminating row groups; see the example after this list. The ORC user mailing list might also be helpful here.

  • ORC provides an index for each column, but it is just a lightweight index: min/max information (plus the sum for numeric columns) is kept in the file footer, in each stripe footer, and by default for every 10,000 rows, so it does not take up much space.

  • If you store your table in the ORC file format, Hive will use a specific ORC RecordReader to reassemble the rows from the columns. The advantage of the columnar storage is that you do not need to read an entire row.
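About the filedump utility from the first bullet: it ships with Hive and prints an ORC file's schema, stripe layout, and column statistics. A typical invocation (the path is a placeholder, and the flag may differ in very old Hive versions):

```
hive --orcfiledump /user/hive/warehouse/orders/000000_0
```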
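And about the Bloom filters from the third bullet: they have to be requested at write time through table properties. A hedged HiveQL sketch (the table name is made up; `orc.bloom.filter.columns` is the actual property):

```sql
CREATE TABLE orders_bf (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns" = "customer_id");
```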

