OK! So I decided to use Parquet as the storage format for my Hive tables, and before implementing it on my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, contrary to the general idea that it is faster than plain text files.
Please note that I am using Hive-0.13 on MapR
Here is my workflow:
Table A
- Format: TextFile
- Table size: 2.5 GB

Table B
- Format: Parquet
- Table size: 1.9 GB
- [created with: create table B stored as parquet as select * from A]

Table C
- Format: Parquet with Snappy compression
- Table size: 1.9 GB
- [created with: create table C stored as parquet tblproperties ("parquet.compression"="SNAPPY") as select * from A]
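Spelled out as runnable HiveQL, the two CTAS statements referenced above would look roughly like this (a sketch, assuming the text table A already exists and no partitioning is involved; Hive 0.13 supports Parquet natively via STORED AS PARQUET):

    -- Table B: plain Parquet copy of the text table A
    CREATE TABLE B
    STORED AS PARQUET
    AS SELECT * FROM A;

    -- Table C: Parquet copy with Snappy compression requested via table properties
    CREATE TABLE C
    STORED AS PARQUET
    TBLPROPERTIES ("parquet.compression"="SNAPPY")
    AS SELECT * FROM A;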
Here are the results of running the same queries against tables A and B. For each run the figures are, as Hive reports them, the number of mappers, number of reducers, cumulative CPU, and time taken:

A (Text):    Mappers: 15, Reducers: 1, Cumulative CPU: 123.33 sec, Time taken: 59.057 sec
B (Parquet): Mappers: 8,  Reducers: 1, Cumulative CPU: 204.92 sec, Time taken: 50.33 sec

A (Text):    Mappers: 15, Reducers: 0, Cumulative CPU: 51.18 sec, Time taken: 25.296 sec
B (Parquet): Mappers: 8,  Reducers: 0, Cumulative CPU: 117.08 sec, Time taken: 27.448 sec

A (Text):    Mappers: 15, Reducers: 0, Cumulative CPU: 57.55 sec, Time taken: 20.254 sec
B (Parquet): Mappers: 8,  Reducers: 0, Cumulative CPU: 113.97 sec, Time taken: 27.678 sec

A (Text):    Mappers: 15, Reducers: 0, Cumulative CPU: 57.55 sec, Time taken: 20.254 sec
B (Parquet): Mappers: 8,  Reducers: 0, Cumulative CPU: 113.97 sec, Time taken: 27.678 sec

A (Text):    Mappers: 15, Reducers: 1, Cumulative CPU: 127.85 sec, Time taken: 29.68 sec
B (Parquet): Mappers: 8,  Reducers: 1, Cumulative CPU: 255.2 sec, Time taken: 41.025 sec
As you can see, in every run the Parquet table burns noticeably more cumulative CPU than the text table, while its elapsed time is better in only one of the runs and worse in the others.
Table C (Parquet with Snappy compression) did not do any better than the TextFile table either.
Am I doing something wrong here, or is this the expected behaviour?
Thanks!
Update: I also tested ORC, both plain and with Snappy compression.
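The post does not show how the ORC tables were created; assuming they were built from table A the same way as B and C, the DDL would be along these lines (the table names here are made up for illustration):

    -- Hypothetical ORC copy of table A (ORC compresses with ZLIB by default)
    CREATE TABLE a_orc
    STORED AS ORC
    AS SELECT * FROM A;

    -- Hypothetical ORC copy using Snappy compression instead of the default
    CREATE TABLE a_orc_snappy
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY")
    AS SELECT * FROM A;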
Cumulative CPU by storage format:

Query 1
- Text: 123.33 sec
- Parquet: 204.92 sec
- ORC: 119.99 sec
- ORC with Snappy: 107.05 sec

Query 2
- Text: 127.85 sec
- Parquet: 255.2 sec
- ORC: 120.48 sec
- ORC with Snappy: 98.27 sec

Query 3
- Text: 128.79 sec
- Parquet: 211.73 sec
- ORC: 165.5 sec
- ORC with Snappy: 135.45 sec

Query 4 (with a where clause)
- Text: 72.48 sec
- Parquet: 136.4 sec
- ORC: 96.63 sec
- ORC with Snappy: 82.05 sec
So ORC comes out clearly ahead of Parquet in my environment. Is there something wrong with my Parquet setup or tests, or is this simply how it performs?
Thanks!