Does Hive apply schema while reading?

What is the difference between, and the significance of, these two statements that I came across during a lecture:

1. Traditional databases enforce schema during load time. 

and

 2. Hive enforces schema during read time. 
1 answer

You're touching on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure whether you're expecting a dissertation or not, but here it is! The added flexibility in storing and analyzing data probably contributed to the explosion of "data science", simply because it makes large-scale data analysis easier in general.

A traditional relational database stores data according to the schema. It knows that the second column is an integer, it knows that there are 40 columns, and so on. Therefore, you have to specify your schema ahead of time and plan it well. This is "schema on write" - that is, the schema is applied when the data is written into the data store.

Hive (in some cases), Hadoop, and many other NoSQL systems in general are "schema on read" - the schema is applied when the data is read out of the data store. Consider the following line of raw data:

 A:B:C~E:F~G:H~~I::J~K~L 

There are several ways to interpret this. ~ may be the delimiter, or perhaps : is. Who knows? With schema on read, it doesn't matter: you decide what the schema is when you analyze the data, not when you write it. This example is a bit contrived in that you'll probably never run into this exact case, but hopefully it makes the point.
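To make this concrete, here is a minimal Python sketch of the idea (nothing Hive-specific; the `read_with_schema` helper and both delimiter choices are purely illustrative): the same raw line yields two different tables depending on which "schema" the reader applies.

```python
# Minimal sketch of schema on read: one raw line, two interpretations.
raw = "A:B:C~E:F~G:H~~I::J~K~L"

def read_with_schema(line: str, delimiter: str) -> list[str]:
    """Apply a 'schema' (here, just a delimiter choice) at read time."""
    return line.split(delimiter)

# Interpretation 1: '~' is the field delimiter.
print(read_with_schema(raw, "~"))  # ['A:B:C', 'E:F', 'G:H', '', 'I::J', 'K', 'L']

# Interpretation 2: ':' is the field delimiter.
print(read_with_schema(raw, ":"))  # ['A', 'B', 'C~E', 'F~G', 'H~~I', '', 'J~K~L']
```

The bytes on disk never change; only the interpretation applied at read time does.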

With schema on read, you just load your data into the data store and figure out how to analyze and interpret it later. At the heart of it, schema on read means: write your data down first, figure it out later. Schema on write means: figure out your data first, then write it down.
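A rough Python illustration of the contrast (the helpers `write_with_schema` and `write_raw` and the two-column schema are invented for this sketch): schema on write rejects malformed rows at insert time, while schema on read stores anything and defers interpretation.

```python
# Hypothetical in-memory "stores" illustrating the two approaches.
schema = (int, str)  # schema on write: column 0 must be int, column 1 str

def write_with_schema(store: list, row: tuple) -> None:
    """Enforce the schema when data is written (like an RDBMS INSERT)."""
    if len(row) != len(schema) or not all(isinstance(v, t) for v, t in zip(row, schema)):
        raise ValueError(f"row {row!r} does not match schema")
    store.append(row)

def write_raw(store: list, line: str) -> None:
    """Schema on read: store the bytes as-is; parsing happens later."""
    store.append(line)

rdbms_like, hive_like = [], []
write_with_schema(rdbms_like, (1, "alice"))    # accepted: matches the schema
write_raw(hive_like, "not,even~tabular:data")  # accepted: no questions asked

try:
    write_with_schema(rdbms_like, ("oops", 42))  # rejected at write time
except ValueError as e:
    print(e)
```

The point of the sketch: with schema on write, bad rows never reach the store; with schema on read, every line gets in, and any problems surface later, at query time.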


There are trade-offs here. Some of this is subjective, and people have their own opinions.

Advantages of schema on write:

  • Better validation and cleansing of the data at rest
  • Usually more efficient (in both storage size and computation), since the data is already parsed

Disadvantages of schema on write:

  • You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
  • Typically you throw away the original raw data, which can be bad if you have a bug in your ingestion process
  • It is harder to have different views of the same data

Advantages of schema on read:

  • Flexibility in deciding how your data is interpreted at read time
    • This gives you the ability to evolve your "schema" over time
    • This allows you to have different versions of your "schema"
    • This allows the original source data format to change without you having to consolidate everything into a single format
  • You get to keep your original raw data
  • You can load your data before you know what you want to do with it (so you don't drop it on the floor)
  • Gives you the flexibility to store unstructured, unclean, and/or unorganized data
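To make the "different versions of your schema" point concrete, here is a hedged Python sketch (the column names and sample rows are made up): the raw lines stay untouched, and each reader applies its own schema to the same data.

```python
import csv
import io

# Hypothetical raw data, stored once and never rewritten.
raw = "1,alice,2020\n2,bob,2021\n"

def read(raw_text: str, column_names: list[str]) -> list[dict]:
    """Apply a schema (just column names here) at read time."""
    reader = csv.reader(io.StringIO(raw_text))
    return [dict(zip(column_names, row)) for row in reader]

# Schema v1: one team's interpretation of the columns...
v1 = read(raw, ["id", "name", "year"])
# Schema v2: ...and a later revision; the data on disk is unchanged.
v2 = read(raw, ["user_id", "username", "signup_year"])

print(v1[0])  # {'id': '1', 'name': 'alice', 'year': '2020'}
print(v2[0])  # {'user_id': '1', 'username': 'alice', 'signup_year': '2020'}
```

This mirrors what Hive does with external tables: the table definition is just an interpretation layered over files that already exist.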

Disadvantages of schema on read:

  • It is generally less efficient, because you have to re-parse and re-interpret the data every time (this can be expensive with formats such as XML)
  • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
  • It is more error prone, and your analytics have to account for dirty data