Using FileFormat vs. SerDe to Read Custom Text Files

I'm new to Hadoop/Hive. I am trying to use Hive to query data stored in a custom text format. As far as I understand, you can either write your own FileFormat or write a custom SerDe class for this. Is that right, or am I misunderstanding it? And what are some general guidelines for choosing one option over the other? Thanks!

+7
4 answers

Figured it out. In the end I didn't have to write a SerDe. I wrote a custom InputFormat (extending org.apache.hadoop.mapred.TextInputFormat) that returns a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader implements the logic for reading and parsing my files and returns tab-delimited rows.
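For reference, a minimal sketch of that approach with the old mapred API (the parseToTabDelimited helper and its logic are only placeholders; the real parsing depends on your file format):

package namespace;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class CustomFileInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new CustomRecordReader(new LineRecordReader(job, (FileSplit) split));
    }

    // Wraps the standard line reader and re-emits every raw line as a
    // tab-delimited row that Hive's default SerDe can then split into columns.
    public static class CustomRecordReader implements RecordReader<LongWritable, Text> {

        private final LineRecordReader lineReader;
        private final Text rawLine = new Text();

        public CustomRecordReader(LineRecordReader lineReader) {
            this.lineReader = lineReader;
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (!lineReader.next(key, rawLine)) {
                return false;
            }
            value.set(parseToTabDelimited(rawLine.toString()));
            return true;
        }

        // Placeholder: turn one line of the custom format into '\t'-separated
        // fields matching the table's column order.
        private String parseToTabDelimited(String raw) {
            return raw.replace(';', '\t');
        }

        @Override public LongWritable createKey() { return lineReader.createKey(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() throws IOException { return lineReader.getPos(); }
        @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
        @Override public void close() throws IOException { lineReader.close(); }
    }
}

(This skips the compression handling that TextInputFormat's own getRecordReader does, so treat it only as a starting point.)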

With this input format in place, I declared my table as:

 create table t2 ( field1 string, .. fieldNN float) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; 

This uses the native SerDe. Note also that you must specify an output format when you use a custom input format, so I chose one of the built-in output formats.

+11

Basically, you need to understand the difference between when to customize the SerDe and when to customize the file format.

From the official documentation: Hive SerDe

What is a SerDe?
1. "SerDe" is short for "Serializer and Deserializer".
2. Hive uses SerDe (and FileFormat) to read and write table rows.
3. HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
4. Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

So, the 3rd and 4th points clearly state the difference. You need your own file format (input/output) when you want to read records in a way other than the usual one, where records are separated by the '\n' character. And you need a custom SerDe when you want to interpret the read records in your own way.

Take the example of the widely used JSON format.

Scenario 1: Say you have a JSON input file where one line contains one JSON record. Then you only need a custom SerDe to interpret the read record the way you want. There is no need for a custom input format, since one line is one record.
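To make that concrete, here is a minimal sketch of what such a read-only SerDe could look like against the older AbstractSerDe API (Hive 1.x/2.x). The class name CustomJsonSerDe, treating every column as a string, and the regex-based field extraction are all simplifications assumed for illustration; a real implementation would use a proper JSON library, and Hive/HCatalog also ship ready-made JSON SerDes.

package namespace;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Read-only SerDe for Scenario 1: every input line is one JSON record.
public class CustomJsonSerDe extends AbstractSerDe {

    private List<String> columnNames;
    private ObjectInspector rowInspector;

    @Override
    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Column names come from the table definition ("columns" table property).
        columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
        // For simplicity every column is treated as a string here.
        List<ObjectInspector> inspectors = new ArrayList<ObjectInspector>();
        for (int i = 0; i < columnNames.size(); i++) {
            inspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        rowInspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, inspectors);
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // One line == one record; pull one value per declared column.
        String json = blob.toString();
        List<Object> row = new ArrayList<Object>();
        for (String column : columnNames) {
            row.add(extractField(json, column));
        }
        return row;
    }

    // Placeholder extraction of "name": value pairs; a real SerDe would use a
    // JSON library such as Jackson instead of a regex.
    private String extractField(String json, String name) {
        Matcher m = Pattern
            .compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"?([^\",}]*)\"?")
            .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return rowInspector;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
        throw new UnsupportedOperationException("This sketch only reads data");
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }
}

You would attach such a class to a table with ROW FORMAT SERDE 'namespace.CustomJsonSerDe' in the CREATE TABLE statement.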

Scenario 2: If instead you have an input file in which a single JSON record spans several lines, then you first need a custom input format that reads one whole JSON record at a time, and that record is then handed to the custom SerDe.
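To show what the input format in Scenario 2 has to do, here is a small, self-contained sketch of the accumulation logic its record reader would need. Plain Java I/O stands in for the Hadoop RecordReader plumbing, and counting braces (while ignoring braces inside string values and records spanning split boundaries) is a deliberate simplification:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Demonstrates the core of the multi-line case: keep consuming lines until the
// braces of one JSON object balance out, then treat the accumulated text as a
// single record. A real RecordReader would apply the same loop to the lines of
// its file split instead of a BufferedReader.
public class MultiLineJsonDemo {

    // Returns the next complete JSON object from the reader, or null at EOF.
    static String nextJsonRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        int depth = 0;
        boolean started = false;            // seen the opening '{' yet?
        String line;
        while ((line = in.readLine()) != null) {
            for (char c : line.toCharArray()) {
                if (c == '{') { depth++; started = true; }
                else if (c == '}') { depth--; }
            }
            if (started) {
                record.append(line).append('\n');
            }
            if (started && depth == 0) {
                return record.toString();   // one complete JSON object == one record
            }
        }
        return null;                        // end of input, nothing more to emit
    }

    public static void main(String[] args) throws IOException {
        String input = "{\n  \"field1\": \"a\",\n  \"field2\": 1\n}\n"
                     + "{\n  \"field1\": \"b\",\n  \"field2\": 2\n}\n";
        BufferedReader in = new BufferedReader(new StringReader(input));
        String record;
        while ((record = nextJsonRecord(in)) != null) {
            System.out.println("record: " + record.replace('\n', ' '));
        }
    }
}

Each string returned this way would be emitted as one record value, and the custom SerDe from Scenario 1 would then turn it into columns.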

+6

It depends on what you want to get out of the text file.

You can write your own record reader to parse the text log file and return the records the way you want; the input format class does that job for you. You then use that jar to create a Hive table and load data into it.

Speaking of SerDe, I use it a little differently. I use both an InputFormat and a SerDe: the former to parse the actual data, and the latter to stabilize the metadata that represents the actual data. Why do I do this? I want to create exactly the corresponding columns (no more, no less) in the Hive table for every row of my log file, and I think a SerDe is the perfect solution for that.

In the end, I map these two together to create the final table if I want, or keep the tables separate so that I can query them with joins.

I like the explanation in this Cloudera blog post:

http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

+2

If you are using Hive, write a SerDe. See the following examples: https://github.com/apache/hive/tree/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2

Note that this interface is specific to Hive. If you want to use your own file format for regular Hadoop jobs, you will have to implement a separate interface (I'm not quite sure which one).

If you already know how to deserialize the data in another language, you can simply write a streaming job (in any language) and use your existing libraries.

Hope that helps.

+1
