Update: it turns out the code below fails because I wrote it against the newer InputFormat API ( import org.apache.hadoop.mapred is the old API and import org.apache.hadoop.mapreduce is the new one ). The problem is now porting the new-API code back to the old API. Has anyone had experience writing a multi-line InputFormat using the old API?
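For anyone comparing the two, the sketch below is a simplified illustration of the contract difference, not the real Hadoop interfaces (the actual types live in org.apache.hadoop.mapred and org.apache.hadoop.mapreduce and take JobConf / TaskAttemptContext arguments, throw IOException, etc.). The toy array-backed reader is my own example of the old API's caller-supplied-object calling pattern:

```java
// Simplified shapes of the two RecordReader contracts (illustrative only).
interface OldStyleReader<K, V> {
    boolean next(K key, V value); // old mapred API: caller passes in reusable objects
    K createKey();
    V createValue();
}

abstract class NewStyleReader<K, V> {
    abstract boolean nextKeyValue();   // new mapreduce API: reader owns the current pair
    abstract K getCurrentKey();
    abstract V getCurrentValue();
}

// Toy old-style reader over an in-memory array, to show the calling pattern.
// long[] stands in for a mutable LongWritable, StringBuilder for a Text.
class ArrayLineReader implements OldStyleReader<long[], StringBuilder> {
    private final String[] lines;
    private int pos = 0;

    ArrayLineReader(String[] lines) { this.lines = lines; }

    public boolean next(long[] key, StringBuilder value) {
        if (pos >= lines.length) return false;
        key[0] = pos;               // fill the caller's key in place
        value.setLength(0);         // reuse the caller's value object
        value.append(lines[pos++]);
        return true;
    }
    public long[] createKey() { return new long[1]; }
    public StringBuilder createValue() { return new StringBuilder(); }
}
```

Porting to the old API therefore means inverting ownership: instead of holding the current key/value and exposing them via getters, the reader fills in objects the framework hands it on every call to next().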
I am attempting to process Omniture data log files using Hadoop/Hive. The file format is tab-delimited and, for the most part, quite simple, but a field may contain embedded newlines and tabs escaped with a backslash ( \\n and \\t ). Because of that, I decided to create my own InputFormat to handle the multi-line records and to convert those escaped tabs to spaces before Hive tries to split on tabs. When I tried loading some sample data into a Hive table, I got the following error:
CREATE TABLE (...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'OmnitureDataFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat omniture_hit_data
The strange thing is that my InputFormat does extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat ( https://gist.github.com/4a380409cd1497602906 ).
Does Hive require you to extend org.apache.hadoop.hive.ql.io.HiveInputFormat ? If so, do I need to rewrite my existing InputFormat and RecordReader classes, or can I just change the class they extend?
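Whichever API the reader ends up targeting, the per-record logic described above is API-agnostic, so here is a minimal plain-Java sketch of it (no Hadoop dependencies; the class and method names are mine, not from the gist, and it assumes the escape is a literal backslash immediately before the embedded line break or tab):

```java
import java.util.ArrayList;
import java.util.List;

// Joins physical lines into one logical record (a line ending in an unescaped
// backslash continues onto the next line), then rewrites escaped tabs to
// spaces so the only remaining tabs are real field delimiters.
class OmnitureRecordAssembler {

    // A trailing backslash that is not itself escaped marks a continued line:
    // an odd-length run of backslashes means the last one escapes the newline.
    static boolean continuesOnNextLine(String line) {
        int backslashes = 0;
        for (int i = line.length() - 1; i >= 0 && line.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    static List<String> assembleRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : physicalLines) {
            current.append(line);
            if (continuesOnNextLine(line)) {
                current.setLength(current.length() - 1); // drop the escaping backslash
                current.append(' ');                     // flatten the embedded newline
            } else {
                records.add(cleanRecord(current.toString()));
                current.setLength(0);
            }
        }
        return records;
    }

    // Replace backslash-escaped tabs with a space so Hive only splits on
    // genuine delimiter tabs.
    static String cleanRecord(String record) {
        return record.replace("\\\t", " ");
    }
}
```

A RecordReader for either API would run this same accumulation loop against the underlying line reader instead of an in-memory list.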