Update: it turns out the code below fails because I wrote it against the newer InputFormat API ( import org.apache.hadoop.mapred is the old API and import org.apache.hadoop.mapreduce is the new one ). The problem is now porting the new-API code back to the old API. Has anyone had experience writing a multi-line InputFormat using the old API?
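For anyone comparing the two, the sketch below is a simplified illustration of the contract difference, not the real Hadoop interfaces (the actual types live in org.apache.hadoop.mapred and org.apache.hadoop.mapreduce and take JobConf / TaskAttemptContext arguments, throw IOException, etc.). The toy array-backed reader is my own example of the old API's caller-supplied-object calling pattern:

```java
// Simplified shapes of the two RecordReader contracts (illustrative only).
interface OldStyleReader<K, V> {
    boolean next(K key, V value); // old mapred API: caller passes in reusable objects
    K createKey();
    V createValue();
}

abstract class NewStyleReader<K, V> {
    abstract boolean nextKeyValue();   // new mapreduce API: reader owns the current pair
    abstract K getCurrentKey();
    abstract V getCurrentValue();
}

// Toy old-style reader over an in-memory array, to show the calling pattern.
// long[] stands in for a mutable LongWritable, StringBuilder for a Text.
class ArrayLineReader implements OldStyleReader<long[], StringBuilder> {
    private final String[] lines;
    private int pos = 0;

    ArrayLineReader(String[] lines) { this.lines = lines; }

    public boolean next(long[] key, StringBuilder value) {
        if (pos >= lines.length) return false;
        key[0] = pos;               // fill the caller's key in place
        value.setLength(0);         // reuse the caller's value object
        value.append(lines[pos++]);
        return true;
    }
    public long[] createKey() { return new long[1]; }
    public StringBuilder createValue() { return new StringBuilder(); }
}
```

Porting to the old API therefore means inverting ownership: instead of holding the current key/value and exposing them via getters, the reader fills in objects the framework hands it on every call to next().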
I am attempting to process Omniture data log files using Hadoop/Hive. The file format is tab-delimited and, for the most part, quite simple, but a field may contain embedded newlines and tabs escaped with a backslash ( \\n and \\t ). Because of that, I decided to create my own InputFormat to handle the multi-line records and to convert those escaped tabs to spaces before Hive tries to split on tabs. When I tried loading some sample data into a Hive table, I got the following error:
CREATE TABLE (...) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'OmnitureDataFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat omniture_hit_data
The strange thing is that my InputFormat does extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat ( https://gist.github.com/4a380409cd1497602906 ).
Does Hive require you to extend org.apache.hadoop.hive.ql.io.HiveInputFormat ? If so, do I need to rewrite my existing InputFormat and RecordReader classes, or can I just change the class they extend?
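Whichever API the reader ends up targeting, the per-record logic described above is API-agnostic, so here is a minimal plain-Java sketch of it (no Hadoop dependencies; the class and method names are mine, not from the gist, and it assumes the escape is a literal backslash immediately before the embedded line break or tab):

```java
import java.util.ArrayList;
import java.util.List;

// Joins physical lines into one logical record (a line ending in an unescaped
// backslash continues onto the next line), then rewrites escaped tabs to
// spaces so the only remaining tabs are real field delimiters.
class OmnitureRecordAssembler {

    // A trailing backslash that is not itself escaped marks a continued line:
    // an odd-length run of backslashes means the last one escapes the newline.
    static boolean continuesOnNextLine(String line) {
        int backslashes = 0;
        for (int i = line.length() - 1; i >= 0 && line.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    static List<String> assembleRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : physicalLines) {
            current.append(line);
            if (continuesOnNextLine(line)) {
                current.setLength(current.length() - 1); // drop the escaping backslash
                current.append(' ');                     // flatten the embedded newline
            } else {
                records.add(cleanRecord(current.toString()));
                current.setLength(0);
            }
        }
        return records;
    }

    // Replace backslash-escaped tabs with a space so Hive only splits on
    // genuine delimiter tabs.
    static String cleanRecord(String record) {
        return record.replace("\\\t", " ");
    }
}
```

A RecordReader for either API would run this same accumulation loop against the underlying line reader instead of an in-memory list.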