Hive / Elastic MapReduce: How to get JsonSerde to ignore malformed JSON?

I am new to Hive and Elastic MapReduce and am currently facing a specific issue. When I run a Hive statement on a table with billions of rows of JSON objects, the MapReduce job fails as soon as a single row contains invalid / malformed JSON.

The exception:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {"ip":"39488130","cdate":"2012-08-09","cdate_ts":"2012-08-09 17:06:41","country":"SA","city":"Riyadh","mid":"6666276582211270592","osversion":"5.1.1
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {"ip":"39488130","cdate":"2012-08-09","cdate_ts":"2012-08-09 17:06:41","country":"SA","city":"Riyadh","mid":"6666276582211270592","osversion":"5.1.1
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:524)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
    ... 8 more
Caused by: com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated string near
    at com.google.gson.Streams.parse(Streams.java:51)
    at com.google.gson.JsonParser.parse(JsonParser.java:83)
    at com.google.gson.JsonParser.parse(JsonParser.java:58)
    at com.google.gson.JsonParser.parse(JsonParser.java:44)
    at com.amazon.elasticmapreduce.JsonSerde.deserialize(Unknown Source)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:510)
    ... 9 more
Caused by: com.google.gson.stream.MalformedJsonException: Unterminated string near
    at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
    at com.google.gson.stream.JsonReader.nextString(JsonReader.java:967)
    at com.google.gson.stream.JsonReader.nextValue(JsonReader.java:802)
    at com.google.gson.stream.JsonReader.objectValue(JsonReader.java:782)
    at com.google.gson.stream.JsonReader.quickPeek(JsonReader.java:377)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:340)
    at com.google.gson.Streams.parseRecursive(Streams.java:60)
    at com.google.gson.Streams.parseRecursive(Streams.java:83)
    at com.google.gson.Streams.parse(Streams.java:40)
    ... 14 more

I create my tables as follows:

    CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
      column1 string,
      column2 string
    )
    PARTITIONED BY (year string, month string)
    ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
    WITH SERDEPROPERTIES ('paths'='c1, c2')
    LOCATION 's3://mybucket/table1';

What can I do to prevent the crash? Ignoring invalid JSON objects / lines would be fine, since only about one row in billions is garbled.

Thanks for your help in advance. Best, Sasha

+4
3 answers

By changing the SerDe class used in the row format and adding the "ignore.malformed.json" property, you can make the table tolerate garbled JSON:

 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ("ignore.malformed.json" = "true") LOCATION ... 

Make the JAR available to Hive either through the "hive.aux.jars.path" property in "hive-site.xml" or with the "ADD JAR" command in your Hive script. You can download a prebuilt JAR or compile it from source.
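Putting the pieces together, the table from the question might be declared roughly as follows. This is only a sketch under a few assumptions: the JAR path is hypothetical, the 'paths' property from the original table is dropped because it belongs to the Amazon SerDe, and the column names are assumed to match the JSON keys, which is how the OpenX SerDe maps fields by default.

```sql
-- Hypothetical path: point this at wherever the SerDe JAR actually lives.
ADD JAR /home/hadoop/json-serde.jar;

-- Same table as in the question, but read through the OpenX SerDe.
-- Assumption: column1 and column2 are the JSON key names in the data.
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
  column1 string,
  column2 string
)
PARTITIONED BY (year string, month string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- Malformed lines no longer fail the job.
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION 's3://mybucket/table1';
```

As far as I know, a malformed line then shows up as a row with NULL columns rather than being dropped, so a filter such as WHERE column1 IS NOT NULL may still be needed in queries.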

+3

Apache's JsonSerDe seems to ignore invalid JSON strings ... http://code.google.com/p/hive-json-serde/

0

Basically, the above error is caused by an invalid JSON string. Try to fix that problem in the data first.

Otherwise, to avoid crashing your application, catch this exception in a try/catch block and proceed with the next record. That way your application will not crash.

-2