Combine two Avro schemas programmatically

I have two similar schemas that differ only in one nested field (it is called onefield in schema1 and anotherfield in schema2).

SCHEMA1

    {
      "type": "record",
      "name": "event",
      "namespace": "foo",
      "fields": [
        {
          "name": "metadata",
          "type": {
            "type": "record",
            "name": "event",
            "namespace": "foo.metadata",
            "fields": [
              { "name": "onefield", "type": ["null", "string"], "default": null }
            ]
          },
          "default": null
        }
      ]
    }

SCHEMA2

    {
      "type": "record",
      "name": "event",
      "namespace": "foo",
      "fields": [
        {
          "name": "metadata",
          "type": {
            "type": "record",
            "name": "event",
            "namespace": "foo.metadata",
            "fields": [
              { "name": "anotherfield", "type": ["null", "string"], "default": null }
            ]
          },
          "default": null
        }
      ]
    }

I can combine both schemas programmatically with Avro 1.8.0:

    Schema s1 = new Schema.Parser().parse(schema1);
    Schema s2 = new Schema.Parser().parse(schema2);
    Schema[] schemas = {s1, s2};
    Schema mergedSchema = null;
    for (Schema schema : schemas) {
        mergedSchema = AvroStorageUtils.mergeSchema(mergedSchema, schema);
    }

and use it to convert input JSON to Avro and back to JSON:

    JsonAvroConverter converter = new JsonAvroConverter();
    try {
        byte[] example = "{}".getBytes("UTF-8");
        byte[] avro = converter.convertToAvro(example, mergedSchema);
        byte[] json = converter.convertToJson(avro, mergedSchema);
        System.out.println(new String(json));
    } catch (AvroConversionException e) {
        e.printStackTrace();
    }

This code produces the expected result: {"metadata":{"onefield":null,"anotherfield":null}}. The problem is that I cannot see the merged schema itself. If I do a simple System.out.println(mergedSchema), I get the following exception:

    Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: merged schema (generated by AvroStorage).merged
        at org.apache.avro.Schema$Names.put(Schema.java:1127)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:561)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:689)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:715)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:700)
        at org.apache.avro.Schema.toString(Schema.java:323)
        at org.apache.avro.Schema.toString(Schema.java:313)
        at java.lang.String.valueOf(String.java:2982)
        at java.lang.StringBuilder.append(StringBuilder.java:131)

I call this the Avro uncertainty principle :). It looks like Avro can work with the merged schema, but it fails as soon as it tries to serialize the schema to JSON. Merging works with simpler schemas, so to me this sounds like a bug in Avro 1.8.0.

Do you know what is happening, or how to solve it? Any workaround (e.g. alternative Schema serializers) is welcome.

java avro
1 answer

I ran into the same problem with the Pig utility class... there are actually two bugs here:

  • Avro allows data to be serialized through GenericDatumWriter using an invalid schema
  • The piggybank util class generates invalid schemas because it reuses the same name/namespace for all merged fields (instead of keeping the original names)

This works correctly for more complex scenarios: https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/SchemaUtil.java#L511

  Schema mergedSchema = SchemaUtil.merge(s1, s2); 

In your example, I get the following output:

    {
      "type": "record",
      "name": "event",
      "namespace": "foo",
      "fields": [
        {
          "name": "metadata",
          "type": {
            "type": "record",
            "name": "event",
            "namespace": "foo.metadata",
            "fields": [
              { "name": "onefield", "type": ["null", "string"], "default": null },
              { "name": "anotherfield", "type": ["null", "string"], "default": null }
            ]
          },
          "default": null
        }
      ]
    }
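If you cannot add the Kite dependency, the root cause suggests a workaround: merge the field lists yourself while keeping the original record names and namespaces, so the result both validates and prints. Below is a rough sketch of that idea (my own simplification, not the piggybank or Kite implementation): it unions the fields of two record schemas and recurses into nested records, keeping the first schema for any non-record conflict. The embedded schema strings are the ones from the question, minus the "default": null on the metadata field, since newer Avro versions reject null as a default for a record-typed field.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;

public class RecordMerger {

    // The question's two schemas, without the "default": null on the
    // record-typed metadata field (invalid in newer Avro versions).
    static final String SCHEMA1 =
        "{\"type\":\"record\",\"name\":\"event\",\"namespace\":\"foo\",\"fields\":["
      + "{\"name\":\"metadata\",\"type\":{\"type\":\"record\",\"name\":\"event\","
      + "\"namespace\":\"foo.metadata\",\"fields\":["
      + "{\"name\":\"onefield\",\"type\":[\"null\",\"string\"],\"default\":null}]}}]}";
    static final String SCHEMA2 = SCHEMA1.replace("onefield", "anotherfield");

    // Merge two record schemas by unioning their field lists, recursing into
    // nested records. The result keeps the original name/namespace instead of
    // inventing a shared "merged" one, so toString() does not hit the
    // "Can't redefine" error. Field objects cannot be reused across schemas
    // (their position is fixed on first use), hence the fresh Field copies.
    static Schema merge(Schema a, Schema b) {
        if (a.getType() != Schema.Type.RECORD || b.getType() != Schema.Type.RECORD) {
            return a; // simplification: for non-record types, keep the first schema
        }
        List<Schema.Field> fields = new ArrayList<>();
        for (Schema.Field fa : a.getFields()) {
            Schema.Field fb = b.getField(fa.name());
            Schema fs = (fb == null) ? fa.schema() : merge(fa.schema(), fb.schema());
            fields.add(new Schema.Field(fa.name(), fs, fa.doc(), fa.defaultVal()));
        }
        for (Schema.Field fb : b.getFields()) {
            if (a.getField(fb.name()) == null) {
                fields.add(new Schema.Field(fb.name(), fb.schema(), fb.doc(), fb.defaultVal()));
            }
        }
        Schema merged = Schema.createRecord(a.getName(), a.getDoc(), a.getNamespace(), false);
        merged.setFields(fields);
        return merged;
    }

    public static void main(String[] args) {
        Schema s1 = new Schema.Parser().parse(SCHEMA1);
        Schema s2 = new Schema.Parser().parse(SCHEMA2);
        // Prints the merged schema with both onefield and anotherfield,
        // without throwing SchemaParseException.
        System.out.println(merge(s1, s2).toString(true));
    }
}
```

This is only a sketch: it does not handle unions of records, same-named fields with incompatible types, aliases, or schema-resolution rules. Kite's SchemaUtil.merge covers those cases and is the better choice when you can take the dependency.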

Hope this helps others.
