Deserialize Avro file with C#

I cannot find a way to deserialize an Apache Avro file with C#. The Avro file was created by the Capture feature of Microsoft Azure Event Hubs.

Using Java, I can use Avro Tools from Apache to convert the file to JSON:

java -jar avro-tools-1.8.1.jar tojson --pretty inputfile > output.json 

Using the NuGet package Microsoft.Hadoop.Avro I can extract SequenceNumber, Offset, and EnqueuedTimeUtc, but since I don't know which type to use for Body, an exception is thrown. I tried Dictionary<string, object> and other types.

    static void Main(string[] args)
    {
        var fileName = "...";
        using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            using (var reader = AvroContainer.CreateReader<EventData>(stream))
            {
                using (var streamReader = new SequentialReader<EventData>(reader))
                {
                    var record = streamReader.Objects.FirstOrDefault();
                }
            }
        }
    }

    [DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
    public class EventData
    {
        [DataMember(Name = "SequenceNumber")]
        public long SequenceNumber { get; set; }

        [DataMember(Name = "Offset")]
        public string Offset { get; set; }

        [DataMember(Name = "EnqueuedTimeUtc")]
        public string EnqueuedTimeUtc { get; set; }

        [DataMember(Name = "Body")]
        public foo Body { get; set; }

        // More properties...
    }

The schema is as follows:

    {
        "type": "record",
        "name": "EventData",
        "namespace": "Microsoft.ServiceBus.Messaging",
        "fields": [
            { "name": "SequenceNumber", "type": "long" },
            { "name": "Offset", "type": "string" },
            { "name": "EnqueuedTimeUtc", "type": "string" },
            {
                "name": "SystemProperties",
                "type": { "type": "map", "values": [ "long", "double", "string", "bytes" ] }
            },
            {
                "name": "Properties",
                "type": { "type": "map", "values": [ "long", "double", "string", "bytes" ] }
            },
            { "name": "Body", "type": [ "null", "bytes" ] }
        ]
    }
7 answers

I managed to get full access to the data using dynamic. Here is the code for accessing the raw body data, which is stored as an array of bytes. In my case those bytes contain UTF-8 encoded JSON, but of course that depends on how you originally created the EventData instances you published to the Event Hub:

    using (var reader = AvroContainer.CreateGenericReader(stream))
    {
        while (reader.MoveNext())
        {
            foreach (dynamic record in reader.Current.Objects)
            {
                var sequenceNumber = record.SequenceNumber;
                var bodyText = Encoding.UTF8.GetString(record.Body);
                Console.WriteLine($"{sequenceNumber}: {bodyText}");
            }
        }
    }
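Once you have the body bytes, decoding the payload is a separate step. A minimal sketch, assuming (as in my case) the bytes hold UTF-8 JSON; the field names `deviceId` and `temperature` are purely illustrative, not part of the capture format, and I use Newtonsoft.Json here only as one common choice:

```csharp
using System;
using System.Text;
using Newtonsoft.Json.Linq;

class BodyDecodeDemo
{
    static void Main()
    {
        // Stand-in for record.Body: UTF-8 JSON bytes, as produced by whatever
        // serialized the original EventData payload.
        byte[] body = Encoding.UTF8.GetBytes("{\"deviceId\":\"dev-1\",\"temperature\":21.5}");

        // Decode to a string, then parse the JSON.
        var json = JObject.Parse(Encoding.UTF8.GetString(body));

        Console.WriteLine((string)json["deviceId"]);     // dev-1
        Console.WriteLine((double)json["temperature"]);  // 21.5
    }
}
```

If your producers used a different encoding or a binary format, substitute the appropriate decoder for `JObject.Parse`.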

If someone can post a statically typed solution, I will accept it, but given that the largest latency in any such system will almost certainly come from reading the Event Hub blobs, I would not worry about parsing performance. :)


This Gist shows how to deserialize an Event Hub Capture blob with C# using Microsoft.Hadoop.Avro2, which has the advantage of supporting both .NET Framework 4.5 and .NET Standard 1.6:

    var connectionString = "<Azure event hub capture storage account connection string>";
    var containerName = "<Azure event hub capture container name>";
    var blobName = "<Azure event hub capture blob name (ends in .avro)>";

    var storageAccount = CloudStorageAccount.Parse(connectionString);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var container = blobClient.GetContainerReference(containerName);
    var blob = container.GetBlockBlobReference(blobName);

    using (var stream = blob.OpenRead())
    using (var reader = AvroContainer.CreateGenericReader(stream))
        while (reader.MoveNext())
            foreach (dynamic result in reader.Current.Objects)
            {
                var record = new AvroEventData(result);
                record.Dump();
            }

    public struct AvroEventData
    {
        public AvroEventData(dynamic record)
        {
            SequenceNumber = (long) record.SequenceNumber;
            Offset = (string) record.Offset;
            DateTime.TryParse((string) record.EnqueuedTimeUtc, out var enqueuedTimeUtc);
            EnqueuedTimeUtc = enqueuedTimeUtc;
            SystemProperties = (Dictionary<string, object>) record.SystemProperties;
            Properties = (Dictionary<string, object>) record.Properties;
            Body = (byte[]) record.Body;
        }

        public long SequenceNumber { get; set; }
        public string Offset { get; set; }
        public DateTime EnqueuedTimeUtc { get; set; }
        public Dictionary<string, object> SystemProperties { get; set; }
        public Dictionary<string, object> Properties { get; set; }
        public byte[] Body { get; set; }
    }
  • NuGet Links:

    • Microsoft.Hadoop.Avro2 (1.2.1 works)
    • WindowsAzure.Storage (8.3.0 works)
  • Namespaces:

    • Microsoft.Hadoop.Avro.Container
    • Microsoft.WindowsAzure.Storage

Finally, I was able to get this to work with the Apache Avro C# library.
I was stuck for a while because the Capture feature of Azure Event Hubs sometimes emits a file without any message content. I may also have had a problem with how the messages were originally serialized into the EventData object.
The code below works on a file saved to disk from the capture blob container.

    var dataFileReader = DataFileReader<EventData>.OpenReader(file);
    foreach (var record in dataFileReader.NextEntries)
    {
        // Do work on EventData object
    }

This also works using the GenericRecord object.

 var dataFileReader = DataFileReader<GenericRecord>.OpenReader(file); 
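With GenericRecord, fields are looked up by schema name rather than through properties. A small sketch of that access pattern, building a record in memory purely for illustration (records read via DataFileReader<GenericRecord> behave the same way; the single-field schema here is a cut-down version of the capture schema, not the full one):

```csharp
using System;
using System.Text;
using Avro;          // Apache.Avro NuGet package
using Avro.Generic;

class GenericRecordDemo
{
    static void Main()
    {
        // Cut-down schema with just the Body field for demonstration.
        var schema = (RecordSchema)Schema.Parse(
            "{\"type\":\"record\",\"name\":\"EventData\"," +
            "\"namespace\":\"Microsoft.ServiceBus.Messaging\"," +
            "\"fields\":[{\"name\":\"Body\",\"type\":[\"null\",\"bytes\"]}]}");

        var record = new GenericRecord(schema);
        record.Add("Body", Encoding.UTF8.GetBytes("hello"));

        // Fields are accessed by name; TryGetValue avoids an exception
        // when a field is missing, and Body may be null per the schema.
        if (record.TryGetValue("Body", out var body) && body is byte[] bytes)
            Console.WriteLine(Encoding.UTF8.GetString(bytes));  // hello
    }
}
```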

It took some effort to figure out. However, I now agree that the Azure Event Hubs Capture feature is a great way to back up all events. I still feel they should make the format optional, as with Stream Analytics, but maybe I'll get used to Avro.


Your remaining types, I suspect, should be defined as:

    [DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
    [KnownType(typeof(Dictionary<string, object>))]
    public class EventData
    {
        [DataMember]
        public IDictionary<string, object> SystemProperties { get; set; }

        [DataMember]
        public IDictionary<string, object> Properties { get; set; }

        [DataMember]
        public byte[] Body { get; set; }
    }

Even though Body is a union of null and bytes, this maps to byte[].

In C#, arrays are always reference types, so the value can be null and the contract is still satisfied.
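Putting this contract together with the reader code from the question gives a statically typed read. A sketch under those assumptions (the EventData class is the one defined above, and `fileName` is a hypothetical path to a capture file saved on disk):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using Microsoft.Hadoop.Avro.Container;

[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
[KnownType(typeof(Dictionary<string, object>))]
public class EventData
{
    [DataMember] public long SequenceNumber { get; set; }
    [DataMember] public string Offset { get; set; }
    [DataMember] public string EnqueuedTimeUtc { get; set; }
    [DataMember] public IDictionary<string, object> SystemProperties { get; set; }
    [DataMember] public IDictionary<string, object> Properties { get; set; }
    [DataMember] public byte[] Body { get; set; }
}

class TypedReadDemo
{
    static void Main()
    {
        var fileName = "capture.avro";  // hypothetical path to a capture file
        using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = AvroContainer.CreateReader<EventData>(stream))
        using (var streamReader = new SequentialReader<EventData>(reader))
        {
            foreach (var record in streamReader.Objects)
            {
                // Body may be null because the schema declares ["null", "bytes"].
                if (record.Body != null)
                    Console.WriteLine($"{record.SequenceNumber}: {Encoding.UTF8.GetString(record.Body)}");
            }
        }
    }
}
```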


You can also use the NullableSchema attribute to mark the body as a union of bytes and null. This lets you keep a strongly typed interface.

    [DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
    public class EventData
    {
        [DataMember(Name = "SequenceNumber")]
        public long SequenceNumber { get; set; }

        [DataMember(Name = "Offset")]
        public string Offset { get; set; }

        [DataMember(Name = "EnqueuedTimeUtc")]
        public string EnqueuedTimeUtc { get; set; }

        [DataMember(Name = "Body")]
        [NullableSchema]
        public byte[] Body { get; set; }
    }

I always get the exception "Specified argument was out of the range of valid values. Parameter name: size" and found that this issue has already been reported in this thread: https://github.com/Azure/azure-sdk-for-net/issues/3709. I am using .NET Core 2.2, Microsoft.Hadoop.Avro-Core 1.1.19, and Microsoft.Azure.Storage.Blob 10.0.0.

Any clue how to solve this? I have tried a lot without luck.


For people having problems serializing/deserializing Apache Avro data in C#, I created a small library that wraps Microsoft.Hadoop.Avro:

https://github.com/AdrianStrugala/AvroConvert

https://www.nuget.org/packages/AvroConvert

Usage is as simple as:

    byte[] avroFileContent = File.ReadAllBytes(fileName);

    Dictionary<string, object> result = AvroConvert.Deserialize(avroFileContent);

    // or if you know the model of the data
    MyModel typedResult = AvroConvert.Deserialize<MyModel>(avroFileContent);
