Avro file created by Flume Twitter agent cannot be read in Java

I am not able to read and parse a file created by streaming Twitter data with the Flume Twitter agent, using Java and the Avro library. My requirement is to convert the Avro format to JSON format.

Whichever approach I try, I get this exception: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

I am using a vanilla Hadoop configuration in pseudo-distributed mode; the Hadoop version is 2.7.1.

Flume version: 1.6.0

The Flume configuration file for the Twitter agent and the Java code used to parse the Avro file are given below:

TwitterAgent.sources=Twitter
TwitterAgent.channels=MemChannel
TwitterAgent.sinks=HDFS

TwitterAgent.sources.Twitter.type=org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels=MemChannel
TwitterAgent.sources.Twitter.consumerKey=xxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret=xxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken=xxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret=xxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords=Modi,PMO,Narendra Modi,BJP

TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/user/ashish/Twitter_Data
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=100
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10
TwitterAgent.sinks.HDFS.hdfs.rollInterval=30

TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=100

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.FileReader;
import org.apache.avro.file.SeekableInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class AvroReader {
    public static void main(String[] args) throws IOException {
        // Open the Flume output file directly from HDFS.
        Path path = new Path("hdfs://localhost:9000/user/ashish/Twitter_Data/FlumeData.1449656815028");
        Configuration config = new Configuration();
        SeekableInput input = new FsInput(path, config);

        // The container file carries its own schema, so a generic reader is used.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>();
        FileReader<GenericRecord> fileReader = DataFileReader.openReader(input, reader);

        for (GenericRecord datum : fileReader) {
            System.out.println("value = " + datum);
        }
        fileReader.close();
    }
}

The exception stack trace I received:

2015-12-09 17:48:19,291 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
value = {"id": "674535686809120768", "user_friends_count": 1260, "user_location": "ใƒฆใ‚ฆใ‚ตใƒช", "user_description": "ใ€Œใƒ†ใ‚ฌใƒŸใƒใƒใ€ใซ็™ปๅ ดใ™ใ‚‹ใ‚ถใ‚ธใฎbotใงใ™ใ€‚่ฟฝๅŠ ใ—ใฆใปใ—ใ„่จ€่‘‰ใชใฉใฎๅธŒๆœ›ใŒใ‚ใ‚Œใฐ๏ผค๏ผญใงใŠ้ก˜ใ„ใ—ใพใ™ใ€‚ใƒชใƒ ใƒผใƒ–ใ™ใ‚‹้š›ใฏใƒ–ใƒญใƒƒใ‚ฏใงใŠ้ก˜ใ„ใ—ใพใ™ใ€‚", "user_statuses_count": 47762, "user_followers_count": 1153, "user_name": "ใ‚ถใ‚ธ", "user_screen_name": "zazie_bot", "created_at": "2015-12-09T15:56:54Z", "text": "@ill_akane_bot ใŠๅ‰ใ€ใชใ‚“ใ‹ใ€\u2026ใ™ใฃใ’ใƒผๆฅฝใ—ใใ†ใ ใช\u2026", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 204695477, "source": "<a href=\"http:\/\/twittbot.net\/\" rel=\"nofollow\">twittbot.net<\/a>", "in_reply_to_status_id": 674535430423887872, "media_url_https": null, "expanded_url": null}
Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
    at avro.AvroReader.main(AvroReader.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
    ... 7 more

Also, do I need to provide the correct Avro schema for the Avro file, and if so, how?
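
For reference, the kind of Avro-to-JSON conversion I am after would look roughly like the sketch below. It assumes the container file has been copied from HDFS to the local filesystem first; the file name and class name are placeholders. It also shows that the container file carries its own schema, retrievable via reader.getSchema().

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

import java.io.File;
import java.io.IOException;

public class AvroToJson {
    public static void main(String[] args) throws IOException {
        // Local copy of the Flume output file (placeholder name).
        File avroFile = new File("FlumeData.1449656815028");
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            // The writer schema is embedded in the container file header.
            Schema schema = reader.getSchema();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, System.out);
            // Re-encode every record as JSON on standard output.
            for (GenericRecord record : reader) {
                writer.write(record, encoder);
            }
            encoder.flush();
        }
    }
}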

java avro flume flume-ng flume-twitter
1 answer

I ran into this problem too. Although I cannot inspect your data file, since it no longer exists, I checked my own data file, which should be similar to yours.

I found that my data file was already an Avro container file, which means it carries its own schema along with the data.
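
One way to verify this yourself (a minimal sketch of mine, assuming a local copy of the Flume output file; the class name is just a placeholder) is to open the file with DataFileStream and print the schema stored in its header:

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.FileInputStream;
import java.io.IOException;

public class SchemaDump {
    public static void main(String[] args) throws IOException {
        // args[0] is the path to a local copy of the Flume output file.
        try (FileInputStream in = new FileInputStream(args[0]);
             DataFileStream<GenericRecord> stream =
                     new DataFileStream<>(in, new GenericDatumReader<>())) {
            // The schema lives in the file header, so no external schema is needed.
            System.out.println(stream.getSchema().toString(true));
        }
    }
}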

The Avro file I got was malformed: a container file should have only one header (the part that holds the Avro schema), but mine actually contained several headers within the same file. That is presumably why the reader fails after the first block: it runs into a second header where it expects block data.
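
If you just need to salvage the records, a rough workaround I would try (my own untested sketch, not an official tool; the class name is a placeholder) is to scan a local copy of the file for the Avro magic bytes 'O','b','j',0x01 and read each section as a separate container file. Note the scan can produce false positives if that byte sequence happens to occur inside a data block, which is why unreadable segments are simply skipped.

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CorruptAvroSplitter {
    // Magic bytes that start every Avro container file.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    public static void main(String[] args) throws IOException {
        // args[0] is a local copy of the file, e.g. fetched with "hdfs dfs -get".
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        // Find the offset of every embedded Avro header.
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i <= data.length - MAGIC.length; i++) {
            if (data[i] == MAGIC[0] && data[i + 1] == MAGIC[1]
                    && data[i + 2] == MAGIC[2] && data[i + 3] == MAGIC[3]) {
                offsets.add(i);
            }
        }

        // Read each segment as an independent Avro container file.
        for (int s = 0; s < offsets.size(); s++) {
            int start = offsets.get(s);
            int end = (s + 1 < offsets.size()) ? offsets.get(s + 1) : data.length;
            try (InputStream in = new ByteArrayInputStream(data, start, end - start);
                 DataFileStream<GenericRecord> stream =
                         new DataFileStream<>(in, new GenericDatumReader<>())) {
                for (GenericRecord record : stream) {
                    System.out.println(record);   // toString() yields JSON-like output
                }
            } catch (Exception e) {
                // Truncated segment or false-positive match on the magic bytes.
                System.err.println("Skipping unreadable segment at offset " + start + ": " + e);
            }
        }
    }
}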

Another thing: tweets are already in JSON format, so why does Flume convert them to Avro format in the first place?
