Avro file created by Flume Twitter agent cannot be read in Java

I am not able to read and parse a file created by streaming Twitter data with the Flume Twitter agent, using Java and the Avro library. My requirement is to convert the Avro format to JSON format.

Whichever approach I try, I get this exception: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

I am using a vanilla Hadoop configuration in pseudo-distributed mode; the Hadoop version is 2.7.1.

Flume version: 1.6.0

The Flume configuration file for the Twitter agent and the Java code used to parse the Avro file are given below:

TwitterAgent.sources=Twitter
TwitterAgent.channels=MemChannel
TwitterAgent.sinks=HDFS

TwitterAgent.sources.Twitter.type=org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels=MemChannel
TwitterAgent.sources.Twitter.consumerKey=xxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret=xxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken=xxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret=xxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords=Modi,PMO,Narendra Modi,BJP

TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/user/ashish/Twitter_Data
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=100
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10
TwitterAgent.sinks.HDFS.hdfs.rollInterval=30

TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=100

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.FileReader;
import org.apache.avro.file.SeekableInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class AvroReader {
    public static void main(String[] args) throws IOException {
        // Open the Flume output file directly from HDFS.
        Path path = new Path("hdfs://localhost:9000/user/ashish/Twitter_Data/FlumeData.1449656815028");
        Configuration config = new Configuration();
        SeekableInput input = new FsInput(path, config);

        // The container file carries its own schema, so a generic reader is used.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>();
        FileReader<GenericRecord> fileReader = DataFileReader.openReader(input, reader);

        for (GenericRecord datum : fileReader) {
            System.out.println("value = " + datum);
        }
        fileReader.close();
    }
}

The exception stack trace I received:

2015-12-09 17:48:19,291 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
value = {"id": "674535686809120768", "user_friends_count": 1260, "user_location": "ใƒฆใ‚ฆใ‚ตใƒช", "user_description": "ใ€Œใƒ†ใ‚ฌใƒŸใƒใƒใ€ใซ็™ปๅ ดใ™ใ‚‹ใ‚ถใ‚ธใฎbotใงใ™ใ€‚่ฟฝๅŠ ใ—ใฆใปใ—ใ„่จ€่‘‰ใชใฉใฎๅธŒๆœ›ใŒใ‚ใ‚Œใฐ๏ผค๏ผญใงใŠ้ก˜ใ„ใ—ใพใ™ใ€‚ใƒชใƒ ใƒผใƒ–ใ™ใ‚‹้š›ใฏใƒ–ใƒญใƒƒใ‚ฏใงใŠ้ก˜ใ„ใ—ใพใ™ใ€‚", "user_statuses_count": 47762, "user_followers_count": 1153, "user_name": "ใ‚ถใ‚ธ", "user_screen_name": "zazie_bot", "created_at": "2015-12-09T15:56:54Z", "text": "@ill_akane_bot ใŠๅ‰ใ€ใชใ‚“ใ‹ใ€\u2026ใ™ใฃใ’ใƒผๆฅฝใ—ใใ†ใ ใช\u2026", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 204695477, "source": "<a href=\"http:\/\/twittbot.net\/\" rel=\"nofollow\">twittbot.net<\/a>", "in_reply_to_status_id": 674535430423887872, "media_url_https": null, "expanded_url": null}
Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
    at avro.AvroReader.main(AvroReader.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
    ... 7 more

Also, do I need to provide the correct Avro schema for the Avro file, and if so, how?
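
For reference, the kind of Avro-to-JSON conversion I am after would look roughly like the sketch below. It assumes the container file has been copied from HDFS to the local filesystem first; the file name and class name are placeholders. It also shows that the container file carries its own schema, retrievable via reader.getSchema().

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

import java.io.File;
import java.io.IOException;

public class AvroToJson {
    public static void main(String[] args) throws IOException {
        // Local copy of the Flume output file (placeholder name).
        File avroFile = new File("FlumeData.1449656815028");
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            // The writer schema is embedded in the container file header.
            Schema schema = reader.getSchema();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, System.out);
            // Re-encode every record as JSON on standard output.
            for (GenericRecord record : reader) {
                writer.write(record, encoder);
            }
            encoder.flush();
        }
    }
}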

java avro flume flume-ng flume-twitter
1 answer

I ran into this problem too. Although I cannot inspect your data file, since it no longer exists, I checked my own data file, which should be similar to yours.

I found that my data file was already an Avro container file, which means it carries its own schema along with the data.
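
One way to verify this yourself (a minimal sketch of mine, assuming a local copy of the Flume output file; the class name is just a placeholder) is to open the file with DataFileStream and print the schema stored in its header:

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.FileInputStream;
import java.io.IOException;

public class SchemaDump {
    public static void main(String[] args) throws IOException {
        // args[0] is the path to a local copy of the Flume output file.
        try (FileInputStream in = new FileInputStream(args[0]);
             DataFileStream<GenericRecord> stream =
                     new DataFileStream<>(in, new GenericDatumReader<>())) {
            // The schema lives in the file header, so no external schema is needed.
            System.out.println(stream.getSchema().toString(true));
        }
    }
}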

The Avro file I got was malformed: a container file should have only one header (the part that holds the Avro schema), but mine actually contained several headers within the same file. That is presumably why the reader fails after the first block: it runs into a second header where it expects block data.
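
If you just need to salvage the records, a rough workaround I would try (my own untested sketch, not an official tool; the class name is a placeholder) is to scan a local copy of the file for the Avro magic bytes 'O','b','j',0x01 and read each section as a separate container file. Note the scan can produce false positives if that byte sequence happens to occur inside a data block, which is why unreadable segments are simply skipped.

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CorruptAvroSplitter {
    // Magic bytes that start every Avro container file.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    public static void main(String[] args) throws IOException {
        // args[0] is a local copy of the file, e.g. fetched with "hdfs dfs -get".
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        // Find the offset of every embedded Avro header.
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i <= data.length - MAGIC.length; i++) {
            if (data[i] == MAGIC[0] && data[i + 1] == MAGIC[1]
                    && data[i + 2] == MAGIC[2] && data[i + 3] == MAGIC[3]) {
                offsets.add(i);
            }
        }

        // Read each segment as an independent Avro container file.
        for (int s = 0; s < offsets.size(); s++) {
            int start = offsets.get(s);
            int end = (s + 1 < offsets.size()) ? offsets.get(s + 1) : data.length;
            try (InputStream in = new ByteArrayInputStream(data, start, end - start);
                 DataFileStream<GenericRecord> stream =
                         new DataFileStream<>(in, new GenericDatumReader<>())) {
                for (GenericRecord record : stream) {
                    System.out.println(record);   // toString() yields JSON-like output
                }
            } catch (Exception e) {
                // Truncated segment or false-positive match on the magic bytes.
                System.err.println("Skipping unreadable segment at offset " + start + ": " + e);
            }
        }
    }
}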

Another thing: tweets are already in JSON format, so why does Flume convert them to Avro format in the first place?
