It is not possible to read and analyze a file created by streaming twitter data using the Flute twitter agent without using Java and Avro Tools. My requirement is to convert avro format to JSON format.
When using any of the methods, I get an exception: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
I am using the hadoop vanilla configuration in pseudo node and the hadoop version is 2.7.1
Version for flume - 1.6.0
The flume configuration file for twitter agent and the java code for parsing the avro file are attached below:
TwitterAgent.sources=Twitter TwitterAgent.channels=MemChannel TwitterAgent.sinks=HDFS TwitterAgent.sources.Twitter.type=org.apache.flume.source.twitter.TwitterSource TwitterAgent.sources.Twitter.channels=MemChannel TwitterAgent.sources.Twitter.consumerKey=xxxxxxxxxxxxxx TwitterAgent.sources.Twitter.consumerSecret=xxxxxxxxxxxxxxxx TwitterAgent.sources.Twitter.accessToken=xxxxxxxxxxxxxxxx TwitterAgent.sources.Twitter.accessTokenSecret=xxxxxxxxxxxxxx TwitterAgent.sources.Twitter.keywords=Modi,PMO,Narendra Modi,BJP TwitterAgent.sinks.HDFS.channel=MemChannel TwitterAgent.sinks.HDFS.type=hdfs TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/user/ashish/Twitter_Data TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream TwitterAgent.sinks.HDFS.hdfs.writeformat=Text TwitterAgent.sinks.HDFS.hdfs.batchSize=100 TwitterAgent.sinks.HDFS.hdfs.rollSize=0 TwitterAgent.sinks.HDFS.hdfs.rollCount=10 TwitterAgent.sinks.HDFS.hdfs.rollInterval=30 TwitterAgent.channels.MemChannel.type=memory TwitterAgent.channels.MemChannel.capacity=10000 TwitterAgent.channels.MemChannel.transactionCapacity=100
import org.apache.avro.file.DataFileReader; import org.apache.avro.file.FileReader; import org.apache.avro.file.SeekableInput; import org.apache.avro.generic.GenericDatumReader; import org.apache.avro.generic.GenericRecord; import org.apache.avro.io.DatumReader; import org.apache.avro.mapred.FsInput; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import java.io.IOException; public class AvroReader { public static void main(String[] args) throws IOException { Path path = new Path("hdfs://localhost:9000/user/ashish/Twitter_Data/FlumeData.1449656815028"); Configuration config = new Configuration(); SeekableInput input = new FsInput(path, config); DatumReader<GenericRecord> reader = new GenericDatumReader<>(); FileReader<GenericRecord> fileReader = DataFileReader.openReader(input, reader); for (GenericRecord datum : fileReader) { System.out.println("value = " + datum); } fileReader.close(); } }
Exceptional stack trace I received:
2015-12-09 17:48:19,291 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable value = {"id": "674535686809120768", "user_friends_count": 1260, "user_location": "ใฆใฆใตใช", "user_description": "ใใใฌใใใใใซ็ปๅ ดใใใถใธใฎbotใงใใ่ฟฝๅ ใใฆใปใใ่จ่ใชใฉใฎๅธๆใใใใฐ๏ผค๏ผญใงใ้กใใใพใใใชใ ใผใใใ้ใฏใใญใใฏใงใ้กใใใพใใ", "user_statuses_count": 47762, "user_followers_count": 1153, "user_name": "ใถใธ", "user_screen_name": "zazie_bot", "created_at": "2015-12-09T15:56:54Z", "text": "@ill_akane_bot ใๅใใชใใใ\u2026ใใฃใใผๆฅฝใใใใ ใช\u2026", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 204695477, "source": "<a href=\"http:\/\/twittbot.net\/\" rel=\"nofollow\">twittbot.net<\/a>", "in_reply_to_status_id": 674535430423887872, "media_url_https": null, "expanded_url": null} Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275) at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197) at avro.AvroReader.main(AvroReader.java:24) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266) ... 7 more ใตใช", "user_description": " 2015-12-09 17:48:19,291 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable value = {"id": "674535686809120768", "user_friends_count": 1260, "user_location": "ใฆใฆใตใช", "user_description": "ใใใฌใใใใใซ็ปๅ ดใใใถใธใฎbotใงใใ่ฟฝๅ ใใฆใปใใ่จ่ใชใฉใฎๅธๆใใใใฐ๏ผค๏ผญใงใ้กใใใพใใใชใ ใผใใใ้ใฏใใญใใฏใงใ้กใใใพใใ", "user_statuses_count": 47762, "user_followers_count": 1153, "user_name": "ใถใธ", "user_screen_name": "zazie_bot", "created_at": "2015-12-09T15:56:54Z", "text": "@ill_akane_bot ใๅใใชใใใ\u2026ใใฃใใผๆฅฝใใใใ ใช\u2026", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 204695477, "source": "<a href=\"http:\/\/twittbot.net\/\" rel=\"nofollow\">twittbot.net<\/a>", "in_reply_to_status_id": 674535430423887872, "media_url_https": null, "expanded_url": null} Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275) at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197) at avro.AvroReader.main(AvroReader.java:24) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266) ... 7 moreใงใ่ฟฝๅ ใใฆใปใใ่จ่ใชใฉใฎๅธๆใ. 2015-12-09 17:48:19,291 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable value = {"id": "674535686809120768", "user_friends_count": 1260, "user_location": "ใฆใฆใตใช", "user_description": "ใใใฌใใใใใซ็ปๅ ดใใใถใธใฎbotใงใใ่ฟฝๅ ใใฆใปใใ่จ่ใชใฉใฎๅธๆใใใใฐ๏ผค๏ผญใงใ้กใใใพใใใชใ ใผใใใ้ใฏใใญใใฏใงใ้กใใใพใใ", "user_statuses_count": 47762, "user_followers_count": 1153, "user_name": "ใถใธ", "user_screen_name": "zazie_bot", "created_at": "2015-12-09T15:56:54Z", "text": "@ill_akane_bot ใๅใใชใใใ\u2026ใใฃใใผๆฅฝใใใใ ใช\u2026", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 204695477, "source": "<a href=\"http:\/\/twittbot.net\/\" rel=\"nofollow\">twittbot.net<\/a>", "in_reply_to_status_id": 674535430423887872, "media_url_https": null, "expanded_url": null} Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275) at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197) at avro.AvroReader.main(AvroReader.java:24) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40 at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266) ... 7 more
Also I need to give the correct Avro scheme for the Avro file, if so?