I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which would allow me to analyze binary data in a language other than Java. (Without this, Streaming interprets certain characters, \t and \n in particular, as delimiters, and complains about non-UTF-8 characters. Converting all of my binary data to Base64 would slow down the workflow, defeating the purpose.)
These binary modes were added by HADOOP-1722. On the command line that invokes the Hadoop Streaming job, "-io rawbytes" lets you define a record as a 32-bit integer length followed by raw data of that length, and "-io typedbytes" lets you define a record as a one-byte type code, 0 (meaning raw bytes), followed by a 32-bit integer length, followed by raw data of that length. I created files in these formats (with one or many records) and verified that they are in the correct format by checking them against the output of typedbytes.py. I also tried all conceivable variants (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement typedbytes handling, and it is calling those classes when the "-io" switch is set.
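To make those two layouts concrete, here is a minimal sketch (my own code, not part of Hadoop) of how I write a single raw-bytes record in each format, using Python's struct with big-endian lengths:

```python
import struct

def rawbytes_record(data):
    # rawbytes: 32-bit big-endian length, then the raw bytes themselves
    return struct.pack(">i", len(data)) + data

def typedbytes_record(data):
    # typedbytes: one-byte type code (0 = raw bytes), 32-bit big-endian
    # length, then the raw bytes themselves
    return struct.pack(">bi", 0, len(data)) + data

# write one-record test files in each format
with open("input.rawbytes", "wb") as f:
    f.write(rawbytes_record(b"hey"))
with open("input.typedbytes", "wb") as f:
    f.write(typedbytes_record(b"hey"))
```

For the payload b"hey", these produce b"\x00\x00\x00\x03hey" and b"\x00\x00\x00\x00\x03hey" respectively, which matches what typedbytes.py produces for me.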
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to allocate a byte array of the length I specify (e.g. 3 bytes). It must be reading the length incorrectly and allocating a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for example, if the file is "\x00\x00\x00\x00\x03hey", the mapper sees "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bytes, though I can't see a pattern in them).
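For comparison, here is a sketch of the decoding I expect on the reading side (my own code, handling only type code 0); it parses the 3-byte example file above without any problem:

```python
import io
import struct

def read_typedbytes_raw(stream):
    # parse typedbytes records: one-byte type code (expected to be 0 =
    # raw bytes), 32-bit big-endian length, then that many payload bytes
    records = []
    while True:
        header = stream.read(5)
        if not header:
            break
        code, length = struct.unpack(">bi", header)
        assert code == 0, "only raw-bytes records expected in this sketch"
        records.append(stream.read(length))
    return records

print(read_typedbytes_raw(io.BytesIO(b"\x00\x00\x00\x00\x03hey")))  # [b'hey']
```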
From page 5 of this talk, I learned that there are "loadtb" and "dumptb" commands that copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained inside, in exactly the same way.
Moreover, I can't find any documentation for this feature. As of Feb 7th (I e-mailed it to myself), it was briefly mentioned on the streaming.html page on Apache, but that r0.21.0 page has since been removed, and the equivalent page for r1.1.1 does not mention rawbytes or typedbytes.
So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever got it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in a Hadoop Streaming job, which ought to be a fairly broad group.
P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without mediating it through a Python- or R-based framework.