I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which would allow me to analyze binary data in a language other than Java. (Without this, Streaming interprets certain characters, \t and \n in particular, as delimiters, and complains about non-UTF-8 characters. Converting all of my binary data to Base64 would slow down the workflow, defeating the purpose.)
These binary modes were added by HADOOP-1722. On the command line that invokes the Hadoop Streaming job, "-io rawbytes" lets you define a record as a 32-bit integer length followed by raw data of that length, and "-io typedbytes" lets you define a record as a one-byte type code, 0 (meaning raw bytes), followed by a 32-bit integer length, followed by raw data of that length. I created files in these formats (with one or many records) and verified that they are in the correct format by checking them against the output of typedbytes.py. I also tried all conceivable variants (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement typedbytes handling, and it is calling those classes when the "-io" switch is set.
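To make those two layouts concrete, here is a minimal sketch (my own code, not part of Hadoop) of how I write a single raw-bytes record in each format, using Python's struct with big-endian lengths:

```python
import struct

def rawbytes_record(data):
    # rawbytes: 32-bit big-endian length, then the raw bytes themselves
    return struct.pack(">i", len(data)) + data

def typedbytes_record(data):
    # typedbytes: one-byte type code (0 = raw bytes), 32-bit big-endian
    # length, then the raw bytes themselves
    return struct.pack(">bi", 0, len(data)) + data

# write one-record test files in each format
with open("input.rawbytes", "wb") as f:
    f.write(rawbytes_record(b"hey"))
with open("input.typedbytes", "wb") as f:
    f.write(typedbytes_record(b"hey"))
```

For the payload b"hey", these produce b"\x00\x00\x00\x03hey" and b"\x00\x00\x00\x00\x03hey" respectively, which matches what typedbytes.py produces for me.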
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to allocate a byte array of the length I specify (e.g. 3 bytes). It must be reading the length incorrectly and allocating a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for example, if the file is "\x00\x00\x00\x00\x03hey", the mapper sees "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bytes, though I can't see a pattern in them).
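For comparison, here is a sketch of the decoding I expect on the reading side (my own code, handling only type code 0); it parses the 3-byte example file above without any problem:

```python
import io
import struct

def read_typedbytes_raw(stream):
    # parse typedbytes records: one-byte type code (expected to be 0 =
    # raw bytes), 32-bit big-endian length, then that many payload bytes
    records = []
    while True:
        header = stream.read(5)
        if not header:
            break
        code, length = struct.unpack(">bi", header)
        assert code == 0, "only raw-bytes records expected in this sketch"
        records.append(stream.read(length))
    return records

print(read_typedbytes_raw(io.BytesIO(b"\x00\x00\x00\x00\x03hey")))  # [b'hey']
```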
From page 5 of this talk, I learned that there are "loadtb" and "dumptb" commands that copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained inside, in exactly the same way.
Moreover, I can't find any documentation for this feature. As of Feb 7th (I e-mailed it to myself), it was briefly mentioned on the streaming.html page on Apache, but that r0.21.0 page has since been removed, and the equivalent page for r1.1.1 does not mention rawbytes or typedbytes.
So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever got it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in a Hadoop Streaming job, which ought to be a fairly broad group.
P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without mediating it through a Python- or R-based framework.