Hadoop Streaming with SequenceFile (on AWS)

I have a large number of Hadoop SequenceFiles that I would like to process using Hadoop on AWS. Most of my existing code is written in Ruby, so I would like to use Hadoop Streaming along with my custom Ruby Mapper and Reducer scripts on Amazon EMR.

I cannot find documentation on how to use SequenceFiles as input to Hadoop Streaming, or on how the input will be presented to my Ruby scripts. I would appreciate instructions on how to run such jobs (either directly on EMR, or on a plain Hadoop command line) with SequenceFiles, and some information on the format in which the data will be passed to my script.

- Edit: I previously wrote StreamFiles rather than SequenceFiles. I think the documentation for my data was wrong; my apologies. The answer applies either way.

2 answers

The answer is to specify the input format as a command-line argument to Hadoop Streaming:

-inputformat SequenceFileAsTextInputFormat

Most likely you want the SequenceFile contents treated as text, but there is also SequenceFileAsBinaryInputFormat if that is more appropriate.
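For example, a full streaming invocation using that input format might look like the following (the input path and script names are illustrative, not from your setup):

```shell
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -inputformat SequenceFileAsTextInputFormat \
    -input input/seqfiles \
    -output output \
    -mapper my_mapper.rb \
    -reducer my_reducer.rb
```

With SequenceFileAsTextInputFormat, each record reaches your mapper's stdin as one line of key, a tab character, then value, with both converted to their string representations.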


Not sure if this is what you are asking for, but running Ruby map and reduce scripts from the hadoop command line looks something like this:

 % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
     -input input/ncdc/sample.txt \
     -output output \
     -mapper ch02/src/main/ruby/max_temperature_map.rb \
     -reducer ch02/src/main/ruby/max_temperature_reduce.rb

You can (and should) use a combiner with large datasets; add it with the -combiner option. The combiner runs on your mapper's output before it reaches the reducer (though there is no guarantee how many times it will be invoked, if at all). Otherwise, your input is split (according to the standard Hadoop InputFormat behavior) and fed directly to your mapper. The example above is from O'Reilly's Hadoop: The Definitive Guide, 3rd Edition, which has very good information on streaming, including a section on streaming with Ruby.
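As a minimal sketch (not taken from the book), a Ruby streaming mapper just reads lines from stdin and writes tab-separated key/value pairs to stdout. Assuming SequenceFileAsTextInputFormat, each input line is already a key, a tab, and a value; the word-count style emission here is purely illustrative:

```ruby
#!/usr/bin/env ruby
# Illustrative streaming mapper sketch. With SequenceFileAsTextInputFormat,
# each stdin line is "key\tvalue", where key and value are the SequenceFile
# record's key and value converted to strings.

def map_record(line)
  key, value = line.chomp.split("\t", 2)
  return [] if value.nil?                    # skip lines with no tab/value
  value.split.map { |word| "#{word}\t1" }    # emit word-count style pairs
end

if __FILE__ == $0
  STDIN.each_line { |line| map_record(line).each { |pair| puts pair } }
end
```

Your reducer then receives these pairs on stdin, sorted by key, and aggregates them the same way.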

