I am trying to play with the Google Ngrams dataset using Amazon Elastic MapReduce. There is a public dataset at http://aws.amazon.com/datasets/8172056142375670 , and I want to use Hadoop Streaming.
For the input files, the dataset description says: "We store the data sets in a single object in Amazon S3. The file is in sequence file format with block-level LZO compression. The sequence file key is the row number of the data set stored as a LongWritable, and the value is the raw data stored as TextWritable."
What do I need to do to process these input files using Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but that does not seem to work: my jobs keep failing for unspecified reasons. Are there other arguments I am missing?
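For context, this is roughly the kind of invocation I mean (the bucket and script names below are placeholders, not from the dataset page; the streaming jar path also varies by Hadoop version):

```shell
# Sketch of a Hadoop Streaming job reading sequence files as text.
# SequenceFileAsTextInputFormat hands each record to the mapper as a
# line of "<key>\t<value>" text.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input  s3n://your-bucket/ngrams-input \
  -output s3n://your-bucket/ngrams-output \
  -inputformat SequenceFileAsTextInputFormat \
  -mapper mapper.rb \
  -reducer reducer.rb \
  -file mapper.rb \
  -file reducer.rb
```

One possible failure mode worth noting: since the files use block-level LZO compression, the cluster also needs the LZO codec (e.g. hadoop-lzo) installed; a missing codec can make jobs fail with unhelpful errors.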
I tried using a trivially simple identity script for both my mapper and my reducer:
STDIN.each do |line|
puts line
end
but it does not work.
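One detail the identity script glosses over: if SequenceFileAsTextInputFormat works as expected, the mapper receives each record as "<key>\t<value>", where the key is just the row number. A mapper would probably want to drop that key. A minimal sketch (strip_key is a name I made up for illustration):

```ruby
#!/usr/bin/env ruby
# Sketch of a mapper for SequenceFileAsTextInputFormat input, where each
# input line is "<row number>\t<raw n-gram data>". We keep only the value.
def strip_key(record)
  key, value = record.chomp.split("\t", 2)
  value || key  # if there is no tab, pass the whole record through
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    puts strip_key(line)
  end
end
```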