How do I use Hadoop Streaming with LZO-compressed SequenceFiles?

I am trying to play with the Google ngrams dataset using Amazon Elastic MapReduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670 , and I want to use Hadoop Streaming.

For the input files, the dataset description says: "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as a TextWritable."

What do I need to do to process these input files using Hadoop Streaming?

I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work - my job keeps failing for an unspecified reason. Are there other arguments I'm missing?

I tried using a very simple identity script for both my mapper and reducer:

#!/usr/bin/env ruby

# identity: pass every input line through unchanged
STDIN.each do |line|
  puts line
end

but it does not work.

+5
4 answers

lzo is packaged as part of Elastic MapReduce, so there's no need to install anything.

I just tried this and it works ...

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper
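
To run your own streaming mapper instead (for example, the Ruby identity script from the question), ship the script with -file. A sketch, assuming you've saved it locally as identity.rb (a placeholder name):

 # -file ships identity.rb to the task nodes so -mapper can invoke it
 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -file identity.rb \
  -mapper identity.rb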
+6

LZO support isn't bundled with Hadoop 0.20.x; you have to install both native lzo and hadoop-lzo separately.

Kevin Weil's hadoop-lzo is the fork to use.

You need the lzo-devel packages installed (on every node): without the native lzo library, hadoop-lzo cannot work.

Then follow the hadoop-lzo readme to build it. The resulting hadoop-lzo jar is what teaches Hadoop about lzo; the jar (and the native libraries it builds) have to be copied onto every node in the cluster.

The native libraries that ship with Hadoop are Linux builds. If you're on Solaris, you'll need to rebuild them yourself against your hadoop.
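
For concreteness, the steps look roughly like this on a RHEL-style box (a sketch following the hadoop-lzo readme; the package names and $HADOOP_HOME paths depend on your system):

 # native lzo library and headers, needed to build and run hadoop-lzo
 yum install lzo lzo-devel

 # build Kevin Weil's hadoop-lzo (needs git, ant, and a JDK)
 git clone https://github.com/kevinweil/hadoop-lzo.git
 cd hadoop-lzo
 ant compile-native tar

 # deploy the jar and native libraries; repeat on every node
 cp build/hadoop-lzo-*.jar $HADOOP_HOME/lib/
 cp -r build/native/* $HADOOP_HOME/lib/native/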

+3

As an alternative to lzo, you can compress the map output and the final output with Snappy:

-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec

And don't forget (as the accepted answer shows) to pass -inputformat.
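
Put together, a full streaming invocation might look like this (a sketch; note the codec settings only take effect once compression is actually switched on, so I've added mapred.compress.map.output and mapred.output.compress):

 hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.compress.map.output=true \
  -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
  -inputformat SequenceFileAsTextInputFormat \
  -output test_output \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper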

Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
0
