Pydoop on Amazon EMR

How would I use Pydoop on Amazon EMR?

I tried to google this topic to no avail: is this even possible?

+8
python amazon-web-services hadoop emr amazon-emr
2 answers

I finally got this to work. Everything happens on the master node: SSH into that node as the hadoop user.

You need these packages:

    sudo easy_install argparse importlib
    sudo apt-get update
    sudo apt-get install libboost-python-dev

To build everything:

    wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
    wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
    tar xvf hadoop-0.20.205.0.tar.gz
    tar xvf pydoop-0.6.0.tar.gz
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export JVM_ARCH=64   # I assume that 32 works for 32-bit systems
    export HADOOP_HOME=/home/hadoop
    export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
    export HADOOP_VERSION=0.20.205
    export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/
    cd ~/hadoop-0.20.205.0/src/c++/libhdfs
    sh ./configure
    make
    make install
    cd ../install
    tar cvfz ~/libhdfs.tar.gz lib
    sudo tar xvf ~/libhdfs.tar.gz -C /usr
    cd ~/pydoop-0.6.0
    python setup.py bdist
    cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
    sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /

If you save the two tarballs, in the future you can skip the build part and just run the install steps below (you still need to figure out how to do this as a bootstrap action so it gets installed on all of the cluster nodes; a sketch of such a script follows the install commands).

    sudo tar xvf ~/libhdfs.tar.gz -C /usr
    sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
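
I have not turned this into a bootstrap action myself, but a minimal sketch of what that script could look like, assuming the two archives built above have been uploaded to an S3 bucket of yours and that the hadoop client can already read from S3 at bootstrap time on your AMI (the bucket name and key names are placeholders), would be:

    #!/bin/bash
    # Hypothetical bootstrap script (not from the original answer): fetch the
    # prebuilt archives from S3 and unpack them on every node.
    # Replace "<my bucket>" with your own bucket name.
    set -e
    hadoop fs -get "s3://<my bucket>/libhdfs.tar.gz" /tmp/libhdfs.tar.gz
    hadoop fs -get "s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz" /tmp/pydoop.tar.gz
    sudo tar xvf /tmp/libhdfs.tar.gz -C /usr
    sudo tar xvf /tmp/pydoop.tar.gz -C /

If fetching via the hadoop client does not work at bootstrap time on your image, pull the archives some other way (for example from a plain HTTP location) and keep the two tar commands as they are.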

Then I was able to run the sample program using the full Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self).__init__(context)).

    #!/usr/bin/python
    import pydoop.pipes as pp

    class WordCountMapper(pp.Mapper):

        def __init__(self, context):
            super(WordCountMapper, self).__init__(context)
            context.setStatus("initializing")
            self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

        def map(self, context):
            # Emit a count of 1 for every word in the input value.
            words = context.getInputValue().split()
            for w in words:
                context.emit(w, "1")
            context.incrementCounter(self.input_words, len(words))

    class WordCountReducer(pp.Reducer):

        def reduce(self, context):
            # Sum the counts emitted for each word.
            s = 0
            while context.nextValue():
                s += int(context.getInputValue())
            context.emit(context.getInputKey(), str(s))

    pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))

I uploaded this program to an S3 bucket and named it run. Then I used the following conf.xml:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>hadoop.pipes.executable</name>
        <value>s3://<my bucket>/run</value>
      </property>
      <property>
        <name>mapred.job.name</name>
        <value>myjobname</value>
      </property>
      <property>
        <name>hadoop.pipes.java.recordreader</name>
        <value>true</value>
      </property>
      <property>
        <name>hadoop.pipes.java.recordwriter</name>
        <value>true</value>
      </property>
    </configuration>
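
The upload step itself is not shown above; a sketch, assuming the program is saved on the master node as wordcount.py (a filename I made up) and copied to the S3 path that hadoop.pipes.executable points at:

    # Hypothetical upload of the pipes program; "<my bucket>" is a placeholder.
    hadoop fs -put wordcount.py "s3://<my bucket>/run"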

Finally, I used the following command line:

    hadoop pipes -conf conf.xml \
        -input s3://elasticmapreduce/samples/wordcount/input \
        -output s3://tmp.nou/asdf
+8

The answer above is only partially right; the solution is simpler than that. Do it as follows:

Copy this code into a bash file that you create on your machine:

bootstrap.sh:

    #!/bin/bash
    pip install pydoop
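
If the AMI you use does not ship pip, a slightly more defensive variant (my assumption, not part of this answer) would install it first:

    #!/bin/bash
    # Variant (assumption): install pip via easy_install if it is missing,
    # then install Pydoop system-wide.
    which pip > /dev/null || sudo easy_install pip
    sudo pip install pydoop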

After you finish writing this file, upload it to an S3 bucket.

Then you can add a bootstrap action to EMR:
Select "Custom action" and give it the path to the script in your S3 bucket. And that's it: you have Pydoop installed on the EMR cluster.

+2
