I finally got this working. Everything happens on the master node ... ssh to that node as the hadoop user.
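(For reference, reaching the master node typically looks like the line below; the key file path and hostname are placeholders for your own EC2 key pair and the master's public DNS name.)

ssh -i /path/to/your-keypair.pem hadoop@<master-public-dns>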
You need the following packages:
sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev
To build everything:
wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JVM_ARCH=64  # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/
cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr
cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
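A quick sanity check at this point (the paths below are just what the steps above should have produced, not anything Pydoop itself requires): if the native library landed under /usr/lib and the Python package imports cleanly, the build worked.

ls /usr/lib/libhdfs.so*
python -c "import pydoop.pipes; print 'pydoop pipes OK'"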
If you save those two archives, in the future you can skip the build part and just do the install steps below (you would still need to figure out how to run this as a bootstrap action so it gets installed on every node in the cluster; a rough sketch follows the install commands):
sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /
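I have not actually wired this up as a bootstrap action, but a minimal sketch would be a shell script like the one below, stored in S3 and registered as a bootstrap action when the job flow is created. It assumes you have uploaded the two archives to a bucket of your own (the bucket name is a placeholder) and that the hadoop client is already present on the node at bootstrap time, as it normally is on EMR; otherwise fetch the files some other way (for example wget against a public URL).

#!/bin/bash
# hypothetical bootstrap action: install the prebuilt archives on every node
set -e
sudo easy_install argparse importlib
sudo apt-get -y update
sudo apt-get -y install libboost-python-dev
# bucket name is a placeholder -- use wherever you stored the two archives
hadoop fs -get s3://<my bucket>/libhdfs.tar.gz /tmp/libhdfs.tar.gz
hadoop fs -get s3://<my bucket>/pydoop-0.6.0.linux-x86_64.tar.gz /tmp/pydoop.tar.gz
sudo tar xzf /tmp/libhdfs.tar.gz -C /usr
sudo tar xzf /tmp/pydoop.tar.gz -C /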
Then I was able to run the sample program that uses the full Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self)).
#!/usr/bin/python

import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        # emit each word with a count of 1 and track how many words were seen
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

    def reduce(self, context):
        # sum the counts emitted for this key
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))
I uploaded this program to a bucket and named it run. Then I used the following conf.xml:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.pipes.executable</name>
    <value>s3://<my bucket>/run</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>myjobname</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
</configuration>
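For completeness, the upload mentioned above can be done straight from the master node; this is only a sketch (the local filename is hypothetical, and the bucket placeholder matches the one in conf.xml), and any S3 client works just as well:

# "run" must match the hadoop.pipes.executable value in conf.xml
hadoop fs -put wordcount.py s3://<my bucket>/run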
Finally, I used the following command line:
hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf
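Once the job finishes, the results show up as ordinary part files under the output prefix. A quick way to look at them from the master node (the exact part file names may differ, but part-00000 is the usual default with the Java record writer used above):

hadoop fs -ls s3://tmp.nou/asdf
hadoop fs -cat s3://tmp.nou/asdf/part-00000 | head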