Importing external libraries into a Hadoop MapReduce script

I am running a Python MapReduce script on top of Amazon's EMR implementation of Hadoop. As output from the main script I get item-to-item similarities. In a post-processing step I want to split this output into a separate S3 bucket for each item, so that each item's bucket contains a list of the items similar to it. For this I want to use Amazon's boto Python library inside the post-processing (reduce) function, roughly along the lines of the sketch after the two questions below.

  • How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
  • Is it possible to access S3 in this way from inside the Hadoop environment?
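
A minimal sketch of what such a reduce step might look like, assuming tab-separated item/similar-item pairs sorted by item on stdin and the boto 2.x S3 API; the bucket-naming scheme and all names below are illustrative assumptions, not part of the original question:

    #!/usr/bin/env python
    # Sketch of a streaming reducer that writes each item's list of similar
    # items to its own S3 bucket with boto (names and input format assumed).
    import sys
    import boto

    def write_similar_items(conn, item, similar):
        # One bucket per item, as described above; real bucket names must be
        # globally unique, so this naming scheme is only illustrative.
        bucket = conn.create_bucket('similar-items-%s' % item.lower())
        key = bucket.new_key('similar_items.txt')
        key.set_contents_from_string('\n'.join(similar))

    def main():
        conn = boto.connect_s3()  # credentials from environment / boto config
        current, similar = None, []
        for line in sys.stdin:
            item, other = line.rstrip('\n').split('\t', 1)
            if current is not None and item != current:
                write_similar_items(conn, current, similar)
                similar = []
            current = item
            similar.append(other)
        if current is not None:
            write_similar_items(conn, current, similar)

    if __name__ == '__main__':
        main()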

Thanks in advance, Thomas

1 answer

When the hadoop job is started, you can specify external files that should be made available to it. This is done using the -files argument.

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
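
For a Python streaming job like yours, the same idea looks roughly like this; the streaming jar location and the input/output paths below are assumptions that depend on the Hadoop version and cluster layout:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -files mapper.py,reducer.py,hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat \
        -mapper mapper.py \
        -reducer reducer.py \
        -input input/ \
        -output output/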

I don't know whether the files have to be on HDFS, but if this is a job that will run often, it wouldn't be a bad idea to put them there. From the code you can then do something like this:

    if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
        List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
        for (Path localFile : localFiles) {
            // look up the distributed file by name
            if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
                File dataFile = new File(localFile.toUri().getPath());
            }
        }
    }

This is, more or less, copied and pasted directly from working code inside several of our mappers.
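
For a Python streaming reducer the lookup is simpler: files shipped with -files are symlinked into the task's working directory, so something like the following should be enough (mylib is a hypothetical module name shipped with the job):

    import os, sys
    # Files distributed with -files are placed in the task's current
    # working directory; make sure it is on the module search path.
    sys.path.insert(0, os.getcwd())
    import mylib  # hypothetical module shipped via -files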

I don't know about the second part of your question. Hopefully the answer to the first part gets you started. :)

In addition to -files there is -libjars for including additional jars; I have a little more information about that here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?
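
For example (the extra jar paths are made up, and this relies on the job using ToolRunner/GenericOptionsParser so that -libjars is actually honoured):

    $HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar \
        -libjars /path/to/extra-dependency.jar,/path/to/another-dependency.jar \
        -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat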

