Importing external libraries into a Hadoop MapReduce script

I am running a Python MapReduce script on top of Amazon's EMR implementation of Hadoop. As output from the main script I get item-to-item similarities. In a post-processing step I want to split this output into a separate S3 bucket for each item, so that each item's bucket contains a list of the items similar to it. For this I want to use Amazon's boto Python library inside the post-processing (reduce) function, roughly along the lines of the sketch after the two questions below.

  • How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
  • Is it possible to access S3 in this way from inside the Hadoop environment?
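
A minimal sketch of what such a reduce step might look like, assuming tab-separated item/similar-item pairs sorted by item on stdin and the boto 2.x S3 API; the bucket-naming scheme and all names below are illustrative assumptions, not part of the original question:

    #!/usr/bin/env python
    # Sketch of a streaming reducer that writes each item's list of similar
    # items to its own S3 bucket with boto (names and input format assumed).
    import sys
    import boto

    def write_similar_items(conn, item, similar):
        # One bucket per item, as described above; real bucket names must be
        # globally unique, so this naming scheme is only illustrative.
        bucket = conn.create_bucket('similar-items-%s' % item.lower())
        key = bucket.new_key('similar_items.txt')
        key.set_contents_from_string('\n'.join(similar))

    def main():
        conn = boto.connect_s3()  # credentials from environment / boto config
        current, similar = None, []
        for line in sys.stdin:
            item, other = line.rstrip('\n').split('\t', 1)
            if current is not None and item != current:
                write_similar_items(conn, current, similar)
                similar = []
            current = item
            similar.append(other)
        if current is not None:
            write_similar_items(conn, current, similar)

    if __name__ == '__main__':
        main()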

Thanks in advance, Thomas

1 answer

When the hadoop job is started, you can specify external files that should be made available to it. This is done using the -files argument.

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
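
For a Python streaming job like yours, the same idea looks roughly like this; the streaming jar location and the input/output paths below are assumptions that depend on the Hadoop version and cluster layout:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -files mapper.py,reducer.py,hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat \
        -mapper mapper.py \
        -reducer reducer.py \
        -input input/ \
        -output output/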

I don't know whether the files have to be on HDFS, but if this is a job that will run often, it wouldn't be a bad idea to put them there. From the code you can then do something like this:

    if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
        List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
        for (Path localFile : localFiles) {
            // look up the distributed file by name
            if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
                File dataFile = new File(localFile.toUri().getPath());
            }
        }
    }

This is, more or less, copied and pasted directly from working code inside several of our mappers.
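
For a Python streaming reducer the lookup is simpler: files shipped with -files are symlinked into the task's working directory, so something like the following should be enough (mylib is a hypothetical module name shipped with the job):

    import os, sys
    # Files distributed with -files are placed in the task's current
    # working directory; make sure it is on the module search path.
    sys.path.insert(0, os.getcwd())
    import mylib  # hypothetical module shipped via -files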

I don't know about the second part of your question. Hopefully the answer to the first part gets you started. :)

In addition to -files there is -libjars for including additional jars; I have a little more information about that here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?
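
For example (the extra jar paths are made up, and this relies on the job using ToolRunner/GenericOptionsParser so that -libjars is actually honoured):

    $HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar \
        -libjars /path/to/extra-dependency.jar,/path/to/another-dependency.jar \
        -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat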

