Using Amazon MapReduce / Hadoop for Image Processing

I have a project that requires me to process a large number (1,000 to 10,000) of large (100 MB to 500 MB) images. The processing I need can be done with ImageMagick, but I was hoping to do it on the Amazon Elastic MapReduce platform (which, as I understand it, runs on Hadoop).

All of the examples I have found deal with text input (I have seen the Word Count example a billion times). I can't find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each file, and then writing out each result as its own file.

I am pretty sure this can be done with this platform, and should be able to be done using Bash; I don't think I need to go to the trouble of creating a whole Java application or anything like that, but I could be wrong.

I'm not asking anyone to hand me code, but if anyone has sample code or links to tutorials that deal with similar problems, it would be much appreciated...

+4
4 answers

There are several problems with your task.

Hadoop does not handle images natively, as you have seen. But you can export all the file names and paths into a text file and run a map function over that list. Calling ImageMagick on the files on the local disk is then not a big deal.
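A minimal sketch of that first step, assuming the images sit under a path like /data/images and that the list goes to a made-up HDFS location (paths and extension are placeholders):

#!/usr/bin/env bash
# Build a plain-text list of image paths and make it the job input.
# /data/images, *.tif and the HDFS destination are assumptions.
find /data/images -type f -name '*.tif' > image-list.txt
$HADOOP_INSTALL/bin/hadoop fs -put image-list.txt /user/me/image-list.txt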

But how do you deal with data locality?

You can't run ImageMagick directly on files in HDFS (there is only the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling. So, for example, a map task can be scheduled to a host where the image does not exist.

Of course, you could just use a single machine and a single task. But then you gain no improvement; you would just have the overhead of Hadoop.

There is also a memory problem when you shell out to an external process from a Java task. I made a blog post about it [1].

"and should be able to be done using Bash"

That is the next problem: you will need to write at least the map task yourself. You need a ProcessBuilder to call ImageMagick with a specific path and conversion function.
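If you go through Hadoop Streaming rather than the Java API, that map task can itself be a small bash script that shells out to ImageMagick. A minimal sketch, assuming NLineInputFormat feeds it "offset <tab> path" lines; the resize options and /tmp output path are placeholders, and the locality caveat above still applies:

#!/usr/bin/env bash
# Streaming mapper sketch: convert each image named on stdin.
# Paths containing spaces would need extra care.
while read -r offset img; do
  echo "reporter:status:converting $img" >&2
  convert "$img" -resize 50% "/tmp/$(basename "$img").png" ||
    echo "reporter:status:failed $img" >&2
done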

"I can't find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each file, and then writing out each result as its own file."

Guess why? :D Hadoop is not suited to this task.

Basically, I would recommend splitting your images manually across several hosts in EC2 and running a bash script on each of them. That is less stressful and faster. To parallelize on a single host, split the files into one folder per core and run the bash script against each folder. That should utilize your machines well enough, and better than Hadoop ever could.
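As a variation on hand-made per-core folders, here is a sketch of one way to keep every core of a single host busy using GNU xargs; the input/output paths, extension, and convert options are assumptions:

#!/usr/bin/env bash
# Run one ImageMagick convert per file, with as many parallel workers as cores.
mkdir -p /data/out
find /data/images -type f -name '*.tif' -print0 |
  xargs -0 -P "$(nproc)" -I{} sh -c \
    'convert "$1" -resize 50% "/data/out/$(basename "$1" .tif).png"' _ {}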

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html

+9

I would suggest looking at the example in "Hadoop: The Definitive Guide", 3rd edition. Appendix C describes a method, in bash, for getting a file (from HDFS), unpacking it into a newly created folder, creating a new file from the contents of that unpacked folder, and then putting that file into another HDFS location.

I adapted this script myself, so the initial hadoop fs -get becomes a curl call to the web server that hosts the input files I need; I did not want to put all the source files into HDFS first. If your files are already in HDFS, you can just use the commented-out line instead. The hadoop get or the curl ensures the file is available locally to the task. There is a lot of network overhead in this.

There is no need for a reduce step.

The input file is a list of the file URLs to download and convert.

#!/usr/bin/env bash

# NLineInputFormat gives a single line: key is offset, value is Isotropic Url
read offset isofile

# Retrieve file from Isotropic server to local disk
echo "reporter:status:Retrieving $isofile" >&2
target=`echo $isofile | awk '{split($0,a,"/");print a[5] a[6]}'`
filename=$target.tar.bz2
#$HADOOP_INSTALL/bin/hadoop fs -get $isofile ./$filename
curl $isofile -o $filename

# Un-bzip and un-tar the local file
mkdir -p $target
echo "reporter:status:Un-tarring $filename to $target" >&2
tar jxf $filename -C $target

# Take the file and do what you want with it.
echo "reporter:status:Converting $target" >&2
imagemagick convert .... $target/$filename $target.all

# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
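For completeness, a sketch of how a script like this would typically be launched, via Hadoop Streaming with NLineInputFormat so that each map task receives one URL; the streaming jar path, input/output names, mapper file name, and timeout value are assumptions:

# No reducers; speculative execution off so files are not fetched twice;
# a generous task timeout because each task downloads and converts a large file.
$HADOOP_INSTALL/bin/hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*streaming*.jar \
  -D mapred.reduce.tasks=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.task.timeout=12000000 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input url-list.txt \
  -output converted \
  -mapper convert_image.sh \
  -file convert_image.sh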

The New York Times processed 4 TB of raw image data into PDFs in 24 hours using Hadoop, and it sounds like they took a similar approach: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse . They used the Java API, but the rest is the same: get the file locally, process it, and then put it back into HDFS/S3.

+4

You could take a look at CombineFileInputFormat in Hadoop, which can implicitly combine multiple files into a single split, generating the splits based on the files.

But I'm not sure how you are going to process the 100 MB-500 MB images, since each one is quite large, in fact larger than Hadoop's split size. Perhaps you could try a different approach and split each image into several parts.
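If you do keep whole images in HDFS, one possible workaround for the "image larger than a split" issue is to load them with a block size bigger than the largest image, so that each file lands in a single block and a single map task sees the whole file; a sketch, with the 512 MB size and the paths as assumptions:

# Old-style property name (newer releases use dfs.blocksize); 536870912 bytes = 512 MB.
hadoop fs -D dfs.block.size=536870912 -put big_image.tif /images/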

In any case, good luck.

0

I have been looking for solutions for processing large-scale remote sensing images in Hadoop for a long time, and I haven't found anything so far!

Here is an open source project about splitting a large-scale image into smaller ones in Hadoop. I read the code carefully and tested it, but I found the performance is not as good as expected. In any case, it may be useful and shed some light on the problem.

Matsu Project: http://www.cloudbook.net/directories/research-clouds/research-project.php?id=100057

Good luck

0
