For image processing on Hadoop, the best way to organize the computation is:
- Save the images in a sequence file, where the key is the image name or its identifier and the value is the binary image data. That way you have one file containing all the images you need to process. If images are added to your system dynamically, consider combining them into daily sequence files. I don't think you should use compression for this sequence file, since general-purpose compression algorithms do not work well on images (most image formats are already compressed); see the writer sketch after this list.
- Process the images. Here you have several choices. First, use Hadoop MapReduce and write the program in Java: Java can read the sequence file directly, so at each map step you get the "value" as binary data and can run any processing logic on it (a mapper sketch follows this list). The second option is Hadoop Streaming. It has the limitation that all data goes to the stdin of your application and the result is read from stdout, but you can overcome this by writing your own InputFormat in Java that serializes the binary image data from the sequence file as a Base64 string and passes it to your streaming application (also sketched after this list). A third option is to use Spark to process the data, but again you are limited in programming languages: Scala, Java, or Python.
- Hadoop was designed to simplify batch processing of large amounts of data. Importantly, Spark is also a batch tool at its core, which means you cannot get a result before all the data has been processed. Spark Streaming is a slightly different case: there you work with micro-batches of 1-10 seconds and process each of them separately, so in general you can make it work for your business.
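To make the storage step concrete, here is a minimal sketch of packing local image files into a daily sequence file, assuming `Text` keys (file names) and `BytesWritable` values; the HDFS path is a made-up example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path("hdfs:///data/images/2024-01-15.seq"); // hypothetical daily file

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // no compression: JPEG/PNG payloads are already compressed
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE));
        try {
            // each command-line argument is a local image file to pack
            for (String fileName : args) {
                byte[] bytes = Files.readAllBytes(Paths.get(fileName));
                writer.append(new Text(fileName), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```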
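For the MapReduce option, a mapper over such a file could look like the sketch below; the emitted status line is a placeholder for real processing, and the job must set `SequenceFileInputFormat` as its input format:

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Receives (imageName, imageBytes) pairs straight from the sequence file;
// requires job.setInputFormatClass(SequenceFileInputFormat.class) on the job.
public class ImageMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // copyBytes() trims the writable's internal buffer to the actual image size
        byte[] image = value.copyBytes();
        // ... run any image-processing logic here ...
        context.write(key, new Text("processed " + image.length + " bytes"));
    }
}
```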
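For the Hadoop Streaming option, the custom InputFormat mentioned above could look roughly like this. Hadoop Streaming still relies on the older `org.apache.hadoop.mapred` API, so the sketch uses it; the class name is hypothetical:

```java
import java.io.IOException;
import java.util.Base64;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

// Wraps the standard sequence-file reader and re-encodes each binary image
// as a Base64 Text value so it survives Streaming's stdin/stdout text protocol.
public class Base64ImageInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                                                    Reporter reporter) throws IOException {
        final SequenceFileRecordReader<Text, BytesWritable> inner =
                new SequenceFileRecordReader<>(job, (FileSplit) split);

        return new RecordReader<Text, Text>() {
            private final BytesWritable bytes = new BytesWritable(); // reused buffer

            public boolean next(Text key, Text value) throws IOException {
                if (!inner.next(key, bytes)) return false;
                value.set(Base64.getEncoder().encodeToString(bytes.copyBytes()));
                return true;
            }
            public Text createKey()   { return new Text(); }
            public Text createValue() { return new Text(); }
            public long getPos() throws IOException { return inner.getPos(); }
            public float getProgress() throws IOException { return inner.getProgress(); }
            public void close() throws IOException { inner.close(); }
        };
    }
}
```

You would pass it to the Streaming job with `-inputformat` (plus the jar on the classpath), and your script would then read `key<TAB>base64data` lines from stdin.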
I do not know your complete use case, but one possible solution is Kafka + Spark Streaming: your application places the images in binary form on a Kafka topic, while Spark consumes and processes them in micro-batches on the cluster, updating users through some third component (at a minimum by publishing the image-processing status to Kafka for another application to pick up). A minimal sketch of the consuming side follows below.
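This sketch assumes the spark-streaming-kafka-0-10 integration; the broker address, topic name, and group id are placeholders:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ImageStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("image-processing");
        // 5-second micro-batches, inside the 1-10 s range mentioned above
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");      // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class); // raw image bytes
        kafkaParams.put("group.id", "image-processors");

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
                KafkaUtils.createDirectStream(
                        ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, byte[]>Subscribe(
                                Collections.singletonList("images-in"), kafkaParams));

        stream.foreachRDD(rdd -> rdd.foreach(record -> {
            byte[] image = record.value();
            // ... process the image, then publish a status message to another
            // Kafka topic for the user-facing component to pick up ...
        }));

        ssc.start();
        ssc.awaitTermination();
    }
}
```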
But in general, the information you have provided is not complete enough to recommend a good architecture for your particular case.