Using pyspark, reading / writing 2D images on the Hadoop file system

I want to be able to read and write images on HDFS, addressing them by their HDFS location.

I have a collection of images where each image consists of:

  • a uint16 2D array
  • basic additional information stored in an XML file

I want to create an archive on HDFS and use Spark to analyze it. Right now I'm struggling to save the data on HDFS in a way that lets me take full advantage of the Spark + HDFS combination.

From what I understand, the best way would be to create a SequenceFile wrapper (a rough sketch of what I have in mind follows below). I have two questions:

  • Is a SequenceFile wrapper the best way to do this?
  • Does anyone have a pointer to examples I could use to get started? I can't be the first person who needs to read something other than a text file from HDFS with Spark!
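For context, here is a rough sketch of the kind of SequenceFile round trip I have in mind. It is untested, and the paths, the image-id keys, and the encode_image / decode_image helpers are just placeholders; I mainly want to know whether this direction makes sense:

    import numpy as np

    # placeholder helpers: pack a uint16 2D array (plus its shape) into raw bytes and back
    def encode_image(arr):
        h, w = arr.shape
        return np.array([h, w], dtype=np.uint32).tobytes() + arr.astype(np.uint16).tobytes()

    def decode_image(buf):
        buf = bytes(buf)
        h, w = np.frombuffer(buf[:8], dtype=np.uint32)
        return np.frombuffer(buf[8:], dtype=np.uint16).reshape(h, w)

    # images = [("img_001", arr1), ("img_002", arr2), ...]  # (id, uint16 array) pairs built elsewhere
    rdd = sc.parallelize(images).mapValues(encode_image).mapValues(bytearray)
    # bytearray values should be written out as BytesWritable, string keys as Text
    rdd.saveAsSequenceFile('hdfs://localhost:9000/images_seq')

    # read the archive back and rebuild the numpy arrays
    restored = sc.sequenceFile('hdfs://localhost:9000/images_seq').mapValues(decode_image)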
hadoop apache-spark pyspark sequencefile
1 answer

I found a solution that works: PySpark 1.2.0's binaryFiles does the job. It is marked as experimental, but I was able to read TIFF images by combining it with OpenCV.

    import cv2
    import numpy as np

    # build the RDD and take one element for testing purposes
    L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)
    # convert to bytearray and then to a numpy array
    file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)
    # use OpenCV to decode the byte array
    R = cv2.imdecode(file_bytes, 1)
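Going past take(1), the same decode step can be mapped over the whole RDD. This is only a sketch under the same setup (the path and the to_array helper are mine, not part of any API); cv2.IMREAD_UNCHANGED should keep the original uint16 depth instead of forcing the 8-bit colour conversion that flag 1 does:

    import cv2
    import numpy as np

    def to_array(path_and_bytes):
        path, content = path_and_bytes
        buf = np.asarray(bytearray(content), dtype=np.uint8)
        # IMREAD_UNCHANGED keeps the original bit depth of the TIFF
        return path, cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)

    images = sc.binaryFiles('hdfs://localhost:9000/*.tif').map(to_array)
    print(images.count())

Note that OpenCV and numpy have to be installed on every worker node for the map to run.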

Check out pyspark's help:

    binaryFiles(path, minPartitions=None)

        .. note:: Experimental

        Read a directory of binary files from HDFS, a local file system
        (available on all nodes), or any Hadoop-supported file system URI
        as a byte array. Each file is read as a single record and returned
        in a key-value pair, where the key is the path of each file, the
        value is the content of each file.

        Note: Small files are preferred, large file is also allowable, but
        may cause bad performance.
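The XML metadata mentioned in the question could be handled in a similar way. A hedged sketch, reusing the images RDD from the snippet above and assuming one XML sidecar per image with the same basename (the strip_ext helper and the flat tag-to-text parsing are placeholders, not a prescribed layout):

    import os
    import xml.etree.ElementTree as ET

    def strip_ext(path):
        return os.path.splitext(os.path.basename(path))[0]

    # read each XML file whole and parse it into a plain dict of tag -> text
    meta = (sc.wholeTextFiles('hdfs://localhost:9000/*.xml')
              .map(lambda kv: (strip_ext(kv[0]),
                               dict((el.tag, el.text) for el in ET.fromstring(kv[1])))))

    # key the decoded images by basename and join them with their metadata
    pairs = images.map(lambda kv: (strip_ext(kv[0]), kv[1])).join(meta)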