Your data looks like the raw bytes of a real image file (jpg?). The problem is that it should be bytes, not unicode, so you need to work out how to convert it from unicode back to bytes. That is a whole can of worms full of encoding traps, but you might get lucky with img.encode('iso-8859-1'). I do not know whether that will work for your data, and I will not deal with it further in this answer.
The raw data for a PNG image looks like this:
rawdata = '\x89PNG\r\n\x1a\n\x00\x00...\x00\x00IEND\xaeB`\x82'
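As a rough illustration of the unicode-to-bytes step, here is a minimal Python 3 sketch. It assumes the text was produced by reading each byte as the code point with the same value, which is exactly the case where encoding as iso-8859-1 gets the original bytes back. The helper name text_to_bytes and the short img sample are made up for the example; the signature check relies on the fact that every PNG file starts with the same 8 bytes.

```python
# Python 3 sketch: recover raw bytes from text whose characters are all
# in the range U+0000..U+00FF (one character per original byte).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def text_to_bytes(text):
    """Map each code point 0-255 back to the byte with the same value."""
    return text.encode("iso-8859-1")


# Hypothetical input: the start of a PNG file that arrived as a unicode string.
img = "\x89PNG\r\n\x1a\n\x00\x00"
raw = text_to_bytes(img)

print(type(raw).__name__)             # bytes
print(raw.startswith(PNG_SIGNATURE))  # True
```

If the string contains any character above U+00FF, encode raises UnicodeEncodeError, which is a strong hint that the data was decoded with some other codec and this shortcut will not recover it.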
Once you have it as bytes, you can create a PIL image from the raw data and read it into a NumPy array:
>>> from StringIO import StringIO
>>> from PIL import Image
>>> import numpy as np
>>> np.asarray(Image.open(StringIO(rawdata)))
array([[[255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        ...,
        [255, 255, 255,   0],
        [255, 255, 255,   0],
        [255, 255, 255,   0]]], dtype=uint8)
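The session above uses the Python 2 StringIO module. On Python 3 the same idea would normally go through io.BytesIO instead. The following is a sketch under that assumption, with Pillow and NumPy installed; the function name image_bytes_to_array and the tiny in-memory test image are only there for illustration and stand in for your real rawdata.

```python
from io import BytesIO

import numpy as np
from PIL import Image


def image_bytes_to_array(rawdata):
    """Decode encoded image bytes (PNG, JPEG, ...) into a NumPy array."""
    return np.asarray(Image.open(BytesIO(rawdata)))


# Build a tiny 3x2 transparent white RGBA PNG in memory,
# so the example needs no file on disk.
buffer = BytesIO()
Image.new("RGBA", (3, 2), (255, 255, 255, 0)).save(buffer, format="PNG")
rawdata = buffer.getvalue()

pixels = image_bytes_to_array(rawdata)
print(pixels.shape, pixels.dtype)  # (2, 3, 4) uint8
```

The array is indexed as (row, column, channel), so a 3-pixel-wide, 2-pixel-high RGBA image comes back with shape (2, 3, 4).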
All you need to get it working on Spark is SparkContext.binaryFiles, which gives you (path, content) pairs with each file's content as raw bytes:
>>> images = sc.binaryFiles("path/to/images/")
>>> image_to_array = lambda rawdata: np.asarray(Image.open(StringIO(rawdata)))
>>> images.values().map(image_to_array)