Reading data from Azure Blob with Spark

I'm having trouble reading data from Azure Blob storage using a Spark stream:

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory"); 

Code like the above works for HDFS, but it cannot read files from an Azure blob container.

 https://blobstorage.blob.core.windows.net/containerid/folder1/ 

The path above is what is shown in the Azure portal UI, but it does not work. What am I missing, and how can I read from this location?

I know Event Hubs is an ideal choice for streaming data, but my current situation requires using blob storage rather than queues.

java azure apache-spark azure-storage-blobs spark-streaming
2 answers

There are two things you need to do to read data from blob storage. First, you need to tell Spark which native file system to use in the underlying Hadoop configuration. This also means you will need the hadoop-azure JAR on your classpath (note that there may be runtime requirements for additional JARs from the Hadoop family):

 JavaSparkContext ct = new JavaSparkContext();
 Configuration config = ct.hadoopConfiguration();
 config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
 config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");

Now reference the file using the wasb:// prefix (the optional [s] selects a secure SSL connection):

 ssc.textFileStream("wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>"); 
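As a side note, the wasb URI is assembled from the container name, the storage account, and the path inside the container. The small helper below (plain Java, purely illustrative, and not part of any Spark or Hadoop API) sketches the expected shape:

```java
// Hypothetical helper showing the wasb[s]:// URI layout; not a Spark or Hadoop API.
public class WasbUri {
    // secure == true selects wasbs:// (SSL); false selects plain wasb://
    static String build(String container, String account, String path, boolean secure) {
        String scheme = secure ? "wasbs" : "wasb";
        return scheme + "://" + container + "@" + account + ".blob.core.windows.net/" + path;
    }

    public static void main(String[] args) {
        // e.g. wasb://mycontainer@myaccount.blob.core.windows.net/folder1/
        System.out.println(build("mycontainer", "myaccount", "folder1/", false));
    }
}
```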

It goes without saying that you will need the appropriate permissions set on the blob storage for the location making the request.
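Putting both steps together, a minimal sketch of the whole setup might look like the following. The account name, key, container, and path are placeholders you must replace with your own values; this is wiring for illustration under those assumptions, not a tested program:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BlobStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("BlobStream");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Step 1: point the underlying Hadoop configuration at the native Azure file system
        Configuration hadoopConf = ssc.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        hadoopConf.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");

        // Step 2: monitor the container path for new text files via the wasb:// prefix
        JavaDStream<String> lines =
            ssc.textFileStream("wasb://mycontainer@youraccount.blob.core.windows.net/folder1/");
        lines.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```

The hadoop-azure JAR (and the azure-storage SDK JAR it depends on) must be on the classpath for the NativeAzureFileSystem class to resolve at runtime.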


Additionally, there is a very useful tutorial on using Azure Blob storage as HDFS-compatible storage with Hadoop; see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage .

There is also an official sample on GitHub for Spark Streaming on Azure. The sample is written in Scala, but it should still be useful for you.

