Spark - No FileSystem for scheme: https, cannot load files from Amazon S3

Question


I am trying to load some data from an Amazon S3 bucket:

 SparkConf sparkConf = new SparkConf().setAppName("Importer");
 JavaSparkContext ctx = new JavaSparkContext(sparkConf);
 HiveContext sqlContext = new HiveContext(ctx.sc());
 DataFrame magento = sqlContext.read().json("https://s3.eu-central-1.amazonaws.com/*/*.json");

However, this last line throws an error:

 Exception in thread "main" java.io.IOException: No FileSystem for scheme: https

The same line works in another project. What am I missing? I am running Spark on a Hortonworks CentOS virtual machine.


java amazon-s3 apache-spark

lte__ Sep 06 '16 at 18:13


1 answer

Piotr reszke · Answer 1 · 2016-09-08T07:03:10+0000

By default, Spark supports HDFS, S3, and the local file system. S3 can be accessed with the s3a:// or s3n:// schemes (see the difference between the s3a, s3n, and s3 schemes).

Therefore, to access the file, it is best to use the following:

 s3a://bucket-name/key
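As a minimal sketch of that rewrite, the helper below turns a path-style HTTPS URL like the one in the question into the s3a:// form. The class name `S3aPath`, the method `toS3a`, and the bucket and key in the example are hypothetical and not part of Spark or Hadoop. It assumes the bucket is the first path segment of the URL.

```java
import java.net.URI;

class S3aPath {
    // Hypothetical helper: converts a path-style S3 URL
    // (https://s3.<region>.amazonaws.com/<bucket>/<key>) into s3a://<bucket>/<key>.
    // Assumes the bucket name is the first segment of the URL path.
    public static String toS3a(String httpsUrl) {
        URI uri = URI.create(httpsUrl);
        String path = uri.getPath();                 // e.g. "/my-bucket/data/file.json"
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("No bucket in URL: " + httpsUrl);
        }
        String trimmed = path.substring(1);          // drop the leading "/"
        int slash = trimmed.indexOf('/');
        String bucket = slash < 0 ? trimmed : trimmed.substring(0, slash);
        String key = slash < 0 ? "" : trimmed.substring(slash + 1);
        if (bucket.isEmpty()) {
            throw new IllegalArgumentException("Empty bucket in URL: " + httpsUrl);
        }
        return "s3a://" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        // "my-bucket" and "data/file.json" are placeholder names.
        System.out.println(toS3a("https://s3.eu-central-1.amazonaws.com/my-bucket/data/file.json"));
    }
}
```

The resulting string could then be passed to sqlContext.read().json(...) in place of the https URL, provided the S3A connector and credentials are available to the Spark job.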

Depending on your Spark version and the libraries included, you may need to add external jars:

Spark read file from S3 using sc.textFile("s3n://...")

(Are you sure your previous projects accessed S3 over the https protocol? Perhaps you had custom code or jars included to support the https protocol?)
