Spark - No FileSystem for scheme: https, cannot download files from Amazon S3

I am trying to load some data from an Amazon S3 bucket:

SparkConf sparkConf = new SparkConf().setAppName("Importer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());
DataFrame magento = sqlContext.read().json("https://s3.eu-central-1.amazonaws.com/*/*.json");

However, this last line throws an error:

 Exception in thread "main" java.io.IOException: No FileSystem for scheme: https 

The same line works in another project. What am I missing? I am running Spark on a Hortonworks CentOS virtual machine.

1 answer

By default, Spark supports HDFS, S3, and the local file system. S3 is accessed through the s3a:// or s3n:// schemes rather than https:// (see the difference between the s3a, s3n, and s3 protocols).

Therefore, to access the file, it is best to use the following:

 s3a://bucket-name/key 
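
If you already have path-style HTTPS URLs like the one in the question, they can be rewritten into the s3a:// form before being passed to Spark. Below is a minimal sketch using only the JDK. The class name S3PathConverter, the method toS3a, and the bucket name my-bucket are made up for illustration, and it assumes the URL has the form https://s3.<region>.amazonaws.com/<bucket>/<key>; virtual-hosted URLs, where the bucket is part of the host name, are not handled.

```java
import java.net.URI;

// Hypothetical helper, not part of Spark or Hadoop.
final class S3PathConverter {

    // Rewrites a path-style S3 URL (https://s3.<region>.amazonaws.com/<bucket>/<key>)
    // into the s3a://<bucket>/<key> form expected by the S3A connector.
    static String toS3a(String httpsUrl) {
        URI uri = URI.create(httpsUrl);
        String host = uri.getHost();
        boolean pathStyleHost = host != null
                && (host.startsWith("s3.") || host.startsWith("s3-"))
                && host.endsWith(".amazonaws.com");
        if (!pathStyleHost) {
            throw new IllegalArgumentException("Not a path-style S3 URL: " + httpsUrl);
        }
        // In path-style URLs the path is "/<bucket>/<key>", so dropping the
        // leading slash leaves exactly "<bucket>/<key>".
        String path = uri.getRawPath();
        String bucketAndKey = path.startsWith("/") ? path.substring(1) : path;
        if (bucketAndKey.isEmpty()) {
            throw new IllegalArgumentException("URL has no bucket: " + httpsUrl);
        }
        return "s3a://" + bucketAndKey;
    }

    public static void main(String[] args) {
        // "my-bucket" is a placeholder bucket name.
        System.out.println(toS3a("https://s3.eu-central-1.amazonaws.com/my-bucket/data/events.json"));
    }
}
```

The resulting string is what you would pass to sqlContext.read().json(...) in place of the https:// URL. Credentials and the connector itself still have to be configured separately.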

Depending on your Spark version and the libraries included, you may need to add external JARs:

Spark read file from S3 using sc.textFile("s3n://...")

(Are you sure your earlier projects accessed S3 over the https protocol? Perhaps they included custom code or extra JARs to support https?)
