PySpark S3 access with multiple AWS credential profiles?

I am writing a PySpark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate named profiles in ~/.aws/credentials.

Is there any way to tell PySpark which profile to use when connecting to S3?

When using a single bucket, I set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/spark-env.sh. Naturally, this only lets me access one of the two buckets.

I know that I can set these values manually in PySpark when necessary, using:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")

But I would prefer a solution in which these values are not hardcoded. Is that possible?

2 answers

Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, and credentials.

  • All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per-bucket basis.
  • A bucket-specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
  • When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.

http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
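As an illustration, here is a minimal PySpark sketch of per-bucket credentials, assuming Hadoop 2.8+ with s3a; the bucket names bucket-one and bucket-two and the key values are placeholders:

# Per-bucket s3a options (Hadoop 2.8+): anything prefixed with
# fs.s3a.bucket.BUCKETNAME. applies only to that bucket.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.bucket.bucket-one.access.key", "ACCESS_KEY_ONE")
hadoop_conf.set("fs.s3a.bucket.bucket-one.secret.key", "SECRET_KEY_ONE")
hadoop_conf.set("fs.s3a.bucket.bucket-two.access.key", "ACCESS_KEY_TWO")
hadoop_conf.set("fs.s3a.bucket.bucket-two.secret.key", "SECRET_KEY_TWO")

# Each read picks up the credentials configured for its own bucket.
rdd_one = sc.textFile("s3a://bucket-one/path/to/data")
rdd_two = sc.textFile("s3a://bucket-two/path/to/data")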


s3n doesn't support AWS profiles from ~/.aws/credentials. You could try Hadoop 2.7 and the newer Hadoop S3 implementation, s3a, which uses the AWS SDK.

I'm not sure whether the current Spark release 1.6.1 works well with Hadoop 2.7, but Spark 2.0 should have no problem with Hadoop 2.7 and s3a.

For Spark 1.6.x, there is a workaround using the S3 driver from EMR; see: https://github.com/zalando/spark-appliance#emrfs-support
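If you want to avoid hardcoding the keys entirely, as the question asks, one option is to load them from the named profiles in ~/.aws/credentials at runtime and feed them into the per-bucket s3a options from the other answer. This is only a sketch; boto3 and the profile/bucket names profile_one, profile_two, bucket-one, bucket-two are assumptions, not part of the original answers:

import boto3

# Read credentials from ~/.aws/credentials by profile name, so nothing is hardcoded.
# Profile and bucket names below are placeholders.
creds_one = boto3.Session(profile_name="profile_one").get_credentials()
creds_two = boto3.Session(profile_name="profile_two").get_credentials()

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.bucket.bucket-one.access.key", creds_one.access_key)
hadoop_conf.set("fs.s3a.bucket.bucket-one.secret.key", creds_one.secret_key)
hadoop_conf.set("fs.s3a.bucket.bucket-two.access.key", creds_two.access_key)
hadoop_conf.set("fs.s3a.bucket.bucket-two.secret.key", creds_two.secret_key)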

