PySpark S3 access with multiple AWS credential profiles?

I am writing a PySpark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate named profiles in ~/.aws/credentials.

Is there any way to tell PySpark which profile to use when connecting to S3?

When using a single bucket, I set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/spark-env.sh. Naturally, this only lets me access one of the two buckets.

I know that I can set these values manually in PySpark when necessary, using:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")

But I would prefer a solution in which these values are not hardcoded. Is that possible?

2 answers

Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, and credentials.

  • All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per-bucket basis.
  • A bucket-specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
  • When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.

http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
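As an illustration, here is a minimal PySpark sketch of per-bucket credentials, assuming Hadoop 2.8+ with s3a; the bucket names bucket-one and bucket-two and the key values are placeholders:

# Per-bucket s3a options (Hadoop 2.8+): anything prefixed with
# fs.s3a.bucket.BUCKETNAME. applies only to that bucket.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.bucket.bucket-one.access.key", "ACCESS_KEY_ONE")
hadoop_conf.set("fs.s3a.bucket.bucket-one.secret.key", "SECRET_KEY_ONE")
hadoop_conf.set("fs.s3a.bucket.bucket-two.access.key", "ACCESS_KEY_TWO")
hadoop_conf.set("fs.s3a.bucket.bucket-two.secret.key", "SECRET_KEY_TWO")

# Each read picks up the credentials configured for its own bucket.
rdd_one = sc.textFile("s3a://bucket-one/path/to/data")
rdd_two = sc.textFile("s3a://bucket-two/path/to/data")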


s3n doesn't support AWS profiles from ~/.aws/credentials. You could try Hadoop 2.7 and the newer Hadoop S3 implementation, s3a, which uses the AWS SDK.

I'm not sure whether the current Spark release 1.6.1 works well with Hadoop 2.7, but Spark 2.0 should have no problem with Hadoop 2.7 and s3a.

For Spark 1.6.x, there is a workaround using the S3 driver from EMR; see: https://github.com/zalando/spark-appliance#emrfs-support
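If you want to avoid hardcoding the keys entirely, as the question asks, one option is to load them from the named profiles in ~/.aws/credentials at runtime and feed them into the per-bucket s3a options from the other answer. This is only a sketch; boto3 and the profile/bucket names profile_one, profile_two, bucket-one, bucket-two are assumptions, not part of the original answers:

import boto3

# Read credentials from ~/.aws/credentials by profile name, so nothing is hardcoded.
# Profile and bucket names below are placeholders.
creds_one = boto3.Session(profile_name="profile_one").get_credentials()
creds_two = boto3.Session(profile_name="profile_two").get_credentials()

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.bucket.bucket-one.access.key", creds_one.access_key)
hadoop_conf.set("fs.s3a.bucket.bucket-one.secret.key", creds_one.secret_key)
hadoop_conf.set("fs.s3a.bucket.bucket-two.access.key", creds_two.access_key)
hadoop_conf.set("fs.s3a.bucket.bucket-two.secret.key", creds_two.secret_key)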

