I am writing a PySpark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.
Is there any way to tell pyspark which profile to use when connecting to s3?
When using a single bucket, I set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/spark-env.sh. Naturally, this only works for one of the two buckets.
I know that I can set these values manually in pyspark when necessary, using:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")
But I would prefer a solution in which these values are not hard-coded. Is that possible?
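One idea I had (untested, and the bucket and profile names below are just placeholders) is to look the keys up from the named profiles with boto3 at runtime and set them per bucket, so nothing is hard-coded in the job itself. As far as I can tell, per-bucket configuration keys are an s3a feature rather than s3n:

import boto3

def set_bucket_credentials(sc, bucket, profile):
    # Read the keys for the given profile from ~/.aws/credentials
    creds = boto3.Session(profile_name=profile).get_credentials()
    hconf = sc._jsc.hadoopConfiguration()
    # Per-bucket configuration keys (s3a only, as far as I understand)
    hconf.set("fs.s3a.bucket.{}.access.key".format(bucket), creds.access_key)
    hconf.set("fs.s3a.bucket.{}.secret.key".format(bucket), creds.secret_key)

# "bucket-a"/"profile_a" etc. stand in for my actual bucket and profile names
set_bucket_credentials(sc, "bucket-a", "profile_a")
set_bucket_credentials(sc, "bucket-b", "profile_b")

Would something along those lines work, or is there a more direct way to point Spark at a named profile?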