Using S3 (Frankfurt) with Spark

Does anyone use S3 in Frankfurt with Hadoop / Spark 1.6.0?

I am trying to save the result of a job to S3; my dependencies are declared as follows:

"org.apache.spark" %% "spark-core" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"), "org.apache.spark" %% "spark-sql" % "1.6.0", "org.apache.hadoop" % "hadoop-client" % "2.7.2", "org.apache.hadoop" % "hadoop-aws" % "2.7.2" 

I set the following configuration:

    System.setProperty("com.amazonaws.services.s3.enableV4", "true")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

When I call saveAsTextFile on my RDD, it starts fine and writes everything to S3. However, after a while, when it moves the output from _temporary to the final location, it throws an error:

    Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXXXX, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

If I use the hadoop-client that ships with the Spark package, the transfer does not even start. The error occurs intermittently: sometimes the job works and sometimes it does not.
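
One thing worth checking, as an assumption rather than something stated in the post: System.setProperty only affects the JVM it runs in, and parts of the commit that copy files out of _temporary run on the executors, which could explain the intermittent failures. A minimal Scala sketch of passing the V4 flag to the executor JVMs as well (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch, not the poster's code: propagate the V4-signing flag to the
    // executor JVMs in addition to the driver. The app name is a placeholder.
    // A driver-side flag normally has to go through spark-submit or
    // spark-defaults.conf, since the driver JVM is already running here.
    val conf = new SparkConf()
      .setAppName("s3a-frankfurt")
      .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")

    System.setProperty("com.amazonaws.services.s3.enableV4", "true") // driver side

    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")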

+10
scala amazon-s3 hadoop apache-spark
4 answers

Try setting the following values:

    System.setProperty("com.amazonaws.services.s3.enableV4", "true")
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
    hadoopConf.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com")

Specify the region in which your bucket is located; in my case it was eu-central-1.
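
The names hadoopConf and region are not defined in the snippet above; assuming a plain SparkContext named sc, they could be obtained like this:

    // Sketch: how hadoopConf and region in the snippet above could be defined,
    // assuming an existing SparkContext named sc.
    val region = "eu-central-1"
    val hadoopConf = sc.hadoopConfiguration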

and add the dependency via Gradle or your build tool of choice:

    dependencies {
        compile 'org.apache.hadoop:hadoop-aws:2.7.2'
    }
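
Since the question builds with sbt rather than Gradle, the equivalent sbt line (same 2.7.2 version) would be:

    // sbt equivalent of the Gradle dependency above.
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.2"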

hope this helps.

+4

If you use pyspark, the following worked for me:

 aws_profile = "your_profile" aws_region = "eu-central-1" s3_bucket = "your_bucket" # see https://github.com/jupyter/docker-stacks/issues/127#issuecomment-214594895 os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell" # If this doesn't work you might have to delete your ~/.ivy2 directory to reset your package cache. # (see https://github.com/databricks/spark-redshift/issues/244#issuecomment-239950148) import pyspark sc=pyspark.SparkContext() # see https://github.com/databricks/spark-redshift/issues/298#issuecomment-271834485 sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") # see https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark hadoop_conf=sc._jsc.hadoopConfiguration() # see https://stackoverflow.com/questions/43454117/how-do-you-use-s3a-with-spark-2-1-0-on-aws-us-east-2 hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true") hadoop_conf.set("fs.s3a.access.key", access_id) hadoop_conf.set("fs.s3a.secret.key", access_key) # see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com") sql=pyspark.sql.SparkSession(sc) path = s3_bucket + "your_file_on_s3" dataS3=sql.read.parquet("s3a://" + path) 
+3

Inspired by the other answers, running the following directly in the pyspark shell gave the desired result for me:

    sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")  # fails without this
    hc = sc._jsc.hadoopConfiguration()
    hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hc.set("com.amazonaws.services.s3.enableV4", "true")
    hc.set("fs.s3a.endpoint", end_point)
    hc.set("fs.s3a.access.key", access_key)
    hc.set("fs.s3a.secret.key", secret_key)

    data = sc.textFile("s3a://bucket/file")
    data.take(3)

Select your endpoint from the list of S3 endpoints. I managed to read data from the Asia Pacific (Mumbai) region (ap-south-1), which is a Version 4-only region.

0

The above solutions did not work for me.

My context: an EMR cluster in the us-east-1 region accessing a bucket in the us-west-2 region.

Through the Java SDK I can access it by specifying a region and a proxy.

I added the property below to the SparkContext:

    sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com");
    spark.read.textFile("s3://bucket-west3/file.txt");

I get a connection timeout:

    Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to cof-qa-fscdbx-untrst-west2.s3.amazonaws.com:443 [xxxxxxxxx-west2.s3.amazonaws.com/52.218.xx] failed: connect timed out
        at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
        at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java)
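
Since the Java SDK only works here when both a region and a proxy are given, the Spark job probably needs the proxy too. The trace above comes from the EMR-shaded filesystem used for s3:// paths, which has its own proxy settings; if the S3A connector is used instead (s3a:// paths), a sketch of the proxy configuration, with placeholder host and port values, could look like this:

    // Sketch with placeholder proxy values: route S3A traffic through the same
    // proxy the Java SDK uses. sparkContext and spark are assumed to exist as
    // in the snippet above, and the path must use the s3a:// scheme for these
    // settings to apply.
    val hc = sparkContext.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
    hc.set("fs.s3a.proxy.host", "your.proxy.host")
    hc.set("fs.s3a.proxy.port", "8080")

    val data = spark.read.textFile("s3a://bucket-west3/file.txt")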

0
