Using S3 (Frankfurt) with Spark

Does anyone use S3 in Frankfurt with Hadoop / Spark 1.6.0?

I am trying to save the result of a job to S3; my dependencies are declared as follows:

"org.apache.spark" %% "spark-core" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"), "org.apache.spark" %% "spark-sql" % "1.6.0", "org.apache.hadoop" % "hadoop-client" % "2.7.2", "org.apache.hadoop" % "hadoop-aws" % "2.7.2" 

I set the following configuration:

    System.setProperty("com.amazonaws.services.s3.enableV4", "true")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

When I call saveAsTextFile on my RDD, it starts fine and writes everything to S3. However, after a while, when it moves the output from _temporary to the final location, it throws an error:

    Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXXXX, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143)
        at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134)
        at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

If I use the hadoop-client that ships with the Spark package, the transfer does not even start. The error occurs intermittently: sometimes the job works and sometimes it does not.
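
One thing worth checking, as an assumption rather than something stated in the post: System.setProperty only affects the JVM it runs in, and parts of the commit that copy files out of _temporary run on the executors, which could explain the intermittent failures. A minimal Scala sketch of passing the V4 flag to the executor JVMs as well (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch, not the poster's code: propagate the V4-signing flag to the
    // executor JVMs in addition to the driver. The app name is a placeholder.
    // A driver-side flag normally has to go through spark-submit or
    // spark-defaults.conf, since the driver JVM is already running here.
    val conf = new SparkConf()
      .setAppName("s3a-frankfurt")
      .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")

    System.setProperty("com.amazonaws.services.s3.enableV4", "true") // driver side

    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")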

+10
scala amazon-s3 hadoop apache-spark
4 answers

Try setting the following values:

    System.setProperty("com.amazonaws.services.s3.enableV4", "true")
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
    hadoopConf.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com")

Specify the region in which your bucket is located; in my case it was eu-central-1.
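
The names hadoopConf and region are not defined in the snippet above; assuming a plain SparkContext named sc, they could be obtained like this:

    // Sketch: how hadoopConf and region in the snippet above could be defined,
    // assuming an existing SparkContext named sc.
    val region = "eu-central-1"
    val hadoopConf = sc.hadoopConfiguration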

and add the dependency via Gradle or your build tool of choice:

    dependencies {
        compile 'org.apache.hadoop:hadoop-aws:2.7.2'
    }
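
Since the question builds with sbt rather than Gradle, the equivalent sbt line (same 2.7.2 version) would be:

    // sbt equivalent of the Gradle dependency above.
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.2"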

hope this helps.

+4

If you use pyspark, the following worked for me:

 aws_profile = "your_profile" aws_region = "eu-central-1" s3_bucket = "your_bucket" # see https://github.com/jupyter/docker-stacks/issues/127#issuecomment-214594895 os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell" # If this doesn't work you might have to delete your ~/.ivy2 directory to reset your package cache. # (see https://github.com/databricks/spark-redshift/issues/244#issuecomment-239950148) import pyspark sc=pyspark.SparkContext() # see https://github.com/databricks/spark-redshift/issues/298#issuecomment-271834485 sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") # see https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark hadoop_conf=sc._jsc.hadoopConfiguration() # see https://stackoverflow.com/questions/43454117/how-do-you-use-s3a-with-spark-2-1-0-on-aws-us-east-2 hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true") hadoop_conf.set("fs.s3a.access.key", access_id) hadoop_conf.set("fs.s3a.secret.key", access_key) # see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com") sql=pyspark.sql.SparkSession(sc) path = s3_bucket + "your_file_on_s3" dataS3=sql.read.parquet("s3a://" + path) 
+3

Inspired by the other answers, running the following directly in the pyspark shell gave the desired result for me:

    sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")  # fails without this
    hc = sc._jsc.hadoopConfiguration()
    hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hc.set("com.amazonaws.services.s3.enableV4", "true")
    hc.set("fs.s3a.endpoint", end_point)
    hc.set("fs.s3a.access.key", access_key)
    hc.set("fs.s3a.secret.key", secret_key)

    data = sc.textFile("s3a://bucket/file")
    data.take(3)

Select your endpoint from the list of S3 endpoints. I managed to read data from the Asia Pacific (Mumbai) region (ap-south-1), which is a Version 4-only region.

0

The above solutions did not work for me.

My context: an EMR cluster in the us-east-1 region accessing a bucket in the us-west-2 region.

Through the Java SDK I can access it by specifying a region and a proxy.

I added the property below to the SparkContext:

    sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com");
    spark.read.textFile("s3://bucket-west3/file.txt");

I get a connection timeout:

    Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to cof-qa-fscdbx-untrst-west2.s3.amazonaws.com:443 [xxxxxxxxx-west2.s3.amazonaws.com/52.218.xx] failed: connect timed out
        at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
        at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java)
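
Since the Java SDK only works here when both a region and a proxy are given, the Spark job probably needs the proxy too. The trace above comes from the EMR-shaded filesystem used for s3:// paths, which has its own proxy settings; if the S3A connector is used instead (s3a:// paths), a sketch of the proxy configuration, with placeholder host and port values, could look like this:

    // Sketch with placeholder proxy values: route S3A traffic through the same
    // proxy the Java SDK uses. sparkContext and spark are assumed to exist as
    // in the snippet above, and the path must use the s3a:// scheme for these
    // settings to apply.
    val hc = sparkContext.hadoopConfiguration
    hc.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
    hc.set("fs.s3a.proxy.host", "your.proxy.host")
    hc.set("fs.s3a.proxy.port", "8080")

    val data = spark.read.textFile("s3a://bucket-west3/file.txt")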

0
