AWS EMR and Spark 1.0.0

I recently ran into some issues trying to use Spark in an AWS EMR cluster.

I am creating a cluster using something like:

./elastic-mapreduce --create --alive \
  --name "ll_Spark_Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
  --bootstrap-name "Spark/Shark" \
  --instance-type m1.xlarge \
  --instance-count 2 \
  --ami-version 3.0.4

The problem is that whenever I try to read data from S3, I get an exception. For example, if I run the Spark shell and try something like:

 val data = sc.textFile("s3n://your_s3_data") 

I get the following exception:

 WARN storage.BlockManager: Putting block broadcast_1 failed
 java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
amazon-web-services elastic-map-reduce apache-spark
1 answer

This problem is caused by the Guava library: the version on the AMI is 11, while Spark requires version 14.

I edited the AWS bootstrap script to install Spark 1.0.2 and update the Guava library during the bootstrap action. You can get the modified script here:

https://gist.github.com/tnbredillet/867111b8e1e600fa588e
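For reference, the Guava part of the fix boils down to something like the following (a minimal sketch, not the gist verbatim; the /home/hadoop/lib path and the exact jar names are assumptions about the AMI's layout):

 #!/bin/bash
 # Hypothetical bootstrap step: swap the AMI's Guava 11 jar for Guava 14.
 # Check your AMI's actual library path and jar names before relying on this.
 set -e
 cd /home/hadoop/lib
 rm -f guava-11*.jar
 wget https://repo1.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar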

Even after updating Guava, I still had a problem. When I tried to save data to S3, I got an exception:

 lzo.GPLNativeCodeLoader - Could not load native gpl library
 java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path

I solved that by adding the native Hadoop libraries to java.library.path. When I run a job, I add the parameter

  -Djava.library.path=/home/hadoop/lib/native 

or, if I run the job through spark-submit, I add the

 --driver-library-path /home/hadoop/lib/native 

argument.
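
Put together, a full spark-submit invocation would look roughly like this (the main class and jar name are hypothetical placeholders):

 # Hypothetical example: com.example.MyJob and my-job.jar are placeholders.
 spark-submit \
   --class com.example.MyJob \
   --driver-library-path /home/hadoop/lib/native \
   my-job.jar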
