I am running my code on an Amazon EMR cluster with pyspark. These are the steps I followed:
1) Add this bootstrap action when creating the cluster (it sets up a localhost Elasticsearch server):
s3:
2) Run these commands to populate the Elasticsearch database with some data:
curl -XPUT "http://localhost:9200/movies/movie/1" -d' { "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972 }'
You can also run other curl commands if you want, for example a search (quote the URL so the shell does not interpret the `&`; with no `q` parameter this returns everything, like a match_all):
curl -XGET "http://localhost:9200/_search?pretty=true"
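The same indexing request from step 2 can also be built from Python with only the standard library (a hedged sketch, no extra client assumed; actually sending it requires the local Elasticsearch from step 1 to be running, so the final call is left commented out):

```python
import json
import urllib.request

# Same document and URL as the curl -XPUT command above.
doc = {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}

req = urllib.request.Request(
    "http://localhost:9200/movies/movie/1",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

# urllib.request.urlopen(req)  # uncomment with a live Elasticsearch on localhost:9200
```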
3) I started pyspark with the following options:
pyspark --driver-memory 5G --executor-memory 10G --executor-cores 2 --jars=elasticsearch-hadoop-5.5.1.jar
I had previously downloaded the elasticsearch-hadoop-5.5.1.jar referenced above.
4) I ran the following code (sc is provided by the pyspark shell):
from pyspark import SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

q = """{
  "query": {
    "match_all": {}
  }
}"""

es_read_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "movies/movie",
    "es.query": q
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)

sqlContext.createDataFrame(es_rdd).collect()
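Note that the read configuration is just a dict of strings, so the query can be validated as JSON before Spark ever sees it. A minimal plain-Python sketch (the names mirror the snippet above; the Spark call itself still needs the jar and a live cluster):

```python
import json

# Build the same match_all query as a Python object, then serialize it.
# Serializing with json.dumps guarantees the string handed to
# es.query is well-formed JSON.
query = {"query": {"match_all": {}}}

es_read_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "movies/movie",  # index/type from step 2
    "es.query": json.dumps(query),
}

# Round-trip check: the configured query parses back to the original object.
assert json.loads(es_read_conf["es.query"]) == query
```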
With that, I finally got a successful result.