Elasticsearch + Apache Spark Performance

I am trying to use Apache Spark to query my data in Elasticsearch, but my Spark job has been running for about 20 hours on the aggregation and is still not finished. The same query in ES takes about 6 seconds.

I understand that the data has to move from the Elasticsearch cluster to my Spark cluster, and that there is some data shuffling in Spark.

The data in my ES index is approximately 300 million documents, each with about 400 fields (about 1.4 TB in total).

I have a 3-node Spark cluster (1 master, 2 workers) with 60 GB of memory and only 8 cores.

This runtime is unacceptable. Is there a way to speed up the Spark job?

Here is my spark configuration:

    SparkConf sparkConf = new SparkConf(true)
            .setAppName("SparkQueryApp")
            .setMaster("spark://10.0.0.203:7077")
            .set("es.nodes", "10.0.0.207")
            .set("es.cluster", "wp-es-reporting-prod")
            .setJars(JavaSparkContext.jarOfClass(Demo.class))
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.default.parallelism", String.valueOf(cpus * 2))
            .set("spark.executor.memory", "8g");

Edit:

    SparkContext sparkCtx = new SparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(sparkCtx);
    DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
    DataFrame dfCleaned = cleanSchema(sqlContext, df);
    dfCleaned.registerTempTable("RPT");
    DataFrame sqlDFTest = sqlContext.sql("SELECT agent, count(request_type) FROM RPT GROUP BY agent");
    for (Row row : sqlDFTest.collect()) {
        System.out.println(">> " + row);
    }
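
(One incidental issue in the snippet above: collect() pulls every result row onto the driver. If the output is only needed for inspection, capping it is safer; a small tweak, assuming a sample of the results is enough:)

    // Pull at most 20 result rows to the driver instead of all of them
    for (Row row : sqlDFTest.limit(20).collect()) {
        System.out.println(">> " + row);
    }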
2 answers

I figured out what was going on. Basically, I was trying to manipulate the DataFrame schema because some of my fields contain a dot, for example user.firstname. This seems to cause a problem in Spark's collect phase. To solve it, I just had to reindex my data so that my field names use an underscore instead of a dot, for example user_firstname.
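
If reindexing is not an option, a rough alternative is to rename the columns on the Spark side before doing anything else. This is just a sketch, and it assumes the dotted names surface as flat, top-level column names rather than nested structs:

    // Sketch: replace dots with underscores in every column name.
    // toDF(String...) reassigns names positionally, which avoids
    // per-column resolution of the problematic dotted names.
    String[] cols = df.columns();
    String[] renamed = new String[cols.length];
    for (int i = 0; i < cols.length; i++) {
        renamed[i] = cols[i].replace('.', '_');
    }
    DataFrame dfRenamed = df.toDF(renamed);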


I'm afraid you will not be able to run a group by over 1.4 TB of data with only 120 GB of total memory and get good performance. The DataFrame will try to load all the data into memory/disk, and only then will it execute the group by. I do not think the Spark/ES connector currently translates SQL syntax into the ES query DSL.
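
That said, you do not have to ship all ~400 fields per document to Spark. As a sketch (assuming the elasticsearch-hadoop Spark SQL integration, which can push column projections down to ES as source filtering), select only the two fields the query touches before grouping:

    // Sketch: project the two needed columns so the connector fetches
    // only those fields from ES instead of all ~400 per document.
    DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
    DataFrame slim = df.select(df.col("agent"), df.col("request_type"));
    slim.registerTempTable("RPT");
    DataFrame grouped = sqlContext.sql(
            "SELECT agent, count(request_type) FROM RPT GROUP BY agent");

The group by itself still runs in Spark, but the data pulled from ES and shuffled shrinks from ~1.4 TB to two columns per document. If even that is too slow, running the terms aggregation natively in ES (which the 6-second timing already shows works) is the pragmatic route.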

