I am trying to use Apache Spark to query my data in Elasticsearch, but my Spark job has been running the aggregation for about 20 hours and still hasn't finished. The same query runs in ES in about 6 seconds.
I understand that the data has to move from the Elasticsearch cluster to my Spark cluster, and that some data shuffling happens in Spark.
The ES index holds approximately 300 million documents, and each document has about 400 fields (1.4 TB in total).
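Since the aggregation only uses two of those ~400 fields (agent and request_type), I assume most of each document never actually needs to leave Elasticsearch. This is a sketch of what I'm considering, using the es.read.field.include setting from elasticsearch-hadoop (untested):

// Untested idea: ask the connector to fetch only the fields the query needs,
// instead of shipping all ~400 fields per document to Spark.
sparkConf.set("es.read.field.include", "agent,request_type");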
I have a 3-node Spark cluster (1 master, 2 workers) with 60 GB of memory and only 8 cores.
A runtime this long is unacceptable. Is there a way to speed up the Spark job?
Here is my Spark configuration:
SparkConf sparkConf = new SparkConf(true)
    .setAppName("SparkQueryApp")
    .setMaster("spark://10.0.0.203:7077")
    .set("es.nodes", "10.0.0.207")
    .set("es.cluster", "wp-es-reporting-prod")
    .setJars(JavaSparkContext.jarOfClass(Demo.class))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", String.valueOf(cpus * 2))
    .set("spark.executor.memory", "8g");
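I also wonder about the scroll batch size. As far as I understand, elasticsearch-hadoop reads the index with scroll requests, and es.scroll.size controls how many documents come back per request; the default is fairly small. Something like this is what I had in mind (the value is a guess, not tested):

// Guessed value: fetch larger batches per scroll request to cut down on round trips.
sparkConf.set("es.scroll.size", "5000");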
Edit:
SparkContext sparkCtx = new SparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkCtx);
DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "customer-rpts01-201510/sample");
DataFrame dfCleaned = cleanSchema(sqlContext, df);
dfCleaned.registerTempTable("RPT");
DataFrame sqlDFTest = sqlContext.sql("SELECT agent, count(request_type) FROM RPT GROUP BY agent");
for (Row row : sqlDFTest.collect()) {
    System.out.println(">> " + row);
}
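For comparison, I was also thinking of skipping the temp table and running the same aggregation through the DataFrame reader, so the connector's pushdown and column pruning can kick in. A sketch, assuming elasticsearch-hadoop 2.1+ where the org.elasticsearch.spark.sql data source and the pushdown option exist (untested):

import static org.apache.spark.sql.functions.count;

// Untested sketch: load through the data source API so Spark can tell the
// connector which columns it needs, instead of materializing every field.
DataFrame dfPruned = sqlContext.read()
    .format("org.elasticsearch.spark.sql")
    .option("pushdown", "true") // let the connector translate what it can into ES queries
    .load("customer-rpts01-201510/sample")
    .select("agent", "request_type"); // column pruning: only these two fields should be fetched

DataFrame result = dfPruned.groupBy("agent").agg(count("request_type"));
for (Row row : result.collect()) {
    System.out.println(">> " + row);
}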