Apache Spark error on reduceByKey stage

I have a 2 MB text file in /usr/local/share/data/, and I run the following code against it in Apache Spark.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

doc_rdd = sc.textFile("/usr/local/share/data/")
unigrams = doc_rdd.flatMap(word_tokenize)                 # split each line into tokens
step1 = unigrams.flatMap(word_pos_tagging)                # tag each token with its part of speech
step2 = step1.filter(lambda x: filter_punctuation(x[0]))  # drop punctuation tokens
step3 = step2.map(lambda x: (x, 1))                       # ((word, tag), 1) pairs
freq_unigrams = step3.reduceByKey(lambda x, y: x + y)     # count occurrences per (word, tag)
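The helper functions are thin NLTK wrappers, roughly equivalent to the following (simplified here, the exact bodies may differ slightly):

import string
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def word_pos_tagging(token):
    # pos_tag expects a list of tokens and returns [(token, tag)]
    return pos_tag([token])

def filter_punctuation(token):
    # keep tokens that contain at least one non-punctuation character
    return any(ch not in string.punctuation for ch in token)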

Expected Result

 [((u'showing', 'VBG'), 24), ((u'Ave', 'NNP'), 1), ((u'Scrilla364', 'NNP'), 1), ((u'internally', 'RB'), 4), ...] 

But it takes a very long time (about 6 minutes) to return the expected word counts, and the job appears to be stuck in the reduceByKey stage. How can I solve this performance problem?


Hardware Specification

Model name: MacBook Air
Model identifier: MacBookAir4,2
Processor name: Intel Core i7
Processor speed: 1.8 GHz
Number of processors: 1
Total number of cores: 2
L2 cache (per core): 256 KB
L3 cache: 4 MB
Memory: 4 GB

Log

15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:0+873602
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:873602+873602
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:53478 in memory (size: 4.1 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:53478 in memory (size: 4.6 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 4
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 3
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:53478 in memory (size: 3.9 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 2
15/10/02 16:10:05 INFO PythonRDD: Times: total = 292892, boot = 8, init = 275, finish = 292609
15/10/02 16:10:05 INFO Executor: Finished task 1.0 in stage 3.0 (TID 4). 2373 bytes result sent to driver
15/10/02 16:10:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 292956 ms on localhost (1/2)
15/10/02 16:10:35 INFO PythonRDD: Times: total = 322562, boot = 5, init = 276, finish = 322281
15/10/02 16:10:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2373 bytes result sent to driver
15/10/02 16:10:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 322591 ms on localhost (2/2)
performance python apache-spark pyspark
1 answer

The code looks fine.

You can try several options to improve performance.

 SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g") 

local → local[*]: when the work is split into tasks, Spark can then use as many cores as are available on the machine. And, if possible, increase the memory available to the program.
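A minimal sketch of such a configuration (the values and the script name are only examples; note that in local mode the executor runs inside the driver JVM, so it is the driver's memory that needs to be raised, and that has to happen before the JVM starts, e.g. via spark-submit):

from pyspark import SparkConf, SparkContext

# One worker thread per available core in local mode.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("test"))
sc = SparkContext(conf=conf)

# Raise memory on the command line, before the driver JVM starts, e.g.:
#   spark-submit --master "local[*]" --driver-memory 3g your_script.py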

P.S. To really evaluate Spark, you need a reasonably large amount of data, so that it makes sense to run it on a cluster.

