I have a 2 MB text file in /usr/local/share/data/, and I run the following code against it in Apache Spark.
from pyspark import SparkConf, SparkContext
from nltk import word_tokenize

conf = SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

# read the data directory, tokenize, POS-tag, drop punctuation, count (word, tag) pairs
doc_rdd = sc.textFile("/usr/local/share/data/")
unigrams = doc_rdd.flatMap(word_tokenize)
step1 = unigrams.flatMap(word_pos_tagging)
step2 = step1.filter(lambda x: filter_punctuation(x[0]))
step3 = step2.map(lambda x: (x, 1))
freq_unigrams = step3.reduceByKey(lambda x, y: x + y)
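For context, word_tokenize is NLTK's tokenizer, and word_pos_tagging and filter_punctuation are small helpers of mine. Their exact bodies are not important; they behave roughly like this sketch (the bodies below are illustrative, not my actual code):

from nltk import pos_tag
import string

def word_pos_tagging(word):
    # pos_tag expects a list of tokens; it returns a one-element list of
    # (word, tag) tuples here, which is why flatMap is used above
    return pos_tag([word])

def filter_punctuation(word):
    # keep tokens that contain at least one non-punctuation character
    return any(ch not in string.punctuation for ch in word)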
Expected Result
[((u'showing', 'VBG'), 24), ((u'Ave', 'NNP'), 1), ((u'Scrilla364', 'NNP'), 1), ((u'internally', 'RB'), 4), ...]
But it takes a very long time (about 6 minutes) to return the expected result, and it seems to stall at the reduceByKey step. How can I solve this performance problem?
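I suspect the per-word NLTK calls might be the bottleneck rather than the shuffle itself. Would batching the tagging per partition, along the lines of the sketch below, be the right direction? (This is an untested sketch; tag_partition is just a name I made up.)

from nltk import pos_tag

def tag_partition(words):
    # tag all tokens of one partition with a single pos_tag call
    # instead of calling the tagger once per word
    return pos_tag(list(words))

step1 = unigrams.mapPartitions(tag_partition)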
Hardware Specification
Model Name: MacBook Air
Model Identifier: MacBookAir4,2
Processor Name: Intel Core i7
Processor Speed: 1.8 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 4 GB
Log
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:0+873602
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:873602+873602
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:53478 in memory (size: 4.1 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:53478 in memory (size: 4.6 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 4
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 3
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:53478 in memory (size: 3.9 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 2
15/10/02 16:10:05 INFO PythonRDD: Times: total = 292892, boot = 8, init = 275, finish = 292609
15/10/02 16:10:05 INFO Executor: Finished task 1.0 in stage 3.0 (TID 4). 2373 bytes result sent to driver
15/10/02 16:10:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 292956 ms on localhost (1/2)
15/10/02 16:10:35 INFO PythonRDD: Times: total = 322562, boot = 5, init = 276, finish = 322281
15/10/02 16:10:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2373 bytes result sent to driver
15/10/02 16:10:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 322591 ms on localhost (2/2)