Spark local vs HDFS performance

I have a Spark cluster and HDFS on the same machines. I copied a text file of roughly 3 GB to each machine's local file system and to the distributed HDFS file system.

I have a simple pyspark word count program.

If I submit the program reading the file from the local file system, it takes about 33 seconds. If I submit the program reading the file from HDFS, it takes about 46 seconds.

Why? I expected exactly the opposite result.

Added after sgvd's request:

16 workers, 1 master

Spark Standalone without special settings (HDFS replication factor 3)

Version 1.5.2

    import sys
    sys.path.insert(0, '/usr/local/spark/python/')
    sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
    import os
    os.environ['SPARK_HOME'] = '/usr/local/spark'
    os.environ['JAVA_HOME'] = '/usr/local/java'
    from pyspark import SparkContext
    #conf = pyspark.SparkConf().set<conf settings>

    # Choose the data source based on the first command-line argument.
    if sys.argv[1] == 'local':
        print 'Executing in local file mode'
        sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
        rdd = sc.textFile('/root/test2')
    else:
        print 'Executing in HDFS mode'
        sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
        rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')

    # Word count, then take the five most frequent words.
    rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
    topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
    print topFive
Tags: performance, hadoop, apache-spark
3 answers

This is a bit counterintuitive, but since the replication factor is 3 and you have 16 nodes, each node stores on average about 20% of the data locally in HDFS. So about 6 worker nodes should, on average, be enough to read the entire file without any network transfer.

If you plot running time against the number of worker nodes, you should notice that beyond about 6 workers there is no difference between reading from the local FS and from HDFS.

The above calculation can be done with variables, e.g. x = number of worker nodes and y = replication factor. Reading from the local FS requires the file to be on every node, which is effectively y = x; with HDFS, once you use roughly x/y workers there is no difference either. This is exactly what you are observing, even though it seems counterintuitive at first. Would you really use a 100% replication factor in production?
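A back-of-the-envelope sketch of that calculation in plain Python, using the 16 nodes and replication factor 3 from the question (it assumes blocks are spread evenly across the cluster, which is a simplification, not a locality simulation):

    import math

    def local_fraction_per_worker(num_workers, replication):
        # Average share of the file each worker stores locally in HDFS.
        return float(replication) / num_workers

    def workers_to_cover_file(num_workers, replication):
        # Roughly how many workers are needed so that, between them,
        # every block has a local copy (assuming an even spread).
        return int(math.ceil(float(num_workers) / replication))

    # With 16 workers and replication 3:
    #   local_fraction_per_worker(16, 3) -> 0.1875, i.e. about 20%
    #   workers_to_cover_file(16, 3)     -> 6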


What are the parameters specific to the Executor, Driver, and RDD (with respect to spilling and storage level)?

From Spark documentation

Performance impact

The shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.

Certain shuffle operations can consume significant amounts of heap memory, since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory, Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
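To make the map-side structures concrete, here is a hedged sketch in the spirit of the word count from the question. It assumes an existing SparkContext named sc and reuses the HDFS path from the question; it is an illustration, not a tuning recommendation:

    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')
    pairs = rdd.flatMap(lambda line: line.split(' ')).map(lambda w: (w, 1))

    # reduceByKey combines values on the map side before the shuffle,
    # so the in-memory tables (and any spills) stay relatively small.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # groupByKey ships every (word, 1) pair across the network and builds
    # per-key collections on the reduce side, which is where the memory
    # pressure described above tends to show up.
    counts_grouped = pairs.groupByKey().mapValues(sum)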

I'm interested in the memory/CPU-core limits for a Spark job versus the memory/CPU-core limits for Map and Reduce tasks.

Key parameters on the Hadoop/MapReduce side:

    yarn.nodemanager.resource.cpu-vcores
    mapreduce.map.cpu.vcores
    mapreduce.reduce.cpu.vcores
    mapreduce.map.memory.mb
    mapreduce.reduce.memory.mb
    mapreduce.reduce.shuffle.memory.limit.percent

The equivalent Spark parameters to compare against Hadoop:

    spark.driver.memory
    spark.driver.cores
    spark.executor.memory
    spark.executor.cores
    spark.memory.fraction

These are just some of the key parameters; see the full Spark configuration and MapReduce configuration lists for details.

Without the right set of parameters on both sides, we cannot compare the performance of jobs across the two technologies.
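As a purely illustrative sketch of how the Spark side of such a comparison might be pinned down: the values below are placeholders, not recommendations, and in practice driver memory is usually passed to spark-submit rather than set in application code.

    from pyspark import SparkConf, SparkContext

    # Placeholder values: choose them to mirror the MapReduce settings above,
    # e.g. spark.executor.cores vs mapreduce.map.cpu.vcores and
    # spark.executor.memory vs mapreduce.map.memory.mb, so both jobs get
    # comparable resources.
    conf = (SparkConf()
            .setMaster('spark://192.168.2.11:7077')
            .setAppName('Word count with pinned resources')
            .set('spark.driver.memory', '2g')
            .set('spark.driver.cores', '1')
            .set('spark.executor.memory', '2g')
            .set('spark.executor.cores', '2')
            # spark.memory.fraction applies to the unified memory manager
            # introduced in Spark 1.6; Spark 1.5.x uses the older
            # spark.storage.memoryFraction / spark.shuffle.memoryFraction.
            .set('spark.memory.fraction', '0.6'))

    sc = SparkContext(conf=conf)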


This is because the data is distributed, and a single plain text file is not a good fit. There are better alternatives, such as Parquet; if you use it, you will notice that performance improves noticeably, because the way the file is split into partitions allows your Apache Spark cluster to read those parts in parallel.
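A minimal sketch of such a conversion, assuming the Spark 1.5-era SQLContext API and reusing the master URL and HDFS input path from the question (the output path is made up for illustration):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext('spark://192.168.2.11:7077', 'Convert text to Parquet')
    sqlContext = SQLContext(sc)

    # Wrap each line in a Row so the RDD can become a one-column DataFrame.
    lines = sc.textFile('hdfs://192.168.2.11:9000/data/test2').map(lambda l: Row(line=l))
    df = sqlContext.createDataFrame(lines)

    # Write a partitioned, columnar copy; later jobs can read the parts in parallel.
    df.write.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')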

