PySpark: partitioning data using partitionBy

I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100), it will split my data by key into 100 partitions, i.e. data associated with similar keys will be grouped together.
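For context on the mechanics: PySpark's default partitioner assigns each key to a partition by hashing it. The following is a pure-Python sketch of that idea, not actual Spark code (the function name pick_partition and the sample pairs are invented for illustration; Spark uses its own portable_hash internally):

```python
# A pure-Python sketch of hash partitioning (not actual Spark internals).
# Conceptually, Spark's default partitioner does:
#   partition = portable_hash(key) % numPartitions

def pick_partition(key, num_partitions):
    """Assign a key to one of num_partitions buckets by hashing it."""
    return hash(key) % num_partitions

pairs = [("alice", 1), ("bob", 2), ("alice", 3)]
assignments = {k: pick_partition(k, 100) for k, _ in pairs}
# Equal keys always land in the same partition; "similar" keys do not
# get grouped -- only identical ones do.
```

This is why records with the same key end up together, as the accepted answer below explains in more detail.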

  • Do I understand this correctly?
  • Is it desirable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
  • What if my data is not in a key-value format — can I still use this function?
  • Say my data consists of serial_number_of_student, student_name. Can I partition my data by student_name instead of serial_number?
python partitioning apache-spark pyspark rdd
2 answers
  • Not really. Spark, including PySpark, uses hash partitioning by default. Apart from identical keys, there is no practical similarity between keys assigned to a single partition.
  • There is no simple answer. It all depends on the amount of data and the available resources. Too many or too few partitions will both degrade performance.

    Some resources claim that the number of partitions should be about twice the number of available cores. On the other hand, a single partition usually should not contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).

    Finally, you must account for possible data skew. If some keys are overrepresented in your dataset, this can lead to suboptimal use of resources and potential failures.

  • No, or at least not directly. You can use the keyBy method to convert an RDD to the required format. Moreover, any Python object can be treated as a key-value pair as long as it implements the required methods and behaves like an Iterable of length two. See How to determine if an object is a valid key-value pair in PySpark.

  • It depends on the type. As long as the key is hashable*, then yes. This usually means it must be an immutable structure, and all the values it contains must be immutable as well. For example, a list is not a valid key, but a tuple of integers is.
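To illustrate the keyBy point above: keyBy(f) simply turns each element x into the pair (f(x), x). Here is a minimal sketch of the same transformation over a plain Python list (the sample records are invented; with a real RDD you would write rdd.keyBy(lambda r: r[1]) instead of the list comprehension):

```python
# What rdd.keyBy(f) does, sketched over a plain list instead of an RDD.
# Records: (serial_number_of_student, student_name), as in the question.
records = [(1, "Ada"), (2, "Grace"), (3, "Ada")]

def key_by(f, data):
    """Turn each element x into an (f(x), x) pair, like RDD.keyBy."""
    return [(f(x), x) for x in data]

# Key by student_name (the second field), not serial_number:
keyed = key_by(lambda r: r[1], records)
# keyed == [("Ada", (1, "Ada")), ("Grace", (2, "Grace")), ("Ada", (3, "Ada"))]
# The keyed pairs can then be fed to partitionBy; a string is hashable,
# so student_name is a valid key.
```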

* To quote the Python glossary:

An object is hashable if it has a hash value that never changes during its lifetime (it needs a __hash__() method) and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.
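The glossary rule can be checked directly: a tuple of immutables hashes fine, while a list — or a tuple that contains a list — raises TypeError. A quick illustration:

```python
# Hashable vs. unhashable keys, per the glossary definition quoted above.
tuple_hash = hash((1, 2, 3))   # a tuple of integers is hashable

try:
    hash([1, 2, 3])            # a list is mutable, hence unhashable
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple containing a mutable value is also unhashable:
try:
    hash((1, [2, 3]))
    nested_list_is_hashable = True
except TypeError:
    nested_list_is_hashable = False
```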


I recently used partitionBy. I restructured my data so that all the records I want placed in the same partition share the same key, which in turn is a value from the data. My data was a list of dictionaries, which I converted to tuples with the key taken from the dictionary. At first, partitionBy did not keep the same keys in the same partition. Then I realized the keys were strings, so I cast them to int, but the problem persisted: the numbers were very large. I then mapped these numbers to small integer values, and it worked. So my takeaway is that the keys need to be small integers.
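The remapping this answer describes can be sketched as follows (pure Python, illustrative only; the sample records and variable names are invented). Note that string and large-integer keys are perfectly valid for partitionBy in general; remapping to dense small integers is just one way to get explicit control over which partition each key lands in, since key % numPartitions then becomes predictable:

```python
# Map arbitrary (large or string) keys to small consecutive integers,
# as the answer above describes, before calling partitionBy.
records = [("9781234567", "a"), ("9789999999", "b"), ("9781234567", "c")]

# Build a dense mapping: each distinct original key -> 0, 1, 2, ...
key_to_small_int = {}
for key, _ in records:
    if key not in key_to_small_int:
        key_to_small_int[key] = len(key_to_small_int)

remapped = [(key_to_small_int[k], v) for k, v in records]
# remapped == [(0, "a"), (1, "b"), (0, "c")]
# With numPartitions >= number of distinct keys, small_key % numPartitions
# places every distinct key in its own partition.
```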

