I understand that the partitionBy function partitions my data by key. If I use rdd.partitionBy(100), it will split my data by key into 100 partitions; that is, records that share the same key will be grouped together in the same partition.
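For concreteness, here is a minimal sketch of what I mean (the toy data, the app name, and the local[4] master are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-demo")  # local master, just for the example

    # partitionBy only works on a pair RDD, i.e. (key, value) tuples
    pairs = sc.parallelize([(i % 10, i) for i in range(1000)])

    # hash-partition into 100 partitions: every pair whose key hashes
    # to the same bucket lands in the same partition
    partitioned = pairs.partitionBy(100)

    # show which keys ended up in which partition
    print(partitioned.glom().map(lambda part: sorted({k for k, _ in part})).collect())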
- Do I understand this correctly?
- Is it desirable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
- What should I do if my data is not in key-value format? Can I still use this function?
- Let's say my data consists of (serial_number_of_student, student_name) pairs. Can I partition the data by student_name instead of serial_number? (See the sketch after this list.)
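To illustrate the last two questions, here is a rough sketch of what I am trying; the records are invented, and the keyBy call and the re-keying map are only my guesses at the approach (continuing with the same SparkContext sc as above):

    # plain records, not (key, value) pairs
    rows = sc.parallelize(["alice,1", "bob,2", "alice,3", "carol,4"])

    # one guess: keyBy turns each record into a (key, record) pair,
    # after which partitionBy can be applied
    keyed = rows.keyBy(lambda line: line.split(",")[0])

    # for (serial_number, student_name) tuples, swap the elements so that
    # student_name comes first, since partitionBy always partitions on
    # the first element of the pair
    students = sc.parallelize([(1, "alice"), (2, "bob"), (3, "alice")])
    by_name = students.map(lambda rec: (rec[1], rec[0])).partitionBy(4)

    print(by_name.glom().collect())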
python partitioning apache-spark pyspark rdd
user2543622