Python is a dynamically typed language, and PySpark does not use any special type for key-value pairs. The only requirement for an object to be considered valid for PairRDD operations is that it can be unpacked as follows:
k, v = kv
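For example, an ordinary two-element tuple already satisfies this. A minimal sketch, assuming an active SparkContext available as sc:

pairs = sc.parallelize([("foo", 1), ("foo", 2), ("bar", 0)])
pairs.keys().collect()    # ['foo', 'foo', 'bar']
pairs.values().collect()  # [1, 2, 0]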
Typically you would use a two-element tuple because of its semantics (a fixed-size immutable object) and its similarity to the Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:
key_value.py
class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yielding k then v lets the object be unpacked as k, v = obj
        for x in [self.k, self.v]:
            yield x
from operator import add
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
and make an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
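For example, a namedtuple is itself a plain tuple (so it unpacks, and __len__ / __getitem__ come for free) while still giving readable field access. A minimal sketch, assuming an active SparkContext sc; the Record name is just illustrative:

from collections import namedtuple
from operator import add

# A namedtuple unpacks like any other two-element tuple.
Record = namedtuple("Record", ["k", "v"])

rdd = sc.parallelize(
    [Record("foo", 1), Record("foo", 2), Record("bar", 0)])
rdd.reduceByKey(add).collect()  # e.g. [('foo', 3), ('bar', 0)]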
Also, type(rdd.take(1)) is always list, since take(n) returns a plain Python list of (at most) n elements regardless of the element type.
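A quick check in the shell:

type(rdd.take(1))  # <class 'list'>
len(rdd.take(1))   # 1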