How to determine if an object is a valid key-value pair in PySpark

  • If I have an rdd, how can I tell whether its data is in key-value format? Is there something like type(object), which tells me an object's type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
  • Say I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to treat (1,2) and (1,3) as the values, with x and y as the keys? Or must each key map to a single value? I noticed that if I use reduceByKey with a sum function to get ((x,3),(y,4)), then it becomes much easier to treat the data as key-value pairs. (A sketch of this setup follows below.)
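A minimal sketch of the setup described above (assuming an active SparkContext named sc, as in the PySpark shell):

    rdd = sc.parallelize([("x", 1), ("x", 2), ("y", 1), ("y", 3)])

    # groupByKey keeps every value for a key; mapValues(list) just makes
    # the resulting iterable printable
    print(rdd.groupByKey().mapValues(list).collect())
    ## [('x', [1, 2]), ('y', [1, 3])]   (order may vary)

    # reduceByKey folds the values for each key into a single value
    print(rdd.reduceByKey(lambda a, b: a + b).collect())
    ## [('x', 3), ('y', 4)]   (order may vary)

    # take() always wraps its results in a plain Python list
    print(type(rdd.take(1)))
    ## <type 'list'>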
1 answer

Python is a dynamically typed language, and PySpark does not use a special type for key-value pairs. The only requirement for an object to be considered valid for PairRDD operations is that it can be unpacked as follows:

 k, v = kv 
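For instance (a minimal, plain-Python sketch, no Spark required), ordinary tuples and two-element lists both satisfy this requirement, while anything that does not yield exactly two items does not:

    # Any object yielding exactly two items on unpacking qualifies
    k, v = ("x", 1)         # a tuple works
    k, v = ["y", 3]         # a two-element list works as well

    try:
        k, v = ("x", 1, 2)  # three items cannot be unpacked into two names
    except ValueError as exc:
        print(exc)          # too many values to unpack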

Usually you should use a two-element tuple, both because of its semantics (a fixed-size immutable object) and because of its similarity to Scala's Product classes. But this is just a convention, and nothing stops you from doing something like this:

key_value.py

    class KeyValue(object):
        def __init__(self, k, v):
            self.k = k
            self.v = v

        def __iter__(self):
            for x in [self.k, self.v]:
                yield x
    from operator import add

    from key_value import KeyValue

    rdd = sc.parallelize(
        [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])

    rdd.reduceByKey(add).collect()
    ## [('bar', 0), ('foo', 3)]

and use an arbitrary class as a key-value pair. So, once again: if an object can be correctly unpacked into a pair of objects, then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
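For example, a minimal sketch of the namedtuple variant (again assuming an active SparkContext named sc):

    from collections import namedtuple
    from operator import add

    # A namedtuple is a fixed-size tuple subclass, so it unpacks into
    # exactly two objects and is therefore a valid key-value pair
    KV = namedtuple("KV", ["k", "v"])

    rdd = sc.parallelize([KV("foo", 1), KV("foo", 2), KV("bar", 0)])
    rdd.reduceByKey(add).collect()
    ## [('bar', 0), ('foo', 3)]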

On a side note, type(rdd.take(1)) returns a list of length n, so its type will always be list, no matter what the elements are.
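If the goal is to inspect the elements themselves rather than the list wrapper, one option (a sketch, assuming an active SparkContext sc) is to look at the first element instead:

    # Output shown in Python 2 style; Python 3 prints <class 'tuple'> etc.
    rdd = sc.parallelize([("x", 1), ("y", 3)])

    print(type(rdd.take(1)))      # <type 'list'>  -- take() always returns a list
    print(type(rdd.take(1)[0]))   # <type 'tuple'> -- the element inside that list
    print(type(rdd.first()))      # <type 'tuple'> -- first() skips the list wrapper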
