Calculating the exact median in a distributed environment requires some effort, so let's take something simpler: say you want the square of all the values in the RDD. Call this method squares and assume that it should work as follows:
assert rdd.squares().collect() == rdd.map(lambda x: x * x).collect()
1. Change the definition of pyspark.RDD (monkey-patching):
from pyspark import RDD

def squares(self):
    return self.map(lambda x: x * x)

RDD.squares = squares

rdd = sc.parallelize([1, 2, 3])
assert rdd.squares().collect() == [1, 4, 9]
Note: since you change the class definition itself, every instance (existing and future) gains access to squares.
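The same monkey-patching pattern can be demonstrated without a running SparkContext. The Box class below is a hypothetical stand-in for RDD (it just wraps a list and supports map and collect), used only to show that patching the class reaches all instances:

```python
class Box:
    """Hypothetical stand-in for RDD: wraps a list, supports map/collect."""
    def __init__(self, values):
        self.values = values

    def map(self, f):
        return Box([f(x) for x in self.values])

    def collect(self):
        return self.values

def squares(self):
    return self.map(lambda x: x * x)

# Patch the class: every Box instance, past and future, gains squares.
Box.squares = squares

a = Box([1, 2, 3])
b = Box([4, 5])
print(a.squares().collect())  # [1, 4, 9]
print(b.squares().collect())  # [16, 25]
```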
2. Create a subclass of RDD:
class RDDWithSquares(RDD):
    def squares(self):
        return self.map(lambda x: x * x)

rdd = sc.parallelize([1, 2, 3])
rdd.__class__ = RDDWithSquares
Reassigning __class__ like this is a dirty hack; in practice you should construct the RDD as an instance of the subclass in the first place (see, for example, how context.parallelize builds its result).
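The cleaner alternative can be sketched with the same hypothetical Box stand-in: construct instances of the subclass directly, and have map return type(self) so the extra method survives chained transformations. (With the real pyspark.RDD, transformations return plain RDDs, which is exactly why subclassing it is awkward in practice.)

```python
class Box:
    """Hypothetical stand-in for RDD."""
    def __init__(self, values):
        self.values = values

    def map(self, f):
        # Return the caller's own class so subclass methods survive map.
        return type(self)([f(x) for x in self.values])

    def collect(self):
        return self.values

class BoxWithSquares(Box):
    def squares(self):
        return self.map(lambda x: x * x)

# Construct the subclass directly instead of reassigning __class__.
rdd = BoxWithSquares([1, 2, 3])
print(rdd.squares().collect())            # [1, 4, 9]
print(rdd.squares().squares().collect())  # chaining works: [1, 16, 81]
```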
3. Add the method to a single instance:
import types
rdd = sc.parallelize([1, 2, 3])
rdd.squares = types.MethodType(squares, rdd)
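Unlike option 1, types.MethodType binds the function to one object only; other instances are unaffected. A self-contained sketch, again using a hypothetical Box stand-in for RDD:

```python
import types

class Box:
    """Hypothetical stand-in for RDD."""
    def __init__(self, values):
        self.values = values

    def map(self, f):
        return Box([f(x) for x in self.values])

    def collect(self):
        return self.values

def squares(self):
    return self.map(lambda x: x * x)

a = Box([1, 2, 3])
b = Box([1, 2, 3])
a.squares = types.MethodType(squares, a)  # bind to this instance only

print(a.squares().collect())   # [1, 4, 9]
print(hasattr(b, "squares"))   # False: b is unaffected
```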
Disclaimer
Although all of the above works, none of it is a particularly good idea: monkey-patching and __class__ tricks are fragile and hard to maintain. If the goal is simply to compose transformations, it is cleaner to keep them as standalone functions and combine them with currying and pipes:
from toolz import pipe

pipe(
    sc.parallelize([1, 2, 3]),
    squares,
    lambda rdd: rdd.collect())
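If toolz is not installed, the same left-to-right composition can be sketched in a few lines with functools.reduce (a minimal, hypothetical pipe, not the toolz implementation):

```python
from functools import reduce

def pipe(value, *funcs):
    # Thread value through each function, left to right, like toolz.pipe.
    return reduce(lambda acc, f: f(acc), funcs, value)

def squares(xs):
    return [x * x for x in xs]

result = pipe([1, 2, 3], squares, sum)
print(result)  # 14  (1 + 4 + 9)
```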