(PySpark) Nested lists after reduceByKey

I am sure this is something very simple, but I did not find anything related to this.

My code is simple:

... 
stream = stream.map(mapper) 
stream = stream.reduceByKey(reducer) 
... 

Nothing unusual. The output looks like this:

... 
key1  value1 
key2  [value2, value3] 
key3  [[value4, value5], value6] 
... 

And so on. So sometimes I get a flat value (if there is only a single one). Sometimes there are nested lists that can be very, very deep (with my simple test data it was three levels deep).

I tried to search for something like "flat" in the sources - but I only found the flatMap method, which (as I understand it) is not what I need.

I do not know why these lists are nested. My guess is that they were handled by different processes (workers?) and then joined together without flattening.

Of course, I can write Python code that unfolds these lists and flattens them. But I suspect this is not a normal situation; I would think nearly everybody needs flat output.

itertools.chain stops unfolding at the first non-iterable value it finds. In other words, it still needs some hand-written code (see the previous paragraph).
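For illustration, a recursive helper along these lines would do it (flatten is just my own sketch, not a PySpark method):

from itertools import chain

def flatten(value):
    # Recursively unfold nested lists; anything that is not a list
    # (including strings) is treated as a leaf value.
    if isinstance(value, list):
        return list(chain.from_iterable(flatten(v) for v in value))
    return [value]

print(flatten([["value4", "value5"], "value6"]))
# ['value4', 'value5', 'value6']

But writing this by hand is exactly what I would like to avoid.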

So: how do I flatten these lists using PySpark's native methods?


The problem is in your reducer function. reduceByKey calls the reducer on pairs of values and expects it to return a single value of the same type, because the result is fed back into later calls.

Consider, for example, a word count. You first map each word to (word, 1), and then reduceByKey(lambda x, y: x + y) sums up the counts for each word. In the end you get an RDD of (word, count) pairs.

Here is the example from the PySpark API documentation:

>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]

Judging by your output, your reducer returns a list rather than a single plain value, so each call wraps the already-accumulated result in yet another list:

reduce(reduce(reduce(firstValue, secondValue), thirdValue), fourthValue) ...
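My guess (the question does not show the actual reducer) is that it looks something like lambda x, y: [x, y]. You can reproduce the nesting with plain Python:

from functools import reduce

def bad_reducer(x, y):
    # Wraps both arguments in a new list on every call, so the
    # accumulated result gets nested one level deeper each time.
    return [x, y]

print(reduce(bad_reducer, ["value4", "value5", "value6"]))
# [['value4', 'value5'], 'value6']  -- the same shape as key3 above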

If what you actually want is a list of all values for each key, use groupByKey instead; it exists for exactly that purpose.

Alternatively, have a look at combineByKey, of which reduceByKey() is a special case (reduceByKey is implemented via combineByKey).
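For example, collecting the values for each key into a flat list with combineByKey could look roughly like this (a sketch; the sample data and variable names are made up):

rdd = sc.parallelize([("key2", "value2"), ("key2", "value3"), ("key1", "value1")])

flat_lists = rdd.combineByKey(
    lambda v: [v],               # createCombiner: the first value of a key starts a new list
    lambda acc, v: acc + [v],    # mergeValue: append further values within a partition
    lambda a, b: a + b,          # mergeCombiners: concatenate partial lists across partitions
)

print(sorted(flat_lists.collect()))
# [('key1', ['value1']), ('key2', ['value2', 'value3'])]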


stream.groupByKey().mapValues(lambda x: list(x)).collect()

gives:

key1 [value1]
key2 [value2, value3]
key3 [value4, value5, value6]

(mapValues is needed because groupByKey returns an iterable, not a list.)
