How to apply a flatMap function on a GroupedDataSet in Apache Flink

I want to apply a function via flatMap to each group created by DataSet.groupBy . When I attempt to call flatMap , I get a compiler error:

 error: value flatMap is not a member of org.apache.flink.api.scala.GroupedDataSet 

My code is:

 var mapped = env.fromCollection(Array[(Int, Int)]())
 var groups = mapped.groupBy("myGroupField")
 groups.flatMap( myFunction: (Int, Array[Int]) => Array[(Int, Array[(Int, Int)])] ) // error: GroupedDataSet has no member flatMap

Indeed, the flink-scala 0.9-SNAPSHOT documentation lists no flatMap , map , or similar method on GroupedDataSet . Is there an equivalent method I can use? How can I apply a function to each group separately, in a distributed fashion across the nodes?

+7
scala hadoop apache-flink
1 answer

You can use reduceGroup(GroupReduceFunction f) to process all elements of a group. A GroupReduceFunction receives an Iterable over all elements of the group and a Collector through which it can emit an arbitrary number of result elements.

Flink's groupBy() function does not combine several elements into a single element, i.e. it does not convert a group of (Int, Int) elements (which all share the same value in the tuple's _1 field) into a single (Int, Array[Int]) . Instead, the DataSet[(Int, Int)] is only logically grouped, so that all elements with the same key can be processed together. When you apply a GroupReduceFunction to a GroupedDataSet , the function is called once for each group. In each call, all elements of the group are handed to the function together. The function can then process all elements of the group and, for example, convert the group of (Int, Int) elements into a single (Int, Array[Int]) element.
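A minimal sketch of this approach (the sample data and the use of a positional key groupBy(0) instead of "myGroupField" are illustrative assumptions, not taken from the question):

```scala
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object GroupToArrayExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Sample data: the first tuple field is the grouping key.
    val mapped: DataSet[(Int, Int)] =
      env.fromCollection(Seq((1, 10), (1, 11), (2, 20)))

    // reduceGroup is invoked once per group: `it` iterates over all
    // elements of the group, `out` may emit any number of results.
    val grouped: DataSet[(Int, Array[Int])] =
      mapped
        .groupBy(0) // group by the first tuple field
        .reduceGroup {
          (it: Iterator[(Int, Int)], out: Collector[(Int, Array[Int])]) =>
            val elems = it.toArray
            // Collapse the group into one (key, values) element.
            out.collect((elems.head._1, elems.map(_._2)))
        }

    grouped.print()
  }
}
```

Here each group of (Int, Int) tuples is collapsed into one (Int, Array[Int]) element, which is exactly the transformation the question describes.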

+4
