GroupByKey with datasets in Spark 2.0 using Java

Question


I have a dataset containing the following data:

| c1 | c2 |
-----------
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |

...

Now I want the data grouped as follows (c1: String key, c2: List):

| c1 | c2      |
----------------
| 1  | a, b, c |
| 2  | a, b    |
...
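To pin down the target shape independently of Spark, the same grouping can be sketched with plain Java collections. This is only an illustration of the intended result, not Spark code; the class name GroupingSketch and the group helper are hypothetical names introduced for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingSketch {

    // Groups (c1, c2) pairs by c1, collecting the c2 values of each key
    // into a list in encounter order. Hypothetical helper, not a Spark API.
    public static Map<Integer, List<String>> group(List<Map.Entry<Integer, String>> rows) {
        Map<Integer, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> row : rows) {
            grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                   .add(row.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Same rows as the input table above.
        List<Map.Entry<Integer, String>> rows = Arrays.asList(
            new SimpleEntry<Integer, String>(1, "a"),
            new SimpleEntry<Integer, String>(1, "b"),
            new SimpleEntry<Integer, String>(1, "c"),
            new SimpleEntry<Integer, String>(2, "a"),
            new SimpleEntry<Integer, String>(2, "b"));

        System.out.println(group(rows)); // prints {1=[a, b, c], 2=[a, b]}
    }
}
```

In Spark the rows are distributed, so the order of values within each list is not guaranteed the way it is in this single-threaded sketch.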

I thought groupByKey would be a sufficient solution, but I cannot find any example of how to use it.

Can someone help me find a solution using groupByKey, or any other combination of transformations and actions, that produces this result with Datasets rather than RDDs?

+5

java group-by dataset apache-spark apache-spark-2.0

Andreas Sep 08 '16 at 12:26


3 answers

With DataFrame in Spark 2.0:

    scala> val data = List((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")
    data: org.apache.spark.sql.DataFrame = [c1: int, c2: string]

    scala> data.groupBy("c1").agg(collect_list("c2")).collect.foreach(println)
    [1,WrappedArray(a, b, c)]
    [2,WrappedArray(a, b)]

+1

J bentz Nov 18 '16 at 19:27


Assuming the table has already been read into the dataset variable:

    // Group rows by c1 and collect the c2 values of each group into a list
    Dataset<Row> datasetNew = dataset.groupBy("c1").agg(functions.collect_list("c2"));
    datasetNew.show();

0

Vijay anantharamu Dec 6 '17 at 4:59


abaghel · Accepted Answer · 2016-11-19T04:01:42+0000

Here is an example using Spark 2.0 and Java with a Dataset.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import scala.Tuple2;

    public class SparkSample {
        public static void main(String[] args) {
            // SparkSession
            SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();

            // Input data
            List<Tuple2<Integer, String>> inputList = new ArrayList<Tuple2<Integer, String>>();
            inputList.add(new Tuple2<Integer, String>(1, "a"));
            inputList.add(new Tuple2<Integer, String>(1, "b"));
            inputList.add(new Tuple2<Integer, String>(1, "c"));
            inputList.add(new Tuple2<Integer, String>(2, "a"));
            inputList.add(new Tuple2<Integer, String>(2, "b"));

            // Build a typed Dataset from the tuples, then name the columns c1 and c2
            Dataset<Row> dataSet = spark
                .createDataset(inputList, Encoders.tuple(Encoders.INT(), Encoders.STRING()))
                .toDF("c1", "c2");
            dataSet.show();

            // Group by c1 and collect the c2 values of each group into a list
            Dataset<Row> dataSet1 = dataSet
                .groupBy("c1")
                .agg(org.apache.spark.sql.functions.collect_list("c2"))
                .toDF("c1", "c2");
            dataSet1.show();

            // Stop the session
            spark.stop();
        }
    }
