You can use a combination of a window function and an aggregation for this. The window function gets the next col2 value (using col1 for ordering). The aggregation then counts how many times the current value differs from the next one, i.e. how often the value changes. This is implemented in the code below:
val data = Seq( ("row_1", "item_1"), ("row_2", "item_1"), ("row_3", "item_2"), ("row_4", "item_1"), ("row_5", "item_2"), ("row_6", "item_1")).toDF("col1", "col2") import org.apache.spark.sql.expressions.Window val q = data. withColumn("col2_next", coalesce(lead($"col2", 1) over Window.orderBy($"col1"), $"col2")). groupBy($"col2"). agg(sum($"col2" =!= $"col2_next" cast "int") as "col3") scala> q.show 17/08/22 10:15:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. +------+----+ | col2|col3| +------+----+ |item_1| 2| |item_2| 2| +------+----+