Is Spark zipWithIndex safe with parallel implementation?

If I have a file and I create an RDD where each line is zipped with an index via zipWithIndex,

([row1, id1001, name, address], 0) ([row2, id1001, name, address], 1) ... ([row100000, id1001, name, address], 99999)

Can I rely on getting the same index order if I reload the file? Since the work happens in parallel, could the lines be split across partitions differently?
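Roughly, the setup looks like this (a minimal sketch, assuming an existing SparkContext `sc`, e.g. in spark-shell, and a placeholder file `rows.txt`):

```scala
// Each line of the file becomes one RDD element; zipWithIndex pairs
// it with a 0-based Long index.
val indexed = sc.textFile("rows.txt").zipWithIndex()
// indexed: RDD[(String, Long)]
indexed.take(3).foreach(println)
```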

1 answer

An RDD can be sorted, and then it has a defined order. That order is what .zipWithIndex() uses to create the indices.

Whether you get the same order each time depends on what calls come earlier in your program. The docs mention that .groupBy() can destroy the order or produce a different ordering, and there may be other calls that do the same; a sketch of the fragile pattern follows.
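A minimal sketch of that pattern (assuming an existing SparkContext `sc`; the data is made up). Nothing here forces an order, so the assigned indices may change if the RDD is recomputed:

```scala
// groupBy gives no ordering guarantee for elements within a partition,
// so the indices produced by zipWithIndex are not stable across runs.
val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  .groupBy { case (key, _) => key } // order within a partition not guaranteed
  .zipWithIndex()                   // these indices may change on recomputation
grouped.collect().foreach(println)
```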

I assume you can always call .sortBy() before calling .zipWithIndex() if you need to guarantee a specific order.
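For example, a sketch under the same assumptions (an existing `sc` and a hypothetical `rows.txt`):

```scala
// Sort first to fix a total order, then index; the indices are now
// reproducible across reloads, as long as the data itself is unchanged.
val stable = sc.textFile("rows.txt")
  .sortBy(identity)   // here the whole line serves as the sort key
  .zipWithIndex()
```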

This is explained in the .zipWithIndex() Scala API docs:

public RDD<scala.Tuple2<T,Object>> zipWithIndex()

Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex, but it uses Long instead of Int as the index type. This method needs to trigger a Spark job when this RDD contains more than one partition.

Note that some RDDs, such as those returned by groupBy(), do not guarantee the order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
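To make the quoted rule concrete, here is a small sketch (again assuming an existing `sc`) with two explicit partitions; the index runs through partition 0 first, then partition 1:

```scala
// Two explicit partitions: elements 0-1 land in partition 0, 2-3 in partition 1.
val rdd = sc.parallelize(Seq("p0-a", "p0-b", "p1-a", "p1-b"), numSlices = 2)
rdd.zipWithIndex().collect().foreach(println)
// (p0-a,0), (p0-b,1)  <- partition 0
// (p1-a,2), (p1-b,3)  <- partition 1
```

And if you hold key-value pairs, sortByKey() before zipWithIndex() plays the same role as sortBy() in the sketch above.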

