Scala: what is the most suitable data structure for sorted subsets?

Question

Scala: what is the most suitable data structure for sorted subsets?

Given a large collection (let's call it "a") with elements of type T (say, a vector or a list) and the evaluation function "f" (say (T) => Double), I would like to get a set of results from 'a' ', which contains N elements' a', which lead to the highest value in f. Collection 'a' may contain duplicates. It is not sorted.

Perhaps leaving the parallelism question (map / reduce, etc.) aside for a moment, what would be the appropriate Scala data structure to compile the result collection 'b'? Thanks for any pointers / ideas.

Notes:

(1) I suggest that my use case may be most succinctly expressed as

val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example val f : (Int)=>Double = (n)=>n // evaluation function val b = a.sortBy( f ).take( N ) // sort, then clip

except that I do not want to sort the entire set.

(2) one option would be to iterate over 'a', which fills the TreeSet with a "manual" size limit (reject everything that is worse than the worst element in the set, prevent the set from expanding beyond N). However, I would like to keep duplicates present in the source set in the result set, and therefore this may not work.

(3) if the sorted multi-set is the correct data structure, is there a Scala implementation of this? Or a binary-sorted vector or an array if the result set is small enough?

+4

scala data-structures functional-programming

Gregor scheidt Oct 17 '11 at 10:04

source share

1 answer

Travis brown · Accepted Answer · 2011-10-17T11:04:52+0000

You can use the priority queue:

 def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = { val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse) val (before, after) = xs.splitAt(k) q ++= before after.foreach(x => q += ord.max(x, q.dequeue)) q.dequeueAll }

We fill the queue with the first elements of k , and then compare each additional element with the heading of the queue, changing places if necessary. This works as expected and keeps duplicates:

 scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4) res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)

And he does not sort the complete list. I have Ordering in this implementation, but adapting it to use the evaluation function will be pretty trivial.

Scala: what is the most suitable data structure for sorted subsets?

More articles: