How to randomly select from a list or Scala array?

I want to randomly select from a list or Scala array (not RDD), the sample size can be much longer than the length of the list or array, how can I do this efficiently ? Since the sample size can be very large, and the selection (according to different lists / arrays) should be performed many times.

I know that for Spark RDD we can use takeSample () to execute it, is there an equivalent for Scala list / array?

Many thanks.

+7
arrays list scala sample apache-spark
source share
6 answers

An easy-to-understand version will look like this:

import scala.util.Random Random.shuffle(list).take(n) Random.shuffle(array.toList).take(n) // Seeded version val r = new Random(seed) r.shuffle(...) 
+23
source share

For arrays:

 import scala.util.Random import scala.reflect.ClassTag def takeSample[T:ClassTag](a:Array[T],n:Int,seed:Long) = { val rnd = new Random(seed) Array.fill(n)(a(rnd.nextInt(a.size))) } 

Create a random number generator ( rnd ) based on your seed. Then fill the array with random numbers from 0 to the size of your array.

The final step is to apply each random value to the index operator of your input array. Using it in a REPL might look like this:

 scala> val myArray = Array(1,3,5,7,8,9,10) myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10) scala> takeSample(myArray,20,System.currentTimeMillis) res0: scala.collection.mutable.ArraySeq[Int] = ArraySeq(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10, 1, 1, 3, 1, 7, 1, 3, 7) 

For lists, I would just convert the list to Array and use the same function. I doubt that you can still become much more efficient for lists.

It is important to note that the same function using lists will take O (n ^ 2) time, while first converting the list into arrays will take O (n) time

+3
source share

Using for understanding, for a given xs array as follows,

 for (i <- 1 to sampleSize; r = (Math.random * xs.size).toInt) yield a(r) 

Note that the random generator here generates values ​​within a unit interval that are scaled for a range by the size of the array and converted to Int for indexing by array.

Note For a pure functional random generator, consider, for example, the State Monad approach from Functional Programming in Scala , discussed here .

Note Consider also NICTA , another generator of a pure functional random value generator; it is used, for example, to illustrate here .

+1
source share

Using classic recursion.

 import scala.util.Random def takeSample[T](a: List[T], n: Int): List[T] = { n match { case n: Int if n <= 0 => List.empty[T] case n: Int => a(Random.nextInt(a.size)) :: takeSample(a, n - 1) } } 
+1
source share
 package your.pkg import your.pkg.SeqHelpers.SampleOps import scala.collection.generic.CanBuildFrom import scala.collection.mutable import scala.language.{higherKinds, implicitConversions} import scala.util.Random trait SeqHelpers { implicit def withSampleOps[E, CC[_] <: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc) } object SeqHelpers extends SeqHelpers { case class SampleOps[E, CC[_] <: Seq[_]](cc: CC[_]) { private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match { case 0 => builder.result case _ => val element = cc(Random.nextInt(cc.size)).asInstanceOf[E] recurse(n - 1, builder += element) } def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = { require(n >= 0, "Cannot take less than 0 samples") recurse(n, cbf.apply) } } } 

Or:

  • Mixin SeqHelpers , e.g. with Scalatest specification
  • Enable import your.pkg.SeqHelpers._

Then the following should work:

 Seq(1 to 100: _*) sample 10 foreach { println } 

Permissions to remove the throw are welcome.

Also, if there is a way to create an empty collection instance for the battery without knowing the specific type ahead of time, comment. However, the builder is probably more efficient.

0
source share

If you want to try without replacement - zip with randoms, select O(n*log(n) , cancel randoms, take

 import scala.util.Random val l = Seq("a", "b", "c", "d", "e") val ran = l.map(x => (Random.nextFloat(), x)) .sortBy(_._1) .map(_._2) .take(3) 
0
source share

All Articles