I wrote a method that uses a random number to simulate a draw from a Bernoulli distribution. I use `random.nextDouble` to generate a number between 0 and 1, and then make my decision based on that value, given my probability parameter.
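In other words, the decision logic boils down to something like this (a simplified sketch, not my actual code; `p` stands in for my probability parameter):

```scala
import scala.util.Random

// Bernoulli draw: returns true with probability p.
def bernoulli(p: Double, rng: Random): Boolean =
  rng.nextDouble() <= p
```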
My problem is that Spark generates the same random numbers in every iteration of my mapping loop. I am using the DataFrame API. My code follows this format:
```scala
val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)

for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF.map { row =>
    RowFactory.create(
      row.getString(0),
      myClass.myMethod(row.getString(2), rand.nextDouble())
    )
  }, myDF.schema)
}
```
Here is the class:
```scala
class MyClass extends Serializable {
  val q = qProb

  def myMethod(s: String, rand: Double) = {
    if (rand <= q) {
      // do something
    } else {
      // do something else
    }
  }
}
```
I need a new random number every time `myMethod` is called. I also tried generating the number inside my method using `java.util.Random` (`scala.util.Random` before Scala 2.11 does not extend `Serializable`), as shown below, but I still get the same numbers in every loop:
```scala
// Seeded from the row's string, so the same input string
// always reproduces the same sequence.
val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
```
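In hindsight that part seems expected: seeding from the row's string means the same input always yields the same first draw, which a quick local check (plain Scala, no Spark needed) confirms:

```scala
// Two generators seeded from the same string produce identical values.
val s = "some row value"
val a = new java.util.Random(s.hashCode.toLong).nextDouble()
val b = new java.util.Random(s.hashCode.toLong).nextDouble()
assert(a == b) // always holds: the "random" value is a function of s
```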
I did some research, and it seems to be related to the deterministic nature of Spark: my understanding is that the `Random` instance is captured in the map closure, serialized on the driver, and shipped to every task, so each task deserializes the same RNG state and replays the same sequence.
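If that is the cause, one direction I have been considering (a sketch only, untested, and not from my original code; it mirrors my loop above but builds rows with `Row(...)` instead of `RowFactory` and uses the standard RDD method `mapPartitionsWithIndex`) is to create the RNG on the executors, one per partition, instead of shipping a driver-side generator:

```scala
import org.apache.spark.sql.Row

for (m <- 1 to M) {
  val newRDD = myDF.rdd.mapPartitionsWithIndex { (partIdx, rows) =>
    // One RNG per partition, seeded from the app seed, the loop index,
    // and the partition index, so every task draws a different sequence.
    val rng = new java.util.Random(myAppSeed + m.toLong * 1000 + partIdx)
    rows.map { row =>
      Row(row.getString(0), myClass.myMethod(row.getString(2), rng.nextDouble()))
    }
  }
  val newDF = sqlContext.createDataFrame(newRDD, myDF.schema)
}
```

Here no serialized RNG state is reused across tasks, since each task seeds its own generator from values that differ per partition and per loop iteration.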
scala random apache-spark spark-dataframe
Brian vanover