Make PRNGs Agree Across Software

I am investigating whether two software packages can be made to match on a sequence of generated pseudo-random numbers. I am as interested in understanding all the possible points of divergence as I am in finding a way to reconcile them.

Why? I work in a data store that uses many different software packages (Stata, R, Python, SAS, possibly others). There has recently been interest in QCing findings by replicating processes in another language. For any process that includes random numbers, it would be useful if we could provide a series of steps ("set this parameter", etc.) that allow two packages to be consistent. If this is not feasible, I would like to be able to articulate where the points of failure are.

A simple example:

The default random number generator in both R and Python is the Mersenne Twister. I gave both the same seed, drew a sample, and also looked at the "state" of the PRNG. None of the values agree.

R (3.2.3, 64-bit):

    set.seed(20160201)
    .Random.seed
    sample(c(1, 2, 3, 4, 5))

Python (3.5.1, 64-bit):

    import random
    random.seed(20160201)
    random.getstate()
    random.sample([1, 2, 3, 4, 5], 5)
1 answer

An old question, but perhaps useful for future readers: as mentioned in the comments, it is best to implement this yourself and provide interfaces for the different environments, so that the same results are returned for a given seed. Why is this necessary? You used sampling as an example. There are several steps involved:

  1. Seeding is a non-trivial process. For example, R scrambles the provided seed further before using it. Thus, if the tools do not use the same seeding method, they will start from a different internal state, even if the user provides the same seed.

  2. The actual RNG: although a Mersenne Twister may be used in both cases, is it really the same variant? R uses the 32-bit MT. Maybe Python uses a 64-bit version?

  3. Most RNGs give you an unsigned integer (nowadays usually 32 or 64 bits). But you need random numbers from some distribution; for sampling, for example, you need random integers in a given range. There are many ways to get from the integers produced by the RNG to the ones needed for sampling. In the case of R, you do not even have access to the raw RNG output. The most fundamental function is R_unif, which returns a double in [0, 1). Again, implementations do not always agree on how to construct such a double. And if you need other distributions (normal, exponential, ...), you will find many different algorithms for them.
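To make step 3 concrete, here is a minimal Python sketch. It assumes CPython's `random` module, where `getrandbits(32)` hands back one raw 32-bit Mersenne Twister output. All helper names are hypothetical. The 53-bit rule mirrors what CPython's `random.random()` does; the 32-bit scaling and the multiply-and-truncate index are shown only as plausible alternatives that another package might use, not as a statement of any package's exact code.

```python
import random

def double_from_53_bits(rng):
    """Combine two 32-bit outputs into a 53-bit double in [0, 1)."""
    a = rng.getrandbits(32) >> 5  # keep the top 27 bits
    b = rng.getrandbits(32) >> 6  # keep the top 26 bits
    return (a * 67108864 + b) / 9007199254740992  # (a * 2**26 + b) / 2**53

def double_from_32_bits(rng):
    """Scale a single 32-bit output into [0, 1)."""
    return rng.getrandbits(32) / 4294967296  # divide by 2**32

def index_by_scaling(rng, n):
    """Pick an index below n by multiplying a double and truncating."""
    return int(double_from_32_bits(rng) * n)

def index_by_rejection(rng, n):
    """Pick an index below n by drawing just enough bits and retrying."""
    k = n.bit_length()
    r = rng.getrandbits(k)
    while r >= n:
        r = rng.getrandbits(k)
    return r

SEED = 20160201  # seed taken from the question

# Each call gets a freshly seeded generator, so all four start from the same stream.
print(double_from_53_bits(random.Random(SEED)))
print(double_from_32_bits(random.Random(SEED)))
print(index_by_scaling(random.Random(SEED), 5))
print(index_by_rejection(random.Random(SEED), 5))
```

The two doubles agree only in their leading bits, and the two index functions consume a different number of raw bits per draw, so any sampling routine built on top of them will drift apart even when the generator and its state are identical.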

In general, there are many places where (subtle) differences can creep in.
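One of those places, seeding (step 1), can be made visible with a short sketch. It assumes CPython, whose `random.getstate()` returns a tuple containing the 624 state words followed by a position index. The function `init_genrand` is my own transcription of the simple single-integer seeding routine from the MT19937 reference code; it stands in for "some other tool's seeding method" and is not claimed to be what R or Python actually does.

```python
import random

N = 624  # number of 32-bit words in the MT19937 state

def init_genrand(seed):
    """Fill the state from a single 32-bit integer (reference-style seeding)."""
    mt = [seed & 0xFFFFFFFF]
    for i in range(1, N):
        prev = mt[-1]
        mt.append((1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF)
    return mt

SEED = 20160201  # seed taken from the question

# State that CPython builds from the integer seed.
rng = random.Random(SEED)
version, internal, gauss_next = rng.getstate()
python_words = list(internal[:N])  # the last entry of `internal` is the position index

# State that the simple seeding routine builds from the same integer.
simple_words = init_genrand(SEED)

print(python_words[:3])
print(simple_words[:3])
print(python_words == simple_words)
```

Both lists describe a valid Mersenne Twister state derived from the same user-supplied number, yet they differ, so every value drawn afterwards differs too. To make two packages agree, the state words themselves would have to be transferred, or the seeding routine reimplemented identically on both sides.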
