Java random string generation and birthday paradox

I need to write a random string generation class that generates 7char strings from the character set 31-w980> and some alphabets (10 + 26-5, 5 vowels are omitted). simple math gives a set of 31 ^ 7 possible combinations of ~ 27.5 billion. I have questions regarding the bday paradox, I conducted several tests, and the number of duplicates increased exponentially. can i do something to avoid this?

At 1 million, duplicates encountered till now = 19 At 2 million, duplicates encountered till now = 69 At 3 million, duplicates encountered till now = 157 At 4 million, duplicates encountered till now = 280 At 5 million, duplicates encountered till now = 470 At 6 million, duplicates encountered till now = 662 At 7 million, duplicates encountered till now = 896 At 8 million, duplicates encountered till now = 1185 At 9 million, duplicates encountered till now = 1500 At 10 million, duplicates encountered till now = 1823 At 11 million, duplicates encountered till now = 2204 At 12 million, duplicates encountered till now = 2584 At 13 million, duplicates encountered till now = 3020 At 14 million, duplicates encountered till now = 3527 At 15 million, duplicates encountered till now = 4110 At 16 million, duplicates encountered till now = 4683 At 17 million, duplicates encountered till now = 5284 At 18 million, duplicates encountered till now = 5919 At 19 million, duplicates encountered till now = 6611 At 20 million, duplicates encountered till now = 7343 At 21 million, duplicates encountered till now = 8095 At 22 million, duplicates encountered till now = 8858 At 23 million, duplicates encountered till now = 9707 At 24 million, duplicates encountered till now = 10547 At 25 million, duplicates encountered till now = 11452 At 26 million, duplicates encountered till now = 12399 At 27 million, duplicates encountered till now = 13356 At 28 million, duplicates encountered till now = 14393 At 29 million, duplicates encountered till now = 15369 At 30 million, duplicates encountered till now = 16436 

The following is a test class:

 import java.util.Set; import org.apache.commons.lang3.RandomStringUtils; import com.google.common.collect.Sets; public class RandomUnivmylocaL { public static void main(String[] argv) { final int million = 1_000_000; final int iterations = 30; // 31 chars final char[] charArr = new char[] { '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z' }; // System.out.println(charArr.length); final Set<String> set = Sets.newHashSetWithExpectedSize( iterations * million); for (int i = 0; i < iterations; i++) { for (int j = 0; j < million; j++) { final String univCode = RandomStringUtils.random(7, charArr); set.add(univCode); } System.out.println("At " + (i + 1) + " million, " + "duplicates encountered till now = " + (((i + 1) * million) - set.size())); } System.out.println("done"); } } 
+6
source share
1 answer

This is a birthday paradox.

sqrt (27.5bn) = 165831, the paradoxical birthday formula for large M is 1.177 * sqrt (M), so when you create about 200,000 you will have a 50/50 chance of a problem by the time you get to the million you will have about 18 problems, etc.

Birthday problem - how many until a collision probability arises - about 200,000. Http://www.wolframalpha.com/input/?i=n%3D200000,+m%3D(31%5E7),+n+ - + m (1 + - + ((m-1)% 2Fm)% 5En)

Set n = 23.0 and m = 365 to see 23 people in the room equivalent. http://www.wolframalpha.com/input/?i=n%3D23.0,+m%3D365,+n+-+m(1+-+((m-1)%2Fm)%5En)

You can see how close your simulation is to the expected answer for large numbers.

http://www.wolframalpha.com/input/?i=n%3D30e6,+m%3D(31%5E7),+n+-+m(1+-+((m-1)%2Fm)%5En )

Quora article is good. http://www.quora.com/How-can-I-calculate-the-expected-number-of-cache-hits .

So, you need to increase the number of valid characters or use a longer string. OR - to use 7 digits, simply increase the counter. OR use random ones, and check if you have used this before, and reset when you are tired of looking for new numbers.

There are also pseudo-random generators that can cover space without repeated hits. 7 characters just do not help protect your decision.

+1
source

All Articles