The third column of the data file is the cumulative probability, i.e. the running sum of the second column.
To select a random name using the cumulative probability distribution:
- Generate a random number between 0 and 1.
- Find the first line whose cumulative probability is greater than that random number.
- Select the name on that line.
```python
import urllib2
import random
import bisect

url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
response = urllib2.urlopen(url)
names, cumprobs = [], []
for line in response:
    name, prob, cumprob, rank = line.split()
    cumprob = float(cumprob)
    names.append(name)
    cumprobs.append(cumprob)
```
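The selection steps listed earlier can be sketched with the `bisect` module that the code already imports. This is only a minimal sketch under some assumptions, not necessarily the exact code the answer intended: `pick_names` is a hypothetical helper, the sample data is made up so the block runs without network access, and the random number is scaled by the last cumulative value in case that column does not end at exactly 1.

```python
import bisect
import random


def pick_names(names, cumprobs, n, rng=random):
    """Draw n names using a list of cumulative probabilities (hypothetical helper)."""
    total = cumprobs[-1]  # assumption: rescale in case the column does not end at 1
    picks = []
    for _ in range(n):
        r = rng.random() * total              # uniform number in [0, total)
        i = bisect.bisect_right(cumprobs, r)  # first index with cumprobs[i] > r
        i = min(i, len(names) - 1)            # guard against floating-point rounding
        picks.append(names[i])
    return picks


# Made-up sample data in the same shape as the parsed `names` and `cumprobs` lists.
sample_names = ['ALPHA', 'BETA', 'GAMMA']
sample_cumprobs = [50.0, 80.0, 100.0]
print(pick_names(sample_names, sample_cumprobs, 5))
```

With the real data, the call would be something like `pick_names(names, cumprobs, 1000)`. Each lookup is a binary search, so a draw costs O(log n) in the number of names.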
Note: the alias method has better computational complexity, but that may not matter much for your use case of selecting just 1000 elements.
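For comparison, here is a rough sketch of one common formulation of the alias method (often attributed to Vose). It is illustrative only: `build_alias_table` and `alias_draw` are hypothetical names, the input is assumed to be a list of per-item probabilities summing to 1 (for example, the second column of the file after normalization), and the sample values are made up. Building the tables takes O(n) time and each draw is O(1), versus O(log n) per draw for a binary search.

```python
import random


def build_alias_table(probs):
    """Build alias tables from probabilities that sum to 1 (hypothetical helper)."""
    n = len(probs)
    scaled = [p * n for p in probs]  # rescale so the average weight is 1
    prob = [0.0] * n                 # chance of keeping column i when it is picked
    alias = [0] * n                  # fallback index used when column i is rejected
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large item donates enough weight to fill column s up to 1.
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover columns are full up to rounding error, so always keep them.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias


def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a column, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# Made-up probabilities for illustration.
prob, alias = build_alias_table([0.5, 0.3, 0.2])
print([alias_draw(prob, alias) for _ in range(5)])
```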