Choosing elements from a python list based on probability

I am creating a python script that randomly selects 1000 names from the list of male names that are here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html

This works fine and dandy, but I would like the names to be selected based on the probability column provided by the census text files (second column).

I have been trying to circle the last few hours, but I have not made any real progress, even looking for other answers.

Can someone help me or point me in the right direction? Thanks in advance:)

+6
source share
3 answers

Light weighted selection algorithm:

  • Assign each name its relative probability, so that the sum of all the probabilities is 1. This relative value is called "weight."

  • Choose a random number from 0 to 1

  • Go through the list by subtracting the weight of each item from your number when you go

  • When you go to 0 or lower, select the current item.

+5
source

The third column of the data file is the cumulative probability, the current sum of the second column.

To select a random name for a cumulative probability distribution:

  • Create a random number from 0 to 1,
  • Find the first line whose cumulative probability is greater than a random number.
  • Select a name in this row.

import urllib2 import random import bisect url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first' response = urllib2.urlopen(url) names, cumprobs = [], [] for line in response: name, prob, cumprob, rank = line.split() cumprob = float(cumprob) names.append(name) cumprobs.append(cumprob) # normalize the cumulative probabilities to the range [0, 1] cumprobs = [p/cumprobs[-1] for p in cumprobs] # print(cumprobs) # Generate 1000 names at random, using the cumulative probability distribution N = 1000 selected = [names[bisect.bisect(cumprobs, random.random())] for i in xrange(N)] print('\n'.join(selected)) 

Please note: the alias method has better computational complexity, but it may not be very important for your use case to select just 1000 elements.

+2
source

A quick and VERY dirty hack that will work for smaller datasets is simply to add the specified name several times equal to the weighted distribution. Note that this will consume a whole bunch of memory, especially in large datasets, so consider this as a quick implementation for small weighted distributions ONLY.

 import random filename = r"location/of/file" data = list() # accumulator with open(filename) as in_: for line in in_: name, prob, *_ = line.split() for _ in range(int(float(prob)*1000)): data.append(name) print(random.choice(data)) 
0
source

All Articles