None of these answers are particularly clear or simple.
Here is a simple and understandable method that is guaranteed to work.
accumulate_normalize_probabilities accepts a dictionary p that maps characters to frequencies OR . It displays a useful list of tuples from which to make a choice.
def accumulate_normalize_values(p): pi = p.items() if isinstance(p,dict) else p accum_pi = [] accum = 0 for i in pi: accum_pi.append((i[0],i[1]+accum)) accum += i[1] if accum == 0: raise Exception( "You are about to explode the universe. Continue ? Y/N " ) normed_a = [] for a in accum_pi: normed_a.append((a[0],a[1]*1.0/accum)) return normed_a
Productivity:
>>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200 } ) [('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]
Why does it work
The accumulation step turns each symbol into a gap between itself and the previous symbols: probability or frequency (or 0 in the case of the first symbol). These intervals can be used to select from (and, therefore, select the provided distribution) by simply scrolling through the list until a random number in the interval 0.0 β 1.0 (prepared earlier) is less than or equal to the current endpoint of the character interval .
normalization frees us from the need to make sure that everything sums up to a certain value. After normalization, the "vector" of probabilities is added up to 1.0.
The rest of the code for selecting and creating an arbitrarily long sample from the distribution is below:
def select(symbol_intervals,random): print symbol_intervals,random i = 0 while random > symbol_intervals[i][1]: i += 1 if i >= len(symbol_intervals): raise Exception( "What did you DO to that poor list?" ) return symbol_intervals[i][0] def gen_random(alphabet,length,probabilities=None): from random import random from itertools import repeat if probabilities is None: probabilities = dict(zip(alphabet,repeat(1.0))) elif len(probabilities) > 0 and isinstance(probabilities[0],(int,long,float)): probabilities = dict(zip(alphabet,probabilities))
Using:
>>> gen_random (['a','b','c','d'],10,[100,300,400,200]) ['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c']