To add some difficult numbers to the question above, I implemented a small program to create identifiers along the lines mentioned in the question, and these were the results of one of the test runs:
Length Count MD5 Base 62 4 400 3d0e 446 5 925 4fc04 1Myi 6 2434 0e9368 40V6 7 29155 e406d96 GBFiA 8 130615 7ba787c8 2GOiCm 9 403040 75525d163 YNKjL9 10 1302992 e1b3581f52 H47JAIs Total: 1869561
Each time the hash length increased, this was due to a collision, so there were six collisions in this case. A counter is the number of keys of a given length that were created before the collision. Column MD5 shows the last truncated key that was successfully added before a duplicate key error occurred. The column on the right shows the key in base 62 (if I did the conversion correctly).
It seems that with each additional character more and more keys are generated, which you imagine. I hope this approach will scale for custom content.
For anyone interested, here's how I got the base 62 view for the last column:
def base_62_encode(input): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." CLIST="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" rv = "" while input != 0: rv = CLIST[input % 62] + str(rv) input /= 62 return rv base62_id = base_62_encode(long(truncated_hash, 16))
(Added later :)
Here's a class that takes care of a few things related to creating these identifiers in the context of the Google App Engine. By default, the keys created by this class are not purely basic 62, since the first character of the GAE key name must be alphabetic. This requirement was considered here, using base 52 for the first character.
The primary base can be changed to something other than 62, changing the value of "clist", which was passed (or omitted) by the constructor. You could remove characters that are easy to mix, such as "1", "l", "i", etc.
Using:
keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5) small_id, modified = keygen.small_id(seed_value_from_sharded_counter)
Here is the class:
class LengthBackoffIdGenerator(object): """Class to generate ids along the lines of tinyurl. By default, ids begin with a alphabetic character. Each id that is created is checked against previous ones to ensure uniqueness. When a duplicate is generated, the length of the seed value used is increased by one character. Ids become gradually longer over time while remaining relatively short, and there are very few retries in relation to the number of keys generated. """ ids = set() def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', alpha_first=False, initial_length=5, check_db=False): self.clist = clist self.alpha_first = alpha_first if self.alpha_first: import re alpha_list = re.sub(r'\d', '', clist) if len(alpha_list) < 1: raise Exception, 'At least one non-numeric character is required.' self.alpha_list = alpha_list self.cls = cls self.length = initial_length self.check_db = check_db def divide_long_and_convert(self, radix, clist, input, rv): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." rv += clist[input % radix] input /= radix return (input, rv) def convert_long(self, input): rv = "" if self.alpha_first: input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv) while input != 0: input, rv = self.divide_long_and_convert(len(self.clist), self.clist, input, rv) return rv def hash_truncate_and_binify(self, input, length): """Compute the MD5 hash of an input string, truncate it and convert it to a long. input - A seed value. length - The length to truncate the MD5 hash to. """ from assessment.core.functions import md5_hash input = md5_hash(input)[0:length] return long(input, 16) def _small_id(self, input): """Create an ID that looks similar to a tinyurl ID. The first letter of the ID will be non-numeric by default. """ return self.convert_long(self.hash_truncate_and_binify(input, self.length)) def small_id(self, seed): key_name = self._small_id(seed) increased_length = False if self.check_db and not self.cls.get_by_key_name(key_name) is None: self.ids.add(key_name) if key_name in self.ids: self.length += 1 increased_length = True key_name = self._small_id(seed) self.ids.add(key_name) return (key_name, increased_length)