Tiny random identifier generation

Question

Tiny random identifier generation

I am trying to create unique identifiers for use in the Google App Engine application and would like to receive feedback on the possibility of the approach I'm going to use (questions at the end). I have read quite a few questions on this topic, but I do not remember meeting this specific approach.

I would like random looking identifiers like MD5 hashes, but I also want them to be small. Four to six characters, according to Tinyurl, would be ideal. The identifiers will be for content created by users in the context of my application, for example, for questions that people will write. The identifiers are not required to be random (this is great if they look like serial identifiers), but the approach I'm going to use is suitable for this, so this is not a problem.

People familiar with the Google App Engine will know that data warehouse entries are especially expensive and can lead to timeouts if there are too many of them in the same entity group. Shaded counters are an approach that is often used to avoid competition with the list on the same global counter and the failed transactions that go with it.

Along with getting short identifiers and avoiding competition in the record, I try to avoid a paradoxical birthday. I would like to prepare for the possibility of having millions of identifiers, even if it goes a little overboard.

I thought of using a fragment counter on the following lines:

The counter is omitted by users, so for each user there is a shard. Each counter object has its own account, which is specific for a given user, which increases when a new element is created by this user. The score is incremented regardless of whether the item was created successfully.
The basis of the identifier is an MD5 hash of the following line: "<user-email-address> | <last-counter-value>".
The resulting MD5 hash is then truncated, initially to four characters.
One global length value is stored. Whenever the previous steps result in a duplicate key (one of them assumes that this will happen rather quickly at the beginning), the length value will be increased by one. MD5 hashes for new identifiers will now be truncated with "length" characters, rather than four characters.
I do not want to disclose the user's email address, which suggests that some kind of hash may be a good way.

My questions are: do I correctly believe that in many ways to avoid competition for writing as a result of duplicate keys, and that arguing about competition in the length field is probably not a problem, especially for longer lengths? Can anyone describe the math here? Will the length increase rapidly to approximately the length of the MD5 hash, questioning the value of the whole approach? Would it be better to just go with a full (longer) MD5 hash so that everything is easier to maintain? Is there something I'm missing out on?

+6

google-app-engine md5

Eric walker May 02, '09 at 21:07

source share

3 answers

How about something like this:

If you want to use 4 character keys with a-zA-Z0-9, then you will have: 62 ^ 4 => 14 million possible values.
Break it into N sections: 0000 ... 1000, 1001 ... 2000, ..., ZZAA ... ZZZZ
Each section is represented by an entity with: start id end identifier current id

Now to generate the id:

randomly select a section. Use the startup key for each partition as the key. Start Transaction:
get_or_create (), which is part of the section.
if current id == end id, go to step 1
save current id
increase current identifier by one Transaction end

use the current id saved as your id.

If you select N as 140, each section will have 100,000 values. This would allow quite a few simultaneous inserts, limiting the number of attempts due to the choice of an "empty" section.

You may need to think about it more. Especially, how do you deal with a situation where you need to switch to 5 or 6-digit keys?

-Dave

+2

Dave May 03 '09 at 3:20

source share

To add some difficult numbers to the question above, I implemented a small program to create identifiers along the lines mentioned in the question, and these were the results of one of the test runs:

Length Count MD5 Base 62 4 400 3d0e 446 5 925 4fc04 1Myi 6 2434 0e9368 40V6 7 29155 e406d96 GBFiA 8 130615 7ba787c8 2GOiCm 9 403040 75525d163 YNKjL9 10 1302992 e1b3581f52 H47JAIs Total: 1869561

Each time the hash length increased, this was due to a collision, so there were six collisions in this case. A counter is the number of keys of a given length that were created before the collision. Column MD5 shows the last truncated key that was successfully added before a duplicate key error occurred. The column on the right shows the key in base 62 (if I did the conversion correctly).

It seems that with each additional character more and more keys are generated, which you imagine. I hope this approach will scale for custom content.

For anyone interested, here's how I got the base 62 view for the last column:

 def base_62_encode(input): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." CLIST="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" rv = "" while input != 0: rv = CLIST[input % 62] + str(rv) input /= 62 return rv base62_id = base_62_encode(long(truncated_hash, 16))

(Added later :)

Here's a class that takes care of a few things related to creating these identifiers in the context of the Google App Engine. By default, the keys created by this class are not purely basic 62, since the first character of the GAE key name must be alphabetic. This requirement was considered here, using base 52 for the first character.

The primary base can be changed to something other than 62, changing the value of "clist", which was passed (or omitted) by the constructor. You could remove characters that are easy to mix, such as "1", "l", "i", etc.

Using:

 keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5) small_id, modified = keygen.small_id(seed_value_from_sharded_counter)

Here is the class:

 class LengthBackoffIdGenerator(object): """Class to generate ids along the lines of tinyurl. By default, ids begin with a alphabetic character. Each id that is created is checked against previous ones to ensure uniqueness. When a duplicate is generated, the length of the seed value used is increased by one character. Ids become gradually longer over time while remaining relatively short, and there are very few retries in relation to the number of keys generated. """ ids = set() def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', alpha_first=False, initial_length=5, check_db=False): self.clist = clist self.alpha_first = alpha_first if self.alpha_first: import re alpha_list = re.sub(r'\d', '', clist) if len(alpha_list) < 1: raise Exception, 'At least one non-numeric character is required.' self.alpha_list = alpha_list self.cls = cls self.length = initial_length self.check_db = check_db def divide_long_and_convert(self, radix, clist, input, rv): "Inspired by code at http://en.wikipedia.org/wiki/Base_36." rv += clist[input % radix] input /= radix return (input, rv) def convert_long(self, input): rv = "" if self.alpha_first: input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv) while input != 0: input, rv = self.divide_long_and_convert(len(self.clist), self.clist, input, rv) return rv def hash_truncate_and_binify(self, input, length): """Compute the MD5 hash of an input string, truncate it and convert it to a long. input - A seed value. length - The length to truncate the MD5 hash to. """ from assessment.core.functions import md5_hash input = md5_hash(input)[0:length] return long(input, 16) def _small_id(self, input): """Create an ID that looks similar to a tinyurl ID. The first letter of the ID will be non-numeric by default. """ return self.convert_long(self.hash_truncate_and_binify(input, self.length)) def small_id(self, seed): key_name = self._small_id(seed) increased_length = False if self.check_db and not self.cls.get_by_key_name(key_name) is None: self.ids.add(key_name) if key_name in self.ids: self.length += 1 increased_length = True key_name = self._small_id(seed) self.ids.add(key_name) return (key_name, increased_length)

+2

Eric walker May 03 '09 at 4:25

source share

chmike · Accepted Answer · 2009-05-08T09:40:46+0000

This algorithm is smart and really minimizes write operations. Thus, the length value will converge to the balance between the shortest and the absence of collisions.

You can reduce the collision avoidance limit by using the strategies used in the hash tables. Try other unique identifiers before backing off to increase length.

Therefore, I would suggest adding a test counter to the end of the hash line initialized to 0. If the generated identifier is already in use, increase the counter and try again until the maximum value of the counter is reached after you increase the length.

As a result, you will get more efficient use of your ID address space and much less frequent increments.

Regarding your question about MD5 length limit, I think that choosing MD5 is redundant. You do not need a cryptographic (pseudo) secure hash. You need a random bit generator for which you can use crc32 (or adler, which is faster). Especially if the code must be running on a mobile phone. To implement a random bit generator using crc32, add a 32 bit value to the line in the hash and initialize it with a constant value of your choice (seed). Then calculate crc32 on this line. If you need more bytes, write the resulting crc32 value to 32 bits before the line and recount crc32. Iterate until you have enough bits. You can replace crc32 with the algorithm of your choice. This gives a cheap and fast random bit generator where the seed constant is the source constant.

Using this algorithm, you minimize the number of random bits generated and also do not have an upper limit on the length.

As for the encoding, you did not specify a restriction. Can you use upper and lower case letters with numbers? Your example uses an alphabet of 36 different ASCII values. If you have a pseudo random number generator described above that can generate as many bytes as needed, just specify the length as the number of letters of your identifier and select the next letter with the next random byte module. You will understand exactly how many bytes to generate at a time and generating an identifier is trivial.

Tiny random identifier generation

More articles: