How youtube calculates a unique 11-digit code for each video

Youtube has a unique 11-digit code for each video. the code contains 1-9,AZ,az , and some characters, such as +_* , etc.

How will they calculate this unique code for each video? I am working on something where I would like to assign a unique code for each entry, hence the question.

My questions / concerns:

  • If they do it on the fly (when the videos are sent), then they will need to check if the code prepared for the video already exists or not? This would be a costly operation through a huge dataset such as theirs.
  • Whether they will run a batch job every night or every month, which creates unique codes and writes them to the database. Then, when the video is posted, it simply takes the code and puts it as "used"
  • Would it be advisable to take an automatically generated and automatically increasing ID column for each record in the database, and then somehow convert this unique ID column to an 11-digit code?

My goal:

  • create a unique code for writing to the table.
  • The user can share the URL with this unique code with anyone.
  • When someone comes in through a unique code. Then their β€œarrival” is attached to the original user who shared the URL with a unique code.
+6
source share
4 answers

Read the GUID and UID in general.

In most cases, if you use a database that will generate a unique identifier for you, then this unique identifier can be encoded in numbers and letters to shorten the total string.

http://en.wikipedia.org/wiki/Globally_unique_identifier

The shortening of the line is related to how you encode the value; it does not actually change it.

For example, the number 15 in base 10 uses two digits, in hexadecimal it uses one digit (f) in binary, it uses 4 (1111).

In the same way, you can use az, AZ, 0-9 and get base 62 for encoding numbers into strings using a much smaller number of digits than using base 10.

This is not the only approach, but (especially if you already have database rows), this is the easiest. You don't even need to pad to 11 if you really don't want to - but adding any number 0 to the beginning of the encoded line does not change its value.

Java even provides functions for this, although the maximum radius on them is 36:

http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#toString%28int,%20int%29

+2
source

The problem with the hashing function for the full set of possible URLs, and then checking it against an indexed database, is that it removes synchronization capabilities. Think about how long it takes to download a video, checking it against your database requires almost instant, which is not a problem. The same problem arises when you think about precomputing: synchronization requires one access point if you want to use distributed computers, and I'm sure they do. I think your third point is probably the closest to the correct one, and then some code is somehow encoded into a longer number for some reason (I'm not really sure if this is an advantage over the cost int, who has a good reason?)

+1
source

Here is a way to do it efficiently, as well as make it random: -

  • Make the size M hash table as large as possible.

  • generate the first M-numbers in random order using a search in the hash table.

  • when exhausted, do what the algorithm suggests in the link below (sorry to reuse a similar problem).

Create unique phone numbers

Edit: - I know that this solution applies to numbers, but you can always translate numbers into characters using a simple mapping for each digit.

0
source

All this back and forth led me to do some more research on the youtube backend. Here is what I came up with.

This leads me to think that they use MySQL to store video metadata. Some of the following factors depend on the assumption that they are using a relational data warehouse.

I think the 11-character base64 identifier is actually a 64-bit value based on base64. 64^11 = (2^6)^11 = 2^66 , which is too close to 2^64 to be a match.

I strongly suspect that part of this identifier comes from the fragment identifier in which the metadata of the video is stored. Say they dedicate 24 bits (16,777,216) to the fragment identifier. They probably use this entire range, but they don't have 16 million shards. Instead, they are likely to assign each shard a range of these shard identifiers to simplify the revision. The fragment identifier to which this video is assigned is probably pseudo-random. When the shard begins to fill, they separate it and update the ranges. Plain.

At least part of the remaining bits is probably an automatically incrementing value local to the fragment.

If there are any bits after this, they are probably busy with a pseudo-random number, timestamp, or something similar. There is also the possibility that they include other data related to a particular implementation, but this can cause big problems if they ever had to migrate, so I suspect that they will shy away from it.

0
source

All Articles