Unique identifiers for users

If I have a table with a few hundred users, I would just make the auto-increment userID column the primary key. But if we suddenly have a million or 5 million users, this becomes very difficult, because I would want to start distributing the database across several nodes, and in that case an auto-increment primary key is useless, since each node will generate the same primary keys.

Is the solution to use natural primary keys? I find it very hard to think of a natural primary key for this group of users. The problem is that they are all young people, so they do not have national insurance numbers or any other unique identifier that I can think of. I could create a multi-column primary key, but there is still a chance, however small, of duplicates.

Does anyone know of a solution?

thanks

+7
sql primary-key natural-key
9 answers

I would say that, for now, keep auto-increment for the user ID.

When you suddenly get that rush of millions of users, you can consider changing it.

In other words, solve the problem when you have it. "Premature optimization is the root of all evil."

To answer the question: some auto-increment implementations let you seed the auto-increment, so that different nodes produce different values. This avoids the problem while still letting you use auto-increment.
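As a rough illustration of the seeding idea, here is a minimal Python sketch that simulates two nodes whose counters start at different seeds and advance by a shared step. The names `node_id_sequence` and `MAX_NODES` are hypothetical, and the step of 10 is only an assumed upper bound on the number of nodes; a real database would do this through its own seed and increment settings.

```python
from itertools import count

# Hypothetical upper bound on the number of nodes; it doubles as the step size.
MAX_NODES = 10


def node_id_sequence(seed, step=MAX_NODES):
    """Return an iterator over seed, seed + step, seed + 2 * step, ..."""
    return count(start=seed, step=step)


# Each node gets its own seed, so the two sequences never overlap.
node1 = node_id_sequence(seed=1)
node2 = node_id_sequence(seed=2)

ids_node1 = [next(node1) for _ in range(3)]
ids_node2 = [next(node2) for _ in range(3)]

print(ids_node1)  # [1, 11, 21]
print(ids_node2)  # [2, 12, 22]
```

The trade-off is that the step fixes the maximum number of nodes up front, so it is worth choosing it generously.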

+11

The standard solution here is to use a GUID. However, GUIDs do not perform well when indexed.

+8

GUIDs are good, but are subject to collision (albeit rarely).

It may be a non-standard solution, but I'm going to throw it out there:

You can use auto-incrementing numbers, but partition the number space according to how you expect to distribute in the future.

So let's say you have 3 servers. Allocate the identifiers as follows:

Server 1: 0 - 9,999,999
Server 2: 10,000,000 - 19,999,999
Server 3: 20,000,000 - 29,999,999

Even within the limits of a 32-bit int, that should leave plenty of room for expansion (if you are worried, you could make each range 100,000,000 wide), and it effectively guarantees uniqueness across the system.
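The range scheme above could be sketched roughly as follows in Python. `RangeAllocator` and `SERVER_RANGES` are hypothetical names used only for illustration; the ranges are the ones from the example, and the exhaustion check is an added assumption about how you would want a full range to behave.

```python
class RangeAllocator:
    """Hands out consecutive IDs from one server's inclusive range."""

    def __init__(self, low, high):
        self.next_id = low
        self.high = high

    def allocate(self):
        # Refuse to spill into the next server's range.
        if self.next_id > self.high:
            raise OverflowError("ID range exhausted for this server")
        value = self.next_id
        self.next_id += 1
        return value


# Ranges taken from the three-server example above.
SERVER_RANGES = {
    1: (0, 9_999_999),
    2: (10_000_000, 19_999_999),
    3: (20_000_000, 29_999_999),
}

allocators = {
    server: RangeAllocator(low, high)
    for server, (low, high) in SERVER_RANGES.items()
}

# The first ID issued by each server falls at the start of its own range.
first_ids = {server: alloc.allocate() for server, alloc in allocators.items()}
print(first_ids)  # {1: 0, 2: 10000000, 3: 20000000}
```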

+2

If you need millions of identifiers and there are many nodes, make the primary key composite:

NodeID int --unique for each node 2 or 4 byte
UserID int --auto increment 8 byte, repeats for each node

which is better than a GUID (smaller, uses less memory, and will be faster)
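To make the composite key concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table layout, the `insert_user` helper, and the way the per-node counter is computed are all assumptions for illustration; on a real node the local auto-increment would supply `user_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        node_id INTEGER NOT NULL,   -- unique per node
        user_id INTEGER NOT NULL,   -- per-node counter, repeats across nodes
        name    TEXT    NOT NULL,
        PRIMARY KEY (node_id, user_id)
    )
    """
)


def insert_user(conn, node_id, name):
    """Insert a user under the given node and return the composite key."""
    # Stand-in for each node's own auto-increment: next value for this node only.
    row = conn.execute(
        "SELECT COALESCE(MAX(user_id), 0) + 1 FROM users WHERE node_id = ?",
        (node_id,),
    ).fetchone()
    user_id = row[0]
    conn.execute(
        "INSERT INTO users (node_id, user_id, name) VALUES (?, ?, ?)",
        (node_id, user_id, name),
    )
    return node_id, user_id


key_a = insert_user(conn, 1, "alice")
key_b = insert_user(conn, 2, "bob")
key_c = insert_user(conn, 1, "carol")

print(key_a, key_b, key_c)  # (1, 1) (2, 1) (1, 2)
```

The same `user_id` value can appear on two nodes, but the pair `(node_id, user_id)` stays unique.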

+2

Never use natural primary keys unless you want poor performance and the potential for bad data. There are very few natural keys that cannot change over time, especially names. If the natural key changes, all child records must change as well. This is clearly a bad thing.

You can use GUIDs. But 5 million rows is nothing in terms of data volume and probably will not require any changes. We have more than 10,000,000 people in our system, and we only have a medium-sized database, without using or needing GUIDs.

+1

A GUID is the easy way out, but ...

How distributed does it need to be? If there is a limited number of databases, you can give each database a range of numbers to use. So, for example, the first database generates numbers in the range from 0 to 999,999, and the next one from 1,000,000 to 1,999,999. That way each of them can generate user IDs without colliding with the others. If each database holds a unique number identifying it, the ranges can be derived automatically from that number.

I don't think you can use an auto-increment column for this, but a stored procedure can generate numbers this way.

0

GUIDs make poor clustered keys. If you make the GUID index non-clustered, you still need a clustered index on another column.

Use an integer key and, for each new node / site, do one of the following:

  • Increment in steps of 10. When adding nodes, just start them at 2, 3, etc.
  • Use ranges, e.g. 1 → 999,999, 1,000,000 → 1,999,999, etc.
  • And do not forget negative numbers too. For example, you can have IDENTITY (-1, -1) for the second node

If you have nodes / sites, then a second column holding a SiteID will work.

0

If you use MSSQL, you can declare the PK of your table as UNIQUEIDENTIFIER and set its default value or binding to NEWID().

0

I suggest you never consider GUIDs; one reason is that I am currently having problems with them myself. If you have millions of users, you will probably need a high degree of concurrency, and GUIDs will ruin your life on inserts and deletes. You will have an index on them, and by default it will be a clustered index, which means inserts and deletes physically move records. On top of that, GUIDs are not sequential, so there is a high probability that each new row lands somewhere in the middle of the existing pages rather than at the end. Inserts and deletes therefore become very expensive overall, and if you drop the index, your selects become expensive instead.

In particular, if you have several tables with relationships between them, do not consider GUIDs as the primary key.

I would recommend the following two solutions.

  • If you can create compound keys, that would be ideal. For example, in banking software, (branchId, transactionId) would become the primary key, where branchId identifies the node inserting the record and transactionId is the auto-number within that branch, so you get uniqueness all the way through.

  • If the above does not appeal to you, you can use a GUID as a unique column but add an auto-increment number as the primary key, which will help reduce the overall cost. For example, when a client (node) sends data over (web services) RPC and the record is inserted into the server database, the auto-number is generated there, and that auto-number can be used for later selects, deletes, or updates. The client does not need to know about this auto-number.

I understand that the second solution is a bit confusing and complicated, but it is still better than using GUIDs as the PK. If solution 1 is applicable, go with it.

When I say "cost", I mean not only processing time but also blocking (waiting) time, which is a complete waste of money: your quad-core server may do only half the work it could, and more locks mean more chances of deadlock. So, my friend, never use GUIDs.
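The second solution might look roughly like this, sketched with Python's built-in `sqlite3` and `uuid` modules. The schema and the `insert_from_client` helper are assumptions for illustration only: the GUID is stored in a unique column so the client can refer to its record, while the server-side auto-increment integer serves as the primary key.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- server-side surrogate key
        guid TEXT NOT NULL UNIQUE,               -- client-generated identifier
        name TEXT NOT NULL
    )
    """
)


def insert_from_client(conn, guid, name):
    """Store a record sent by a client and return the server-side integer key."""
    cur = conn.execute(
        "INSERT INTO users (guid, name) VALUES (?, ?)", (guid, name)
    )
    return cur.lastrowid


# The client only ever knows its own GUID.
client_guid = str(uuid.uuid4())
server_id = insert_from_client(conn, client_guid, "alice")

# Later lookups by GUID resolve to the integer key used internally.
row = conn.execute(
    "SELECT id, name FROM users WHERE guid = ?", (client_guid,)
).fetchone()
print(server_id, row)  # 1 (1, 'alice')
```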

Regards, Mubashar

0
