PostgreSQL using UUID vs Text as primary key

Question

Our current PostgreSQL database uses GUIDs as primary keys and saves them as a text field.

My initial reaction is that performing even a minimal Cartesian join would be an indexing nightmare when it comes to finding all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

I think we should use the uuid type, since it stores the GUID in binary form whereas text does not, and the amount of indexing you get on a text column is minimal.

Changing them would be a significant project, and I wonder whether it is worth it.

+7

uuid postgresql primary-key

Scottie Nov 20 '15 at 21:58

2 answers

When working with UUIDs, store them as the uuid data type. Always. There is simply no good reason to even consider text as an alternative. Input and output go through the text representation by default anyway, and the cast is cheap.

The text data type requires more space in RAM and on disk, is slower to process, and is more error-prone. @khampson's answer provides most of the rationale. Oddly enough, he does not seem to reach the same conclusion.

All of this has been asked, answered, and discussed before. Related questions on dba.SE give detailed explanations.

`bigint`?

You may not need UUIDs (GUIDs) at all. Consider bigint. It takes only 8 bytes and is faster in every respect. Its range is often underestimated:

 -9223372036854775808 to +9223372036854775807

That is 9.2 million million million positive numbers. Tens or hundreds of millions of rows do not even come close.

In other words, if you burn 1 million IDs per second (an insanely high number), you can keep doing so for 292,471 years, and then for another 292,471 years with the negative numbers.
UUIDs are really only meant for distributed systems and other special cases.
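
The 292,471-year figure can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python; the function name is hypothetical, and it assumes 365-day years, which is what reproduces the number quoted above.

```python
# Upper bound of a signed 64-bit integer, i.e. the positive half of the bigint range.
BIGINT_MAX = 2**63 - 1

# Assumption: a year is counted as 365 days (no leap days).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_until_exhausted(ids_per_second: int) -> int:
    """Whole years before the positive bigint range runs out at a constant rate."""
    return BIGINT_MAX // (ids_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    print(years_until_exhausted(1_000_000))
```

Calling years_until_exhausted with a lower, more realistic rate shows how much headroom remains in practice.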

+5

Erwin Brandstetter Nov 21 '15 at 0:43

khampson · Accepted Answer · 2015-11-20T22:41:32+0000

As @Kevin pointed out, the only way to know for sure with your exact data is to compare and contrast both methods. From what you have described, though, I don't see why this would be any different from any other case where a string is either the primary key of a table or part of a unique index.

What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory comparisons against the index will take a little longer. But I would not advocate premature optimization if doing it would be painful.

In my experience, I have seen very good performance from a unique index on md5sums in a table with billions of rows. I have found that it tends to be other factors in a query that lead to performance problems. For example, when you need to query a very large swath of the table, say hundreds of thousands of rows, a sequential scan becomes the better choice, so that is what the query planner picks, and it can take much longer.

There are other mitigating strategies for that kind of situation, such as chunking the query and then UNIONing the results (for example, manually simulating what Hive or Impala would do in the Hadoop world).

Re: your concern about text indexing: while I'm sure there are cases where a dataset produces a key distribution that performs terribly, GUIDs, much like md5sums, sha1 hashes, etc., should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

One of the big factors in how well an index performs is the number of unique values. For this reason, a Boolean index on a table with a large number of rows is unlikely to help, since it will end up with a huge number of rows colliding on each of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, will likely have a huge number of values with no collisions (in theory by definition, since they are GUIDs).
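
To make the point about distinct values concrete, here is a rough Python sketch (not PostgreSQL itself) that counts how many rows share each key value on average. The helper name and the row count are made up for illustration.

```python
import uuid


def avg_rows_per_key(keys) -> float:
    """Average number of rows that share one distinct key value."""
    keys = list(keys)
    return len(keys) / len(set(keys))


ROWS = 10_000  # hypothetical table size

# A Boolean column has at most two distinct values here, so each matches half the rows.
bool_column = [i % 2 == 0 for i in range(ROWS)]

# Random (version 4) UUIDs: a repeat among 10,000 values is practically impossible.
uuid_column = [uuid.uuid4() for _ in range(ROWS)]

if __name__ == "__main__":
    print(avg_rows_per_key(bool_column))  # thousands of rows per value
    print(avg_rows_per_key(uuid_column))  # about one row per value
```

An index lookup that still leaves thousands of candidate rows saves little work, whereas one that narrows the search to a single row is about as selective as an index can be.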

Edit in response to a comment from the OP:

Not literally the same, no. However, I am saying that they should have very similar performance in this particular case, and I don't see why optimizing up front is worth it, especially given that you say it would be a very involved task.

You can always change things later if you run into performance problems in your particular environment. However, as I mentioned earlier, I think that if you hit that scenario, there are other things likely to yield better performance than changing the PK data types.

A UUID is a 128-bit data type (so 16 bytes), while text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that means a minimum of 33 bytes (32 hexadecimal characters plus a 1-byte header), but it can vary significantly depending on the encoding used.
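
These sizes are easy to verify. The following sketch uses Python's standard uuid module purely to count bytes and characters; the example value is arbitrary, and the 1-byte header is an assumption (the smaller of the two overheads mentioned above).

```python
import uuid

# Arbitrary example value, not taken from any real table.
example = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

binary_bytes = len(example.bytes)    # raw 128-bit value
hex_chars = len(example.hex)         # hexadecimal digits without hyphens
canonical_chars = len(str(example))  # usual 8-4-4-4-12 form with hyphens

# Assumption: a 1-byte length header on the text value.
TEXT_HEADER_BYTES = 1

if __name__ == "__main__":
    print(binary_bytes)                         # size as a binary UUID
    print(hex_chars + TEXT_HEADER_BYTES)        # text without hyphens
    print(canonical_chars + TEXT_HEADER_BYTES)  # text with hyphens
```

Note that a GUID stored in the usual hyphenated form is 36 characters, so under the same assumption it takes 37 bytes as text rather than the 33-byte minimum.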

So, with that in mind, indexes on text-based UUIDs will certainly be larger, since the values are larger, and comparing two strings is in theory less efficient than comparing two numeric values. But that is unlikely to make a huge difference in this case, at least not in ordinary cases.

I would not optimize up front when doing so carries a significant cost and will probably never be needed. That bridge can be crossed if the time comes (although I would pursue other query optimizations first, as I mentioned above).

Regarding whether Postgres knows that the string is a GUID: by default it definitely does not. As far as it is concerned, this is just a unique string. That should be fine for most cases, e.g. matching rows and the like. If you need behavior that specifically requires a GUID (for example, comparisons where GUID semantics may differ from purely lexical ones), you can always cast the string to a UUID, and Postgres will treat the value as such for the duration of that query.

E.g. for a text column foo, you can write foo::uuid to cast it to uuid.
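
As a rough analogy for what the cast buys you, the sketch below uses Python's standard uuid module rather than PostgreSQL. It shows that several text spellings of one GUID differ as strings but are the same value once parsed. The example value is arbitrary.

```python
import uuid

# Three text spellings of the same GUID (arbitrary example value).
spellings = [
    "123e4567-e89b-12d3-a456-426614174000",
    "123E4567-E89B-12D3-A456-426614174000",
    "123e4567e89b12d3a456426614174000",
]

# Compared as plain strings, all three are different.
distinct_as_text = len(set(spellings))

# Parsed into UUID values, they collapse to a single value.
distinct_as_uuid = len({uuid.UUID(s) for s in spellings})

if __name__ == "__main__":
    print(distinct_as_text, distinct_as_uuid)
```

PostgreSQL's uuid input likewise accepts upper-case digits and omitted hyphens, so a text key column can hold several spellings that only compare equal after the cast.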

There is also a module for generating UUIDs, uuid-ossp.
