As @Kevin pointed out, the only way to know for sure with your data is to benchmark and compare both methods, but from what you described, I don't see why this would be any different from any other case where a column is either the primary key of the table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a little longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that it is other factors about a query that typically cause performance problems. For example, when you need to retrieve a very large slice of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that is what the query planner picks, and those queries can take much longer.
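To illustrate what I mean, here is a hedged sketch of how that shows up in the planner output; the table and column names (events, md5_id, created_at) are hypothetical, and what you actually see will depend on your data distribution:

```sql
-- Hypothetical large table with a unique index on a text md5 column.
-- A single-key lookup should use the index:
EXPLAIN SELECT * FROM events WHERE md5_id = md5('some-key');

-- A query matching a large slice of the table will often be planned
-- as a sequential scan instead, which is where the time goes:
EXPLAIN SELECT * FROM events
WHERE created_at > now() - interval '1 year';
```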
There are other mitigating strategies for that sort of situation, such as chunking the query and then UNIONing the results (e.g. manually mimicking what would be done in Hive or Impala in the Hadoop world).
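As a rough sketch of that chunk-and-UNION idea (the table, columns, and date boundaries here are all hypothetical):

```sql
-- Run each chunk as its own smaller query, then stitch them together.
SELECT id, payload FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'
UNION ALL
SELECT id, payload FROM events
WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01'
UNION ALL
SELECT id, payload FROM events
WHERE created_at >= '2024-03-01' AND created_at < '2024-04-01';
```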
Re: your concern about text indexing: while I'm sure there are cases where a dataset produces a key distribution that performs horribly, GUID-like values such as md5sums, sha1s, etc. should index quite well and not require sequential scans (provided, as I mentioned above, you are not querying a huge slice of the table).
One of the factors that affects how well an index works is the number of unique values. For that reason, an index on a boolean in a table with a large number of rows is unlikely to help, since it will essentially have a huge number of row collisions for each of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, can have a huge number of values with no collisions (in theory all of them, since they are GUIDs).
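If you want to check what Postgres believes the cardinality of a column to be, one way (assuming the table has been ANALYZEd; the table and column names are hypothetical) is to query pg_stats:

```sql
-- n_distinct near 2 or 3 means boolean-like, a poor index candidate;
-- n_distinct = -1 means Postgres believes every value is unique.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'events'
  AND attname IN ('is_active', 'md5_id');
```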
Edit in response to a comment from OP:
Not literally the same, no. What I am saying, though, is that they should have very similar performance for this particular case, and I don't see why it's worth optimizing up front, especially given that you say it would be a significant undertaking.
You can always change things later if performance does become a problem in your particular environment. However, as I mentioned earlier, I think if you hit that scale, there are other strategies that would be more likely to yield better performance than changing PK data types.
A UUID is a 128-bit data type (i.e. 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but it can vary significantly depending on the encoding used.
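You can verify those sizes directly with pg_column_size; the numbers in the comments are what I would expect on a typical install, with the text values using the 1-byte short header:

```sql
SELECT
  pg_column_size('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid) AS uuid_bytes, -- 16
  pg_column_size(md5('x'))                                     AS md5_bytes,  -- 33 (32 hex chars + 1)
  pg_column_size('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::text) AS guid_bytes; -- 37 (36 chars + 1)
```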
So, with that in mind, indexes on text UUIDs will certainly be larger, since the values themselves are larger, and comparing two strings versus two numeric values is in theory less efficient, but it is not something that should make a huge difference in this case, at least not in the typical case.
I would not optimize up front when doing so would carry a significant cost and will probably never be needed. That bridge can be crossed if that time ever comes (although I would pursue the other query optimizations I mentioned above first).
Regarding whether Postgres knows that a string is a GUID: it definitely does not by default. As far as Postgres is concerned, it is just a unique string. But that should be fine for most cases, e.g. matching rows and so on. If you find you need some GUID-specific behavior (for example, some equality-based matching where GUID comparison might differ from a purely lexical one), then you can always cast the string to uuid, and Postgres will treat the value as such for that query.
e.g. for a text column foo, you could do foo::uuid to cast it to uuid.
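For instance, used in a query (the table name foo_table is hypothetical):

```sql
-- Cast the text column to uuid so the comparison uses uuid semantics.
SELECT *
FROM foo_table
WHERE foo::uuid = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid;
```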
There is also a module for generating uuids, uuid-ossp.
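For example (assuming the contrib extensions are installed on your system):

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Generate a random (version 4) UUID:
SELECT uuid_generate_v4();
```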