What are the benefits of using numeric row identifiers in MySQL?

I'm new to SQL, and thinking about my datasets relationally rather than hierarchically is a big shift for me. I'm hoping to get some insight into the performance (in terms of both storage size and processing speed) versus the design complexity of using numeric row identifiers as a primary key instead of string values that are more meaningful.

Specifically, this is my situation. I have one table ("parent") with a few hundred rows, in which one column is a string identifier (10-20 characters) that would seem to be the natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (In fact, I have several tables of both types with a complex set of references among them, but I think this captures the gist.)

So I need a column in the child table that identifies a row in the parent table. Naively, it seems that declaring that column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would cause a huge performance hit, in terms of both storage space and query time, and that I should therefore add a numeric (probably auto_increment) id column to the parent table and use it as the reference in the child. But because the data I'm loading into MySQL doesn't already have such numeric identifiers, this means more complexity in my code and more opportunities for bugs. Even worse, since I'm doing exploratory data analysis, I may want to mess around with the values in the parent table without touching the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric ids. (I'd probably solve this by storing the identifiers in a third table or something silly like that.)

So my question is: are there optimizations I may not know about which mean that a column with hundreds of thousands or millions of rows, repeating just a few hundred string values over and over, is less wasteful than it first appears? I don't mind a modest efficiency compromise in favor of simplicity, since this is for data analysis rather than production, but I worry that I'll paint myself into a corner where everything I want to do takes a huge amount of time to run.

Thanks in advance.

+6
4 answers

I would not be concerned primarily with space considerations. An integer key typically occupies four bytes. A varchar(20) will occupy between 1 and 21 bytes, depending on the length of the string. So a varchar(20) key will typically occupy more space than an integer key, but not an extraordinary amount more.
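To put rough numbers on that, here is a back-of-the-envelope sketch using the sizes quoted above. The row count (1,000,000) and the average identifier length (15 characters) are assumptions for illustration, as is the one-byte-per-character encoding; the function names are hypothetical.

```python
# Rough storage comparison for one key column, using the sizes quoted above:
# a 4-byte integer versus a VARCHAR costing (length + 1) bytes per value.
# Assumption: single-byte characters, so bytes == characters.

INT_BYTES = 4
LENGTH_PREFIX_BYTES = 1


def varchar_bytes(value):
    """Bytes for one VARCHAR value, assuming one byte per character."""
    return len(value) + LENGTH_PREFIX_BYTES


def extra_bytes(rows, avg_length):
    """Extra bytes a string key column costs compared with an integer key."""
    return rows * ((avg_length + LENGTH_PREFIX_BYTES) - INT_BYTES)


print(varchar_bytes("homer"))                  # 6
print(extra_bytes(1_000_000, 15) / 1_000_000)  # 12.0 (megabytes)
```

Under these assumptions the string key costs about 12 MB more per million child rows, before counting any index on that column, which stores its own copy of the key.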

Both, by the way, can take advantage of indexes. So access speed is not particularly different (although longer or variable-length keys will have a marginal effect on index performance).

There are better reasons to use an auto-incremented primary key.

  • You know which values were inserted most recently.
  • If duplicates appear (which should not happen for a primary key, of course), it is easy to determine which to remove.
  • If you decide to change the "name" of one of the entries, you do not have to update every table that refers to it.
  • You do not have to worry about leading spaces, trailing spaces, and other character oddities.

You do pay for this extra functionality with four more bytes per record, devoted to something that may not seem useful. However, worrying about that kind of efficiency is premature and probably not worth the effort.
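The renaming point can be sketched in a few lines. This uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere; the table layout, column names, and sample values are assumptions for illustration, and MySQL would spell the key column `INT AUTO_INCREMENT PRIMARY KEY` instead.

```python
import sqlite3


def rename_parent_demo():
    """Rename a parent row and show that child rows need no update."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE parent (
            id   INTEGER PRIMARY KEY AUTOINCREMENT,
            name VARCHAR(20) NOT NULL UNIQUE  -- the "natural" identifier
        );
        CREATE TABLE child (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            parent_id INTEGER NOT NULL REFERENCES parent(id),
            name      VARCHAR(20) NOT NULL
        );
    """)
    homer_id = conn.execute(
        "INSERT INTO parent (name) VALUES ('homer')").lastrowid
    conn.executemany(
        "INSERT INTO child (parent_id, name) VALUES (?, ?)",
        [(homer_id, "bart"), (homer_id, "lisa")],
    )
    # Renaming touches exactly one row in parent; child is left alone.
    conn.execute("UPDATE parent SET name = 'homer_j' WHERE id = ?", (homer_id,))
    rows = conn.execute("""
        SELECT p.name, c.name
        FROM child c JOIN parent p ON p.id = c.parent_id
        ORDER BY c.id
    """).fetchall()
    conn.close()
    return rows


print(rename_parent_demo())  # [('homer_j', 'bart'), ('homer_j', 'lisa')]
```

With the string itself as the key, the same rename would have to be propagated to every referencing row, either by hand or through `ON UPDATE CASCADE`.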

+8

Gordon is right (which is no surprise).

Here are the things not to worry about, in my opinion.

When you're dealing with dozens of megarows or fewer, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about $100.

INTs and VARCHARs can both be indexed quite efficiently. You won't see much difference in time.

Here's what you should worry about.

There is one major pitfall in index performance that you might hit with character indexes. You want the columns on which you create indexes to be declared NOT NULL, and you never want to run a query that says

  WHERE colm IS NULL /* slow! */ 

or

  WHERE colm IS NOT NULL /* slow! */ 

This kind of thing defeats indexing. In a similar vein, your performance will suffer badly if you apply functions to columns in your search. For example, don't do this, because it also defeats indexing.

  WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */ 
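A common way around this particular case is to leave the column bare and express the condition as a prefix match; MySQL can use a B-tree index for `LIKE` with a constant prefix. The sketch below only checks that the two forms select the same rows. It runs on Python's built-in sqlite3 module rather than MySQL, and the table name `t` and the sample values are assumptions for illustration.

```python
import sqlite3


def prefix_queries():
    """Return rows matched by the function-wrapped and the bare-column form."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (colm VARCHAR(20) NOT NULL)")
    conn.execute("CREATE INDEX idx_colm ON t (colm)")
    conn.executemany(
        "INSERT INTO t (colm) VALUES (?)",
        [("abcdef",), ("abc",), ("abx",), ("zabc",)],
    )
    # Function applied to the column: the index on colm cannot narrow this.
    wrapped = conn.execute(
        "SELECT colm FROM t WHERE SUBSTR(colm, 1, 3) = 'abc' ORDER BY colm"
    ).fetchall()
    # Same condition on the bare column, written as a prefix match.
    bare = conn.execute(
        "SELECT colm FROM t WHERE colm LIKE 'abc%' ORDER BY colm"
    ).fetchall()
    conn.close()
    return wrapped, bare


wrapped, bare = prefix_queries()
print(wrapped)  # [('abc',), ('abcdef',)]
print(bare)     # [('abc',), ('abcdef',)]
```

Whether the second form actually uses the index depends on the engine, collation, and data; on MySQL, `EXPLAIN` on both queries is the way to check.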

Another question to ask yourself: will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.

  parent      varchar(20)   pk   fk to parent table
  birthorder  int           pk
  name        varchar(20)

Then you can have rows like ...

  parent   birthorder   name
  homer    1            bart
  homer    2            lisa
  homer    3            maggie

But, if you tried to insert the fourth row here, like this,

  homer 1 badbart 

you would get a primary key collision because (homer, 1) is already taken. It's probably a good idea to work out how you'll handle primary keys for your subsidiary tables.
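The collision is easy to reproduce. This is a minimal sketch on Python's built-in sqlite3 module rather than MySQL, with the foreign key to the parent table left out to keep it short; MySQL reports the same situation as a duplicate-entry error.

```python
import sqlite3


def duplicate_is_rejected():
    """Insert (homer, 1) twice and report whether the second insert fails."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE child (
            parent     VARCHAR(20) NOT NULL,
            birthorder INT NOT NULL,
            name       VARCHAR(20),
            PRIMARY KEY (parent, birthorder)  -- compound natural key
        )
    """)
    conn.executemany(
        "INSERT INTO child (parent, birthorder, name) VALUES (?, ?, ?)",
        [("homer", 1, "bart"), ("homer", 2, "lisa"), ("homer", 3, "maggie")],
    )
    try:
        conn.execute(
            "INSERT INTO child (parent, birthorder, name) "
            "VALUES ('homer', 1, 'badbart')"
        )
    except sqlite3.IntegrityError:
        return True  # (homer, 1) is already taken
    finally:
        conn.close()
    return False


print(duplicate_is_rejected())  # True
```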

Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
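A quick way to see the difference, using made-up identifiers: text is compared character by character, so '101' sorts before '2' because '1' comes before '2'.

```python
ids = ["2", "101", "33"]

as_text = sorted(ids)              # compared character by character
as_numbers = sorted(ids, key=int)  # compared by numeric value

print(as_text)     # ['101', '2', '33']
print(as_numbers)  # ['2', '33', '101']
```

The same thing happens with `ORDER BY` on a character column holding digits; converting to a number (for example with `CAST`) or zero-padding the strings to a fixed width avoids it.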

+2

The main advantage you get from numeric values is that they are easier to index. Indexing is a process that MySQL uses to make it easier to find a value.

Normally, if you want to find a value in a group, you have to loop through the group looking for it. That is slow, with a worst case of O(n). If instead your data were in a nicely searchable format, such as a binary search tree, the value could be found in O(log n) time, which is faster.
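To make the O(n) versus O(log n) gap concrete, the sketch below counts comparisons for both approaches on one million sorted integers. A sorted list with binary search stands in for the search tree here; it is an illustration of the complexity argument, not of how MySQL stores its indexes (InnoDB uses B-trees). The function names are hypothetical.

```python
def linear_search(values, target):
    """Scan from the start. Returns (index, comparisons); index is -1 if absent."""
    for comparisons, value in enumerate(values, start=1):
        if value == target:
            return comparisons - 1, comparisons
    return -1, len(values)


def binary_search(sorted_values, target):
    """Halve the search range each step. Returns (index, comparisons)."""
    lo, hi, comparisons = 0, len(sorted_values), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_values) and sorted_values[lo] == target
    return (lo if found else -1), comparisons


values = list(range(1_000_000))
print(linear_search(values, 999_999))  # worst case: one comparison per element
print(binary_search(values, 999_999))  # at most 20 halving steps for 1,000,000 items
```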

Indexing is the process MySQL uses to prepare data for searching: it builds search trees and other clever structures that make data faster to find. This speeds up searches considerably. However, to do this, it has to compare the value you are looking for against various key values to determine whether your value is greater or less than each key.

This comparison can be done on non-numeric values. However, comparing non-numeric values is slower. If you want to be able to look up data quickly, it is best to use a numeric "key".

0

A numeric row identifier has many advantages over a string-based identifier. Most of them are mentioned in the other answers.

1. One of them is indexing. Primary keys are indexed by default in a relational database, and a numeric key is generally more efficient to index.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster on numeric keys.
4. The row identifier can serve as a foreign key. Numeric identifiers are compact to store, which makes that efficient.
5. I think using auto-increment for the primary key also has its advantages.

-Thank you _san

0
