Optimized table structure for tag table

Consider these three table structures. Which would be best for the queries shown below?

Structure 1 - TagID as int with a join table

    Article
    -------
    ArticleID int

    Article_Tag
    ------------
    ArticleTagID int
    ArticleID int
    TagID int

    Tag
    ---
    TagID int
    TagText varchar(50)

Structure 2 - tags only in the Join table as a string

    Article
    -------
    articleID int

    Article_Tag
    -----------
    articleTagID int
    articleID int
    tag varchar(50)

Structure 3 - Tag as text with PK

    Article
    -------
    ArticleID int

    Article_Tag
    ------------
    ArticleTagID int
    ArticleID int
    Tag varchar(50)

    Tag
    ---
    Tag varchar(50)

Example queries:

    SELECT a.articleID
    FROM Article a
    INNER JOIN Article_Tag at ON a.articleID = at.articleID AND at.tag = 'apple'

    SELECT tag FROM Tag
    -- or, for structure 2
    SELECT DISTINCT tag FROM Article_Tag
+4
7 answers

It depends on whether you ever want to change the tag text globally. Of course, you could issue a wide UPDATE against Article_Tag, but if you need to do this, being able to simply update the value in Tag will be easier. Some servers offer automatic updates (for example, ON UPDATE CASCADE in SQL Server), but they are not necessarily cheap (the server still has to UPDATE many rows and any indexes).

But if you don't need that, it should be a little faster with the literal in Article_Tag, since it removes a join, and that join would otherwise run many times. Obviously, index it, etc.
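To make the difference in rename cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name Article_Tag2 (standing in for structure 2's join table) and the 1000 sample rows are assumptions made for the illustration, and SQLite is only a stand-in for whatever server you actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structure 1: the tag text is stored once, in Tag
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, TagText VARCHAR(50) UNIQUE);
    CREATE TABLE Article_Tag (ArticleID INT, TagID INT);
    -- Structure 2: the tag text is repeated in the join table
    -- (Article_Tag2 is a hypothetical name so both can coexist here)
    CREATE TABLE Article_Tag2 (ArticleID INT, Tag VARCHAR(50));
""")
conn.execute("INSERT INTO Tag VALUES (1, 'apple')")
conn.executemany("INSERT INTO Article_Tag VALUES (?, 1)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO Article_Tag2 VALUES (?, 'apple')", [(i,) for i in range(1000)])

# Rename the tag in each structure and count the rows that had to change.
rows_s1 = conn.execute("UPDATE Tag SET TagText = 'apples' WHERE TagText = 'apple'").rowcount
rows_s2 = conn.execute("UPDATE Article_Tag2 SET Tag = 'apples' WHERE Tag = 'apple'").rowcount
print(rows_s1, rows_s2)  # 1 1000
```

With the integer key the rename touches one row; with the literal it touches every tagged article, plus any index on the Tag column.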

The extra space required for repeated literal is a factor, but disk space is usually cheaper than a faster server.

As for the primary key: if you have no other data to store, why do you even need the table? You can use DISTINCT on Article_Tag just as easily, especially since Tag should be indexed (so it should be pretty cheap). (Edit: Bill Karwin correctly points out the merit of being able to hold eligible tags, not just the tags currently in use.)
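A small sketch of that distinction, assuming structure 3 and some made-up sample tags ('plum' is a tag that exists but is not attached to any article yet). It uses Python's sqlite3 only so that it can be run as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tag (Tag VARCHAR(50) PRIMARY KEY);
    CREATE TABLE Article_Tag (ArticleID INT, Tag VARCHAR(50));
    CREATE INDEX ix_article_tag_tag ON Article_Tag (Tag);  -- hypothetical index name
    INSERT INTO Tag VALUES ('apple'), ('pear'), ('plum');  -- 'plum' is not used yet
    INSERT INTO Article_Tag VALUES (1, 'apple'), (2, 'apple'), (2, 'pear');
""")

# DISTINCT on the join table only sees tags that are currently attached to an article.
in_use = [r[0] for r in conn.execute("SELECT DISTINCT Tag FROM Article_Tag ORDER BY Tag")]
# The separate Tag table can also list tags that no article uses yet.
eligible = [r[0] for r in conn.execute("SELECT Tag FROM Tag ORDER BY Tag")]

print(in_use)    # ['apple', 'pear']
print(eligible)  # ['apple', 'pear', 'plum']
```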

+5

Using TagText as a primary key will have the advantage that you can get article tags with fewer joins:

    SELECT * FROM Article_Tag WHERE ArticleID = ?

The drawback is that tag strings take up more space than integers, so Article_Tag and its indexes need more storage. This takes up more disk space and also requires more memory to cache the index.
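For comparison, here is a rough sketch of both lookups side by side, run through Python's sqlite3. The name Article_Tag3 for the text-keyed join table, the alias atg, and the sample rows are assumptions made so the two layouts can sit in one database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Integer key (structure 1): the tag text requires a join to Tag
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, TagText VARCHAR(50));
    CREATE TABLE Article_Tag (ArticleID INT, TagID INT);
    INSERT INTO Tag VALUES (1, 'apple'), (2, 'pear');
    INSERT INTO Article_Tag VALUES (10, 1), (10, 2), (11, 1);

    -- Text key (structure 3): the tag text is already in the join table
    CREATE TABLE Article_Tag3 (ArticleID INT, Tag VARCHAR(50));
    INSERT INTO Article_Tag3 VALUES (10, 'apple'), (10, 'pear'), (11, 'apple');
""")

with_join = [r[0] for r in conn.execute(
    "SELECT t.TagText FROM Article_Tag atg "
    "JOIN Tag t ON t.TagID = atg.TagID "
    "WHERE atg.ArticleID = ? ORDER BY t.TagText", (10,))]

no_join = [r[0] for r in conn.execute(
    "SELECT Tag FROM Article_Tag3 WHERE ArticleID = ? ORDER BY Tag", (10,))]

print(with_join)  # ['apple', 'pear']
print(no_join)    # ['apple', 'pear']
```

Both return the same tags; the second just reads them from one table, at the cost of wider rows and wider index entries.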

+4

I would go with 1 every time. It is fully normalized, and since you are using a synthetic PK, you can change a tag's name with a single-row update.

The only advantage of the alternatives is fewer joins. That is an optimization which, as we all know, you should only make after measuring. If you were sure that structure 1 was not fast enough, you would not be asking, right?

There is not much difference between 2 and 3, but, as Bill Karwin notes, 3 has the advantage of allowing cascading updates. Moreover, the additional table costs you nothing.

So, I would say go with 1. If there is a measurable (i.e., measured) performance problem, then 3 will be perfectly acceptable. In any case, it would not be hard to migrate later.
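As an illustration of what a cascading update looks like in structure 3, here is a minimal sketch using Python's sqlite3. SQLite is just a convenient stand-in; the foreign key declaration and the PRAGMA line are specific to this sketch, and other servers spell and enable the feature in their own way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and therefore cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Tag (Tag VARCHAR(50) PRIMARY KEY);
    CREATE TABLE Article_Tag (
        ArticleID INT,
        Tag VARCHAR(50) REFERENCES Tag (Tag) ON UPDATE CASCADE
    );
    INSERT INTO Tag VALUES ('apple');
    INSERT INTO Article_Tag VALUES (1, 'apple'), (2, 'apple');
""")

# One statement against Tag; the server rewrites the referencing rows itself.
conn.execute("UPDATE Tag SET Tag = 'apples' WHERE Tag = 'apple'")
tags = [r[0] for r in conn.execute("SELECT Tag FROM Article_Tag ORDER BY ArticleID")]
print(tags)  # ['apples', 'apples']
```

The statement is short, but the server still rewrites every referencing row, so this is a convenience and not a performance gain.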

+2

Welcome, modesty. Or Lokalugust, or Sean, or whatever you call yourself now. Just keep in mind that there is no Hacker badge any more, so there is nothing to win here :)

0

You should map the TagText to a TagId in code (and cache that mapping in memory anyway) and pass the already-resolved TagId to your query.

There is also no reason why you need a synthetic key on the Article_Tag table. You should use a composite primary key (ArticleId, TagId).

So, I say #1, with the minor tweak mentioned above.
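Roughly, the two suggestions together could look like the sketch below, written against Python's sqlite3. The helper names tag_id and articles_with_tag, the index name, and the use of functools.lru_cache as the in-memory cache are all assumptions for the illustration, not something the schema dictates.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tag (TagID INTEGER PRIMARY KEY, TagText VARCHAR(50) UNIQUE);
    CREATE TABLE Article_Tag (
        ArticleID INT NOT NULL,
        TagID INT NOT NULL,
        PRIMARY KEY (ArticleID, TagID)  -- composite key, no ArticleTagID column
    );
    -- Second index so lookups by tag do not have to scan the whole table
    CREATE INDEX ix_article_tag_tagid ON Article_Tag (TagID, ArticleID);
    INSERT INTO Tag VALUES (1, 'apple'), (2, 'pear');
    INSERT INTO Article_Tag VALUES (10, 1), (10, 2), (11, 1);
""")

@lru_cache(maxsize=None)
def tag_id(tag_text):
    """Resolve TagText to TagID; repeated calls for the same text hit the cache."""
    row = conn.execute("SELECT TagID FROM Tag WHERE TagText = ?", (tag_text,)).fetchone()
    return row[0] if row else None

def articles_with_tag(tag_text):
    """Query Article_Tag by integer key only, with no join to Tag."""
    rows = conn.execute(
        "SELECT ArticleID FROM Article_Tag WHERE TagID = ? ORDER BY ArticleID",
        (tag_id(tag_text),))
    return [r[0] for r in rows]

print(articles_with_tag("apple"))  # [10, 11]
```

The composite key also stops the same tag from being attached to an article twice. A real cache would need some invalidation when tags are added or renamed, which this sketch leaves out.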

0

I would go for Structure 2, perhaps just renaming the table from Article_Tag to Tags.

0

A table with an AUTO_INCREMENT PK will not scale. Forget TagID as INTEGER and replace it with BINARY(16), which is enough to hold the MD5 checksum of TagText.

And with a proper caching layer, your SQL queries will rarely need the TagText column at all.
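A sketch of what that suggestion amounts to, using Python's hashlib and sqlite3. The helper name tag_key is hypothetical, and SQLite has no fixed-width BINARY(16) type, so BLOB stands in for it here.

```python
import hashlib
import sqlite3

def tag_key(tag_text):
    """Return the 16-byte MD5 digest of the tag text, used in place of an integer TagID."""
    return hashlib.md5(tag_text.encode("utf-8")).digest()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- BLOB stands in for BINARY(16)
    CREATE TABLE Tag (TagID BLOB PRIMARY KEY, TagText VARCHAR(50));
    CREATE TABLE Article_Tag (
        ArticleID INT,
        TagID BLOB,
        PRIMARY KEY (ArticleID, TagID)
    );
""")
conn.execute("INSERT INTO Tag VALUES (?, ?)", (tag_key("apple"), "apple"))
conn.execute("INSERT INTO Article_Tag VALUES (?, ?)", (1, tag_key("apple")))

# The key is derived from the text, so no lookup in Tag is needed before querying.
ids = [r[0] for r in conn.execute(
    "SELECT ArticleID FROM Article_Tag WHERE TagID = ?", (tag_key("apple"),))]
print(len(tag_key("apple")), ids)  # 16 [1]
```

Note the trade-offs: a 16-byte key is wider than a 4-byte integer in every Article_Tag row and index entry, and renaming a tag changes its key.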

0
