How to implement a tag system

I was wondering what is the best way to implement a tag system, such as the one used on SO. I thought about it, but I can't come up with a good scalable solution.

I was thinking of a basic three-table solution: a tags table, an articles table and a tag_to_articles table.

Is this the best solution to this problem, or are there alternatives? With this method, the tag_to_articles table will grow very large over time, and I guess searching it will not be very efficient. On the other hand, it is not that important for the query to run quickly.

+72
algorithm system tagging
Nov 27 '09 at 19:35
5 answers

I think you will find this blog post interesting: Tags: Database schemas

Problem: you want a database schema where you can tag a bookmark (or a blog post, or anything else) with as many tags as you want. Later, you want to run queries that restrict the bookmarks to a union or an intersection of tags. You also want to exclude (say: minus) some tags from the search result.

"MySQLicious" solution

In this solution, the schema has only one table; it is denormalized. This type is called the "MySQLicious solution" because MySQLicious imports del.icio.us data into a table with this structure.


Intersection (AND) query for "search+webservice+semweb":

 SELECT * FROM `delicious` WHERE tags LIKE "%search%" AND tags LIKE "%webservice%" AND tags LIKE "%semweb%" 

Union (OR) query for "search|webservice|semweb":

 SELECT * FROM `delicious` WHERE tags LIKE "%search%" OR tags LIKE "%webservice%" OR tags LIKE "%semweb%" 

Minus query for "search+webservice-semweb", that is: search AND webservice AND NOT semweb:

 SELECT * FROM `delicious` WHERE tags LIKE "%search%" AND tags LIKE "%webservice%" AND tags NOT LIKE "%semweb%" 
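The three LIKE queries above can be tried out with a small in-memory database. This is only a sketch: it uses Python's sqlite3 module instead of MySQL, and the column names (id, url, tags) and sample rows are assumptions made for illustration, not the real MySQLicious layout. It also shows one thing to watch for with this schema: LIKE "%search%" matches substrings, so a bookmark tagged "research" is returned as well. Padding the tag string with spaces is one possible workaround, assuming tags are separated by single spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-table layout: all tags live in one space-separated column.
conn.execute("CREATE TABLE delicious (id INTEGER PRIMARY KEY, url TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO delicious (url, tags) VALUES (?, ?)",
    [
        ("a.example", "search webservice semweb"),
        ("b.example", "search webservice"),
        ("c.example", "research"),
    ],
)


def urls(sql):
    """Run a query that selects url and return the urls sorted."""
    return sorted(row[0] for row in conn.execute(sql))


# Intersection: all three tags must appear somewhere in the string.
and_hits = urls(
    "SELECT url FROM delicious WHERE tags LIKE '%search%' "
    "AND tags LIKE '%webservice%' AND tags LIKE '%semweb%'"
)

# Minus: search AND webservice AND NOT semweb.
minus_hits = urls(
    "SELECT url FROM delicious WHERE tags LIKE '%search%' "
    "AND tags LIKE '%webservice%' AND tags NOT LIKE '%semweb%'"
)

# Substring matching: 'research' contains 'search', so c.example matches too.
loose_hits = urls("SELECT url FROM delicious WHERE tags LIKE '%search%'")

# Workaround sketch: pad both sides with spaces so only whole tags match.
exact_hits = urls(
    "SELECT url FROM delicious WHERE ' ' || tags || ' ' LIKE '% search %'"
)

print(and_hits, minus_hits, loose_hits, exact_hits)
```

Either way, a pattern with a leading wildcard generally cannot use an ordinary index, so every query scans the whole table.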



Scuttle Solution

Scuttle organizes its data in two tables. The scCategories table is the tag table and has a foreign key to the bookmark table (scBookmarks).


Intersection (AND) query for "bookmark+webservice+semweb":

 SELECT b.* FROM scBookmarks b, scCategories c WHERE c.bId = b.bId AND (c.category IN ('bookmark', 'webservice', 'semweb')) GROUP BY b.bId HAVING COUNT( b.bId )=3 

First, all bookmark-tag combinations are searched where the tag is "bookmark", "webservice" or "semweb" (c.category IN ('bookmark', 'webservice', 'semweb')); then only the bookmarks that have all three tags are kept (HAVING COUNT(b.bId) = 3). This assumes each bookmark carries a given tag at most once.

Union (OR) query for "bookmark|webservice|semweb": just leave out the HAVING clause and you have a union:

 SELECT b.* FROM scBookmarks b, scCategories c WHERE c.bId = b.bId AND (c.category IN ('bookmark', 'webservice', 'semweb')) GROUP BY b.bId 

Minus (exclusion) query for "bookmark+webservice-semweb", that is: bookmark AND webservice AND NOT semweb.

 SELECT b.* FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND (c.category IN ('bookmark', 'webservice')) AND b.bId NOT IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb') GROUP BY b.bId HAVING COUNT( b.bId ) =2 

Leaving out the HAVING COUNT clause gives you the query for "bookmark|webservice-semweb".
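The three Scuttle-style queries differ only in the HAVING clause and the NOT IN subquery, so they can be generated by one helper. The sketch below runs on an in-memory SQLite database through Python's sqlite3 module. The table and column names follow the queries above, while the title column, the composite primary key, the sample rows and the find function are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scBookmarks (bId INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE scCategories (
    bId      INTEGER NOT NULL REFERENCES scBookmarks(bId),
    category TEXT    NOT NULL,
    -- Assumed constraint: a bookmark has each tag at most once,
    -- which is what makes the HAVING COUNT trick valid.
    PRIMARY KEY (bId, category)
);
""")
conn.executemany("INSERT INTO scBookmarks VALUES (?, ?)",
                 [(1, "one"), (2, "two"), (3, "three")])
conn.executemany("INSERT INTO scCategories VALUES (?, ?)", [
    (1, "bookmark"), (1, "webservice"), (1, "semweb"),
    (2, "bookmark"), (2, "webservice"),
    (3, "semweb"),
])


def find(include, exclude=(), match_all=True):
    """Return bookmark ids tagged with all (or any) of include, minus exclude.

    include must be a non-empty list of distinct tag names.
    """
    marks = ",".join("?" * len(include))
    sql = ("SELECT b.bId FROM scBookmarks b "
           "JOIN scCategories c ON c.bId = b.bId "
           f"WHERE c.category IN ({marks}) ")
    params = list(include)
    if exclude:
        ex_marks = ",".join("?" * len(exclude))
        sql += ("AND b.bId NOT IN (SELECT bId FROM scCategories "
                f"WHERE category IN ({ex_marks})) ")
        params += list(exclude)
    sql += "GROUP BY b.bId "
    if match_all:
        # Intersection: every requested tag must have matched once.
        sql += "HAVING COUNT(*) = ? "
        params.append(len(include))
    sql += "ORDER BY b.bId"
    return [row[0] for row in conn.execute(sql, params)]


all_three = find(["bookmark", "webservice", "semweb"])
any_of = find(["bookmark", "webservice", "semweb"], match_all=False)
minus = find(["bookmark", "webservice"], exclude=["semweb"])
print(all_three, any_of, minus)
```

Only the placeholders are built dynamically; the tag values themselves are passed as bound parameters, so user-supplied tag names never end up in the SQL text.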




Toxi Solution

Toxi came up with a three-table structure. Bookmarks and tags are in an n-to-m relation through the "tagmap" table: each tag can be used with different bookmarks and vice versa. This database schema is also used by WordPress. The queries are much the same as in the Scuttle solution.


Intersection (AND) query for "bookmark+webservice+semweb":

 SELECT b.* FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND (t.name IN ('bookmark', 'webservice', 'semweb')) AND b.id = bt.bookmark_id GROUP BY b.id HAVING COUNT( b.id )=3 

Union (OR) query for "bookmark|webservice|semweb":

 SELECT b.* FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND (t.name IN ('bookmark', 'webservice', 'semweb')) AND b.id = bt.bookmark_id GROUP BY b.id 

Minus (exclusion) query for "bookmark+webservice-semweb", that is: bookmark AND webservice AND NOT semweb.

 SELECT b.* FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND (t.name IN ('bookmark', 'webservice')) AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'semweb') GROUP BY b.id HAVING COUNT( b.id ) =2 

Leaving out the HAVING COUNT clause gives you the query for "bookmark|webservice-semweb".
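As a rough sketch of how the three-table layout might be set up and kept consistent, the example below creates it in an in-memory SQLite database with Python's sqlite3 module. The table and column names follow the queries above. The url column, the UNIQUE and PRIMARY KEY constraints, the extra index and the two helper functions are assumptions for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE tagmap (
    bookmark_id INTEGER NOT NULL REFERENCES bookmark(id),
    tag_id      INTEGER NOT NULL REFERENCES tag(tag_id),
    PRIMARY KEY (bookmark_id, tag_id)
);
-- Assumed index for looking up bookmarks by tag.
CREATE INDEX tagmap_by_tag ON tagmap (tag_id, bookmark_id);
""")


def add_bookmark(url, tag_names):
    """Insert a bookmark and link it to its tags, creating tags as needed."""
    cur = conn.execute("INSERT INTO bookmark (url) VALUES (?)", (url,))
    bookmark_id = cur.lastrowid
    for name in tag_names:
        # Each tag name is stored once; later uses reuse the same row.
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT OR IGNORE INTO tagmap (bookmark_id, tag_id) "
            "SELECT ?, tag_id FROM tag WHERE name = ?",
            (bookmark_id, name),
        )
    return bookmark_id


def urls_with_all(tag_names):
    """Intersection query written with explicit JOINs."""
    marks = ",".join("?" * len(tag_names))
    sql = f"""
        SELECT b.url
        FROM bookmark b
        JOIN tagmap bt ON bt.bookmark_id = b.id
        JOIN tag t ON t.tag_id = bt.tag_id
        WHERE t.name IN ({marks})
        GROUP BY b.id
        HAVING COUNT(*) = ?
        ORDER BY b.url"""
    return [row[0] for row in conn.execute(sql, (*tag_names, len(tag_names)))]


add_bookmark("a.example", ["bookmark", "webservice", "semweb"])
add_bookmark("b.example", ["bookmark", "webservice"])

two_tags = urls_with_all(["bookmark", "webservice"])
three_tags = urls_with_all(["bookmark", "webservice", "semweb"])
tag_rows = conn.execute("SELECT COUNT(*) FROM tag").fetchone()[0]
print(two_tags, three_tags, tag_rows)
```

Compared with the two-table layout, each tag name is stored only once, at the cost of one more join per search.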

+92
Nov 27 '09 at 20:18

There is nothing wrong with a three-table solution.

Another option is to limit the number of tags that can be applied to an article (for example, 5 on SO) and add them directly to the article table.

Normalizing the database has its advantages and disadvantages, just as hard-wiring things into one table has advantages and disadvantages.

Nothing says you can't do both. Repeating information goes against relational database paradigms, but if the goal is performance, you may need to break the paradigms.

+8
Nov 27 '09 at 19:41

Your proposed three-table implementation will work for tagging.

Stack Overflow, however, uses a different implementation. They store the tags in a varchar column of the posts table as plain text and use full-text indexing to fetch the posts that match the tags. For example: posts.tags = "algorithm system tagging best-practices". I'm sure Jeff mentioned it somewhere, but I forgot where.

+5
Nov 28 '09 at 10:27

The proposed solution is the best, if not the only, practical way I can think of to handle the many-to-many relationship between tags and articles. So my vote is "yes, it is still the best." I would be interested in any alternatives, though.

+3
Nov 27 '09 at 19:40

If your database supports indexable arrays (PostgreSQL, for example), I would recommend a completely denormalized solution - store the tags as an array of strings in the same table. If not, then a secondary table mapping objects to tags is the best solution. If you need to store additional information about the tags, you can use a separate tag table, but there is no point in incurring a second join for every tag search.

+1
Nov 27 '09 at 22:52