标签数据库的设计

Tags: Database schemas

Recently, on del.icio.us mailinglist, I asked the question “Does anyone know the database schema of del.icio.us?”. I got a few private responses so I wanted to share the knowledge with the world.

The Problem: You want to have a database schema where you can tag a bookmark (or a blog post or whatever) with as many tags as you want. Later then, you want to run queries to constrain the bookmarks to a union orintersection of tags. You also want to exclude (say: minus) some tags from the search result.

Apparently there are three different solutions (Attention: If you are building a websites that allows users to tag, be sure to have a look at my performance tests as performance seems to be a problem on larger scaled sites.)

“MySQLicious” solution


mysqlicious sample datamysqlicious database stucture

In this solution, the schema has got just one table, it is denormalized.

I named this solution “MySQLicious solution” because MySQLiciousimports del.icio.us data into a table with this structure.

Intersection (AND)

Query for “search+webservice+semweb”:
SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
AND tags LIKE "%webservice%" 
AND tags LIKE "%semweb%"

Union (OR)

Query for “search|webservice|semweb”:

SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
OR tags LIKE "%webservice%" 
OR tags LIKE "%semweb%"

Minus

Query for “search+webservice-semweb”

SELECT * 
FROM `delicious` 
WHERE tags LIKE "%search%" 
AND tags LIKE "%webservice%" 
AND tags NOT LIKE "%semweb%"

Conclusion

The advantages of this solution:


Disadvantages:

  • You have a limit on the number of tags per bookmark. Normally you use a 256byte field in your DB (VARCHAR). Otherwise, if you took a text field or similar, the query times would slow down, I suppose
  • Patrice noticed that LIKE "%search" will also find tags with “websearch”. If you alter the query to LIKE " %search% " you end up having a messy solution: You have to add a space to the beginning of the tags value to make this work.

“Scuttle” solution

Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table. database structure of scuttle

Intersection (AND)

Query for “bookmark+webservice+semweb”:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR)

Query for “bookmark|webservice|semweb”:

Just leave out the HAVING clause and you have union:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

I guess the main advantage of this solution is that it is more normalized than the first solution, and that you can have unlimited number of tags per bookmark.

“Toxi” solution

image

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used bywordpress.

The queries are quite the same as in the “scuttle” solution.

Intersection (AND)

Query for “bookmark+webservice+semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR)

Query for “bookmark|webservice|semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id 
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

The advantages of this solution:

  • You can save extra information on each tag (description, tag hierarchy, …)
  • This is the most normalized solution (that is, if you go for 3NF: take this one :-)

Disadvantages:

  • When altering or deleting bookmarks you can end up with tag-orphans.

If you want to have more complicated queries like (bookmarks OR bookmark) AND (webservice or WS) AND NOT (semweb or semanticweb) the queries tend to become very complicated. In these cases I suggest the following query/computation process:

  1. Run a query for each tag appearing in your “tag-query”: SELECT b.id FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND b.id = bt.bookmark_id AND t.name = "semweb"
  2. Put each id-set from the result into an array (that is: in your favourite coding language). You could cache this arrays if you want..
  3. Constrain the arrays with union or intersection or whatever.

In this way, you can also do queries like (del.icio.us|delicious)+(semweb|semantic_web)-search. This type of queries (that is: the brackets) cannot be done by using the denormalized “MySQLicious solution”. 
This is the most flexible data structure and I guess it should scale pretty good (that is: if you do some caching).

Update May, 2006. This arcticle got quite some attention. I wasn’t really prepared for that! It seems people keep referring to it and even some new sites that allow tagging give credit to my articles. I think the real credit goes to the contributers of the different schemas: MySQLiciousscuttleToxi and to all the contributors of the comments (be sure to read them!)

P.S. Thanks to Toxi for sending me the queries for the three-table-schema, Benjamin Reitzammer for pointing me to a loughing meme article (a good reference for tag queries) and powerlinux for pointing me to scuttle.
Further reading


数据库设计是一个关键的步骤,它涉及到定义数据库结构、表格和关系的过程。在数据库设计中,标签体系是一种常见的方法,用于对数据进行分类和组织。 标签体系基于标签(或称为标记、标志)的概念,它们是用来描述数据的关键词或短语。标签可以在数据模型中作为属性或附加字段存在,并与数据实体相关联。 通过使用标签体系,可以实现对数据的多层次分类和组织。数据可以被分配多个标签,使其能够在不同的分类中出现。这种灵活性使得数据能够被更好地组织和检索,提供更好的查询和分析能力。 在数据库设计中,标签体系可以用于各种应用场景。例如,在博客系统中,可以使用标签对文章进行分类,使用户能够按照感兴趣的主题浏览文章。在电子商务网站中,商品可以被打上不同的标签,以便用户可以根据标签进行搜索和过滤。 要设计一个好的标签体系,需要考虑以下几点: 1. 标签应该具有一定的描述性和代表性,能够准确地描述数据。 2. 标签应该易于理解和使用,避免过于复杂或专业化的术语。 3. 标签应该能够覆盖到数据的不同方面和特征,以便能够在各种场景下使用。 4. 标签应该具有一定的层次结构,可以形成分类和子分类的关系,使数据能够更好地组织。 总之,标签体系在数据库设计中是一个有用的工具,可以帮助组织和分类数据,提供更好的查询和分析功能。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值