How to store URLs in MySQL


I need to store potentially hundreds of millions of URLs in a database. Every URL should be unique, so I will use ON DUPLICATE KEY UPDATE and count the duplicate URLs.

However, I cannot create an index on the URL field because my varchar field is 400 characters. MySQL complains: "#1071 - Specified key was too long; max key length is 767 bytes". (A varchar(400) can take up to 1200 bytes.)
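For context, a minimal sketch of the kind of statement that triggers that error, assuming the table uses the utf8 character set (up to 3 bytes per character, so 400 characters can need 1200 bytes):

-- Fails with #1071 under utf8/InnoDB: 400 chars * 3 bytes = 1200 > 767 bytes.
ALTER TABLE url ADD UNIQUE KEY uq_url_text (url_text);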

What is the best way to do this if you need to process a minimum of 500,000 URLs per day on a single server?

We are already considering MongoDB for the same application, so we could simply query MongoDB to find a duplicate URL and update the row. However, I am not in favor of solving this problem with MongoDB; I'd like to use just MySQL at this stage, as I want to stay as lean as possible in the beginning and finish this section of the project quickly. (We haven't played with MongoDB yet and don't want to spend time on it at this stage.)

Is there any other possibility that uses less resources and time? I was thinking of storing the MD5 hash of the URL as well and making that field UNIQUE instead. I know there will be collisions, but 5-10-20 duplicates among 100 million URLs is acceptable, if that's the only problem.

Do you have any suggestions? I also can't afford 10 seconds to insert a single URL, as the system will process 500k URLs per day.

What would you suggest?

Edit: As requested, here is the table definition. (I am not using MD5 at the moment; it is there for testing.)

mysql> DESC url;
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| Field       | Type                  | Null | Key | Default           | Extra                       |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| url_id      | int(11) unsigned      | NO   | PRI | NULL              | auto_increment              |
| url_text    | varchar(400)          | NO   |     |                   |                             |
| md5         | varchar(32)           | NO   | UNI |                   |                             |
| insert_date | timestamp             | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| count       | mediumint(9) unsigned | NO   |     | 0                 |                             |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
5 rows in set (0.00 sec)
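For reference, against this table the insert-or-count flow the question describes might look like the following sketch (the literal URL is a placeholder; the UNIQUE key on md5 does the dedup, and `count` carries the duplicate counter):

-- Sketch of insert-or-count against the table above; the URL is a placeholder.
INSERT INTO url (url_text, md5)
VALUES ('http://example.com/some/path',
        MD5('http://example.com/some/path'))
ON DUPLICATE KEY UPDATE `count` = `count` + 1;

If index size matters, one known variation is storing the digest as BINARY(16) via UNHEX(MD5(...)), which halves the key width compared to the 32-character hex string; that is a departure from the schema above, not part of the question.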

Solution

According to the DNS spec, the maximum length of a domain name is:

The DNS itself places only one restriction on the particular labels that can be used to identify resource records. That one restriction relates to the length of the label and the full name. The length of any one label is limited to between 1 and 63 octets. A full domain name is limited to 255 octets (including the separators).

255 characters * 3 bytes per character (utf8) = 765 bytes < 767 bytes (just barely :-) )

However, notice that each label can be at most 63 characters long.

So I would suggest chopping the URL into its component parts.

Probably this would be adequate (a sketch schema follows the list):

protocol flag ["http" -> 0] (store "http" as 0, "https" as 1, etc.)
subdomain ["foo"] (255 - 63 = 192 characters; you could subtract 2 more because the minimum TLD is 2 characters)
domain ["example"] (63 characters)
tld ["com"] (4 characters, to handle the "info" TLD)
path ["a/really/long/path"] (as long as you want; store in a separate table)
query parameters ["with=lots&of=query&parameters=that&goes=on&forever&and=ever"] (store in a separate key/value table)
port number / authentication stuff that is rarely used can go in a separate keyed table if actually needed.
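A minimal sketch of what that split schema might look like; every table name, column name, and size below is an illustrative assumption rather than something the answer specifies:

-- Illustrative sketch of the split-URL schema described above;
-- all names and sizes are assumptions for demonstration only.
CREATE TABLE url (
  url_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  protocol  TINYINT UNSIGNED NOT NULL,         -- 0 = http, 1 = https, ...
  subdomain VARCHAR(192) NOT NULL DEFAULT '',  -- 255 - 63 = 192
  domain    VARCHAR(63) NOT NULL,
  tld       VARCHAR(4) NOT NULL,               -- 4 chars covers "info"
  KEY idx_domain (domain, tld)                 -- index only what you search on
);

-- Long, mostly-empty pieces live in side tables keyed by url_id,
-- so the main table stays narrow.
CREATE TABLE url_path (
  url_id INT UNSIGNED NOT NULL PRIMARY KEY,
  path   TEXT NOT NULL
);

CREATE TABLE url_query_param (
  url_id      INT UNSIGNED NOT NULL,
  param_name  VARCHAR(64)  NOT NULL,
  param_value VARCHAR(255) NOT NULL,
  PRIMARY KEY (url_id, param_name)
);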

This gives you some nice advantages:

The index covers only the parts of the URL you need to search on (smaller index!)
Queries can be limited to the various URL parts (find every URL in the facebook domain, for example; see the query sketch below)
Any URL with an over-long subdomain/domain is bogus
Easy to discard query parameters
Easy to do case-insensitive domain name/TLD searching
Discard the syntactic sugar ("://" after the protocol, "." between subdomain/domain and domain/TLD, "/" between TLD and path, "?" before the query, "&" and "=" within the query)

It also avoids the major sparse-table problem. Most URLs will have no query parameters and no long path; if those fields live in separate tables, the main table does not take the size hit, more records fit into memory when running queries, and query performance improves.

(more advantages here).
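For instance, the "find every URL in the facebook domain" lookup from the advantages list might run against the sketch schema like this (again using the assumed column names; with a case-insensitive collation such as utf8_general_ci this also gives the case-insensitive matching mentioned above):

-- Every URL in the facebook domain (uses the assumed sketch schema).
SELECT url_id, protocol, subdomain, domain, tld
FROM url
WHERE domain = 'facebook';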
