URL deduplication


You have a billion URLs, where each points to a huge page. How do you detect the duplicate documents?


SOLUTION

Observations:

1. Pages are huge, so bringing all of them into memory is costly. We need a shorter representation of each page in memory, and a hash is an obvious choice for this.

2. There are a billion URLs, so we don't want to compare every page with every other page (that would be O(n^2)).

Based on these two observations we can derive the following algorithm:

1. Iterate through the pages and compute the hash value of each one.

2. Check whether that hash value is already in the hash table. If it is, throw out the URL as a duplicate. If it is not, keep the URL and insert the value into the hash table.
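A minimal single-machine sketch of this hash-table check, assuming the input arrives as an iterable of (url, page_bytes) pairs; the function name dedupe_urls and the choice of MD5 as the page hash are illustrative, not part of the original solution:

```python
import hashlib

def dedupe_urls(pairs):
    """Yield (url, page) for pages whose content has not been seen before.

    `pairs` is assumed to be an iterable of (url, page_bytes) tuples.
    """
    seen = set()                              # one short digest per unique page
    for url, page in pairs:
        digest = hashlib.md5(page).digest()   # short, fixed-size stand-in for the huge page
        if digest in seen:
            continue                          # hash already present: treat the URL as a duplicate
        seen.add(digest)                      # new page: remember its hash and keep the URL
        yield url, page
```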


This algorithm will give us a list of unique URLs. But wait, can this fit on one computer?

»» How much space does each page take up in the hash table?
»» Each page hashes to a four-byte value.
»» Each URL is an average of 30 characters, so that's another 30 bytes at least.
»» Each URL therefore takes up roughly 34 bytes.
»» 34 bytes * 1 billion is roughly 31.7 gigabytes. We're going to have trouble holding all of that in memory!
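A quick back-of-envelope check of that estimate, using the 4-byte hash and 30-byte average URL length assumed above:

```python
bytes_per_entry = 4 + 30                 # 4-byte page hash + ~30-byte URL
total_bytes = bytes_per_entry * 10**9    # one billion entries
print(total_bytes / 2**30)               # ~31.7 GiB: too much to hold comfortably in RAM
```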


The space requirement is too large, so what do we do?

»» We could split the work into files on disk (sketched below). We would then have to deal with loading and unloading those files.
»» We could hash to disk. Size wouldn't be a problem, but access time might: a hash table on disk would require a random-access read for every check and a write for every stored URL, and each of those could take milliseconds waiting on seek and rotational latency. Elevator algorithms could eliminate the random bouncing from track to track.
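A rough sketch of the split-into-files idea, assuming two passes over the data: the first pass shards (digest, url) records into on-disk bucket files by hash, and the second pass deduplicates each bucket with an in-memory set. NUM_BUCKETS, shard_to_files and dedupe_bucket are illustrative names, and a real implementation would buffer writes instead of reopening a file per record:

```python
import hashlib
import os

NUM_BUCKETS = 1000   # illustrative; chosen so each bucket's set fits in RAM

def bucket_path(digest, out_dir="buckets"):
    """Map a page digest to one of NUM_BUCKETS on-disk bucket files."""
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    return os.path.join(out_dir, f"bucket_{bucket:04d}.txt")

def shard_to_files(pairs, out_dir="buckets"):
    """Pass 1: append 'digest url' lines to the bucket chosen by the page digest.

    Duplicate pages always hash to the same digest and therefore land in the
    same bucket, so each bucket can later be deduplicated independently.
    """
    os.makedirs(out_dir, exist_ok=True)
    for url, page in pairs:
        digest = hashlib.md5(page).digest()
        with open(bucket_path(digest, out_dir), "a") as f:
            f.write(digest.hex() + " " + url + "\n")

def dedupe_bucket(path):
    """Pass 2: within one bucket file, keep the first URL seen for each digest."""
    seen, unique = set(), []
    with open(path) as f:
        for line in f:
            digest, url = line.split(" ", 1)
            if digest not in seen:
                seen.add(digest)
                unique.append(url.strip())
    return unique
```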


Or, we could split this up across machines and deal with network latency instead. Let's go with this solution, and assume we have n machines.

»» First, we hash the document to get a hash value v.
»» v % n tells us which machine this document's hash table can be found on.
»» v / n is the key we use in the hash table on that machine.

My understanding: hash each document's URL to get a value v. Documents with the same URL necessarily get the same v, so they are sent to the same machine, number v % n. On that machine, duplicate URLs share the same v / n value; because of collisions, different URLs on the same machine may also end up with the same v / n value. Each machine builds a hash table with v / n as the key and the URL as the value and deduplicates locally. Finally, the values of each machine's deduplicated in-memory hash table (the URLs) are written to disk, and merging the per-machine URL files completes the deduplication.
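A toy sketch of that routing scheme, simulating the n machines as n in-process dictionaries; NUM_MACHINES, route and insert are illustrative names, and a real system would resolve v // n collisions between different URLs (for example by comparing the stored URL) rather than silently treating them as duplicates:

```python
import hashlib

NUM_MACHINES = 4   # n, illustrative

def route(url):
    """Return (machine_id, key) for a URL using the v % n / v // n scheme."""
    v = int.from_bytes(hashlib.md5(url.encode()).digest()[:4], "big")
    return v % NUM_MACHINES, v // NUM_MACHINES

# Simulate the per-machine hash tables: key = v // n, value = the URL itself.
tables = [dict() for _ in range(NUM_MACHINES)]

def insert(url):
    """Insert a URL; return False if its slot is already occupied on its machine."""
    machine, key = route(url)
    if key in tables[machine]:
        # Either a true duplicate URL or a v // n collision between different URLs;
        # storing the URL as the value lets us tell the two cases apart if needed.
        return False
    tables[machine][key] = url
    return True

urls = ["http://a.example/x", "http://a.example/x", "http://b.example/y"]
unique = [u for u in urls if insert(u)]   # keeps the first occurrence of each URL
```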
