URL deduplication


You have a billion URLs, where each points to a huge page. How do you detect the duplicate documents?


SOLUTION

Observations:

1. Pages are huge, so bringing all of them into memory is costly. We need a shorter representation of each page in memory, and a hash is an obvious choice for this.

2. There are a billion URLs, so we don't want to compare every page with every other page (that would be O(n^2)).

Based on these two observations we can derive the following algorithm:

1. Iterate through the pages and compute the hash value of each one.

2. Check whether that hash value is already in the hash table. If it is, throw out the URL as a duplicate. If it is not, keep the URL and insert the value into the hash table.
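A minimal single-machine sketch of this hash-table check, assuming the input arrives as an iterable of (url, page_bytes) pairs; the function name dedupe_urls and the choice of MD5 as the page hash are illustrative, not part of the original solution:

```python
import hashlib

def dedupe_urls(pairs):
    """Yield (url, page) for pages whose content has not been seen before.

    `pairs` is assumed to be an iterable of (url, page_bytes) tuples.
    """
    seen = set()                              # one short digest per unique page
    for url, page in pairs:
        digest = hashlib.md5(page).digest()   # short, fixed-size stand-in for the huge page
        if digest in seen:
            continue                          # hash already present: treat the URL as a duplicate
        seen.add(digest)                      # new page: remember its hash and keep the URL
        yield url, page
```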


This algorithm will give us a list of unique URLs. But wait, can this fit on one computer?

»» How much space does each page take up in the hash table?
»» Each page hashes to a four-byte value.
»» Each URL is an average of 30 characters, so that's another 30 bytes at least.
»» Each URL therefore takes up roughly 34 bytes.
»» 34 bytes * 1 billion is roughly 31.7 gigabytes. We're going to have trouble holding all of that in memory!
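A quick back-of-envelope check of that estimate, using the 4-byte hash and 30-byte average URL length assumed above:

```python
bytes_per_entry = 4 + 30                 # 4-byte page hash + ~30-byte URL
total_bytes = bytes_per_entry * 10**9    # one billion entries
print(total_bytes / 2**30)               # ~31.7 GiB: too much to hold comfortably in RAM
```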


The space requirement is too large, so what do we do?

»» We could split the work into files on disk (sketched below). We would then have to deal with loading and unloading those files.
»» We could hash to disk. Size wouldn't be a problem, but access time might: a hash table on disk would require a random-access read for every check and a write for every stored URL, and each of those could take milliseconds waiting on seek and rotational latency. Elevator algorithms could eliminate the random bouncing from track to track.
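A rough sketch of the split-into-files idea, assuming two passes over the data: the first pass shards (digest, url) records into on-disk bucket files by hash, and the second pass deduplicates each bucket with an in-memory set. NUM_BUCKETS, shard_to_files and dedupe_bucket are illustrative names, and a real implementation would buffer writes instead of reopening a file per record:

```python
import hashlib
import os

NUM_BUCKETS = 1000   # illustrative; chosen so each bucket's set fits in RAM

def bucket_path(digest, out_dir="buckets"):
    """Map a page digest to one of NUM_BUCKETS on-disk bucket files."""
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    return os.path.join(out_dir, f"bucket_{bucket:04d}.txt")

def shard_to_files(pairs, out_dir="buckets"):
    """Pass 1: append 'digest url' lines to the bucket chosen by the page digest.

    Duplicate pages always hash to the same digest and therefore land in the
    same bucket, so each bucket can later be deduplicated independently.
    """
    os.makedirs(out_dir, exist_ok=True)
    for url, page in pairs:
        digest = hashlib.md5(page).digest()
        with open(bucket_path(digest, out_dir), "a") as f:
            f.write(digest.hex() + " " + url + "\n")

def dedupe_bucket(path):
    """Pass 2: within one bucket file, keep the first URL seen for each digest."""
    seen, unique = set(), []
    with open(path) as f:
        for line in f:
            digest, url = line.split(" ", 1)
            if digest not in seen:
                seen.add(digest)
                unique.append(url.strip())
    return unique
```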


Or, we could split this up across machines and deal with network latency instead. Let's go with this solution, and assume we have n machines.

»» First, we hash the document to get a hash value v.
»» v % n tells us which machine this document's hash table can be found on.
»» v / n is the key we use in the hash table on that machine.

My understanding: hash each document's URL to get a value v. Documents with the same URL necessarily get the same v, so they are sent to the same machine, number v % n. On that machine, duplicate URLs share the same v / n value; because of collisions, different URLs on the same machine may also end up with the same v / n value. Each machine builds a hash table with v / n as the key and the URL as the value and deduplicates locally. Finally, the values of each machine's deduplicated in-memory hash table (the URLs) are written to disk, and merging the per-machine URL files completes the deduplication.
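A toy sketch of that routing scheme, simulating the n machines as n in-process dictionaries; NUM_MACHINES, route and insert are illustrative names, and a real system would resolve v // n collisions between different URLs (for example by comparing the stored URL) rather than silently treating them as duplicates:

```python
import hashlib

NUM_MACHINES = 4   # n, illustrative

def route(url):
    """Return (machine_id, key) for a URL using the v % n / v // n scheme."""
    v = int.from_bytes(hashlib.md5(url.encode()).digest()[:4], "big")
    return v % NUM_MACHINES, v // NUM_MACHINES

# Simulate the per-machine hash tables: key = v // n, value = the URL itself.
tables = [dict() for _ in range(NUM_MACHINES)]

def insert(url):
    """Insert a URL; return False if its slot is already occupied on its machine."""
    machine, key = route(url)
    if key in tables[machine]:
        # Either a true duplicate URL or a v // n collision between different URLs;
        # storing the URL as the value lets us tell the two cases apart if needed.
        return False
    tables[machine][key] = url
    return True

urls = ["http://a.example/x", "http://a.example/x", "http://b.example/y"]
unique = [u for u in urls if insert(u)]   # keeps the first occurrence of each URL
```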
