Search Engines: URL Deduplication in a Scrapy Crawler with the Bloom Filter Algorithm


iteye_7527


Tags: python, crawler

Host environment: Ubuntu 13.04

Python version: 2.7.4

When reposting, please credit: http://blog.yanming8.cn/archives/135

1. Installation

`sudo pip install pybloomfiltermmap`

Alternatively, get the latest source code directly from GitHub, then build and install it:

`sudo python setup.py install`

2. Usage

 
`class pybloomfilter.BloomFilter(capacity : int, error_rate : float, filename : string)`

Create a new BloomFilter object with a given capacity and error_rate. Note that we do not check capacity. This is important, because I want to be able to support logical OR and AND (see below). The capacity and error_rate then together serve as a contract: you add fewer than capacity items, and the Bloom Filter will have an error rate less than error_rate.

NEW: If you specify `None` for the filename, then the bloom filter will be backed by malloc'd memory, rather than by a file.
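
To make the capacity / error_rate contract more concrete, here is a minimal pure-Python sketch of an in-memory Bloom filter used for URL deduplication. It is not the pybloomfilter implementation: the class `SimpleBloomFilter`, the helper `is_new_url`, the SHA-256 double-hashing scheme, and the example capacity and error rate are all assumptions made for illustration. It uses the textbook sizing formulas (m = -n·ln(p) / (ln 2)² bits and k = (m / n)·ln 2 hash functions), which may differ from what the library does internally.

```python
import hashlib
import math


class SimpleBloomFilter(object):
    """Illustrative in-memory Bloom filter (hypothetical, not pybloomfilter)."""

    def __init__(self, capacity, error_rate):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = int(math.ceil(
            -capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(
            1, int(round(self.num_bits / float(capacity) * math.log(2))))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices
        # of a single SHA-256 digest.
        h = int(hashlib.sha256(item.encode("utf-8")).hexdigest(), 16)
        mask64 = (1 << 64) - 1
        h1 = h & mask64
        h2 = ((h >> 64) & mask64) | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        """Set the item's bits; return True if it already appeared present."""
        already_present = True
        for pos in self._positions(item):
            byte_index, bit_mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte_index] & bit_mask:
                already_present = False
                self.bits[byte_index] |= bit_mask
        return already_present

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Example values only: sized for about 100,000 URLs at a 0.1% false-positive rate.
seen = SimpleBloomFilter(capacity=100000, error_rate=0.001)


def is_new_url(url):
    """Return True the first time a URL is seen, False for probable duplicates."""
    return not seen.add(url)
```

In a crawler, a check like `is_new_url(url)` would sit in front of the request scheduler so that only unseen URLs are queued. Because a Bloom filter can report false positives but never false negatives, a small fraction of genuinely new URLs may be skipped, and that fraction grows once more than `capacity` items have been added.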