An Alternative to HashSet When Memory Is the Bottleneck

weixin_34344403

Original link: https://my.oschina.net/zhuguowei/blog/406770

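While crawling recipes, I needed to deduplicate the URLs that had already been crawled. I started with a HashSet, but it used a lot of memory, so I switched to BloomFilter (a utility class in the Google Guava jar).

Below is a comparison of HashSet and BloomFilter in terms of memory footprint and false-positive rate (an element that is not in the set is nevertheless reported as already present).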
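
To make the idea of a false positive concrete, here is a minimal sketch of how a Bloom filter works in general. It is not Guava's implementation: the class name `TinyBloomFilter`, the FNV-1a hash with an extra mixing step, and the double-hashing scheme are all assumptions chosen for illustration. Each element sets a few bits in a bit array; a lookup reports "possibly present" only if all of those bits are set, so unrelated elements that happen to cover the same bits cause a false positive, while an inserted element is never reported as absent.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter sketch (hypothetical class, not Guava's code). */
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TinyBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** 64-bit FNV-1a over the UTF-8 bytes, followed by one extra mixing step. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    /** Derives the i-th bit position from the two 32-bit halves of the hash. */
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Sets the bits that correspond to this element. */
    void put(String s) {
        long h = hash64(s);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h, i));
        }
    }

    /** Returns false if the element was definitely never added, true if it may have been. */
    boolean mightContain(String s) {
        long h = hash64(s);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1 << 16, 5);
        filter.put("123456");
        // An inserted element is always reported as possibly present.
        System.out.println(filter.mightContain("123456"));
        // An element that was never inserted is usually, but not always, reported as absent.
        System.out.println(filter.mightContain("654321"));
    }
}
```

The filter stores only bits, never the strings themselves, which is where the memory saving over HashSet comes from.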

Memory footprint comparison:

Insert 900,000 strings, each made up of six digit characters, into a HashSet and into a BloomFilter.

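// Imports needed by the snippets in this post
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

Set<String> set = new HashSet<>();
for (int i = 100_000; i < 1_000_000; i++)
    set.add("" + i);

// First argument: the funnel that feeds strings into the filter; second: the expected number of insertions; third: the acceptable false-positive rate
BloomFilter<CharSequence> bf = BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()), 1_000_000, 0.001);
for (int i = 100_000; i < 1_000_000; i++)
    bf.put(i + "");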

Measured with a utility class, their memory footprints are:

set memory: 87,588,704 bytes (about 87 MB)
bloom filter memory: 1,797,624 bytes (about 1.8 MB)
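
The Bloom filter figure can be sanity-checked against the textbook sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n expected insertions and target false-positive rate p. The sketch below evaluates them for the parameters used above (n = 1,000,000, p = 0.001). The class and method names are hypothetical, and Guava's internal rounding may differ slightly; the point is only that a bit array of about 1.8 MB is what the theory predicts.

```java
/** Textbook Bloom filter sizing formulas (hypothetical helper, not part of Guava's API). */
final class BloomSizing {
    /** Number of bits for n expected insertions and target false-positive rate p. */
    static long optimalNumBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions for n expected insertions and m bits. */
    static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;   // expected insertions, as passed to BloomFilter.create
        double p = 0.001;     // acceptable false-positive rate
        long m = optimalNumBits(n, p);
        int k = optimalNumHashes(n, m);
        System.out.printf("bits=%d, bytes=%d, hash functions=%d%n", m, m / 8, k);
    }
}
```

The HashSet, by contrast, has to keep every String object plus a hash-table entry per element, so its footprint grows with the size of the stored data rather than with a fixed number of bits per element.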

Next, compare the false-positive rates:

int falseHitCount = 0;
Set<String> set = new HashSet<>();
for (int i = 100_000; i < 1_000_000; i++) {
    if (set.contains(i + "")) // check for presence before inserting into the set
        falseHitCount++;
    set.add(i + "");
}

int falseHitCount = 0;
BloomFilter<CharSequence> bf = BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()), 1_000_000, 0.001);
for (int i = 100_000; i < 1_000_000; i++) {
    if (bf.mightContain(i + "")) // check for presence before inserting into the bloom filter
        falseHitCount++;
    bf.put(i + "");
}

hashset false hit count: 0
bloom filter false hit count: 54

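In other words, HashSet never produces a false positive. The BloomFilter was constructed with a false-positive rate of one in a thousand (0.001) as its third argument, while the rate actually observed was 54 / 900,000, or about 0.006%. The observed rate is lower partly because the configured rate applies to a filter that already holds the expected 1,000,000 elements, whereas here every lookup ran against a filter that was still filling up and never held more than 900,000.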
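
As a rough check on the count of 54, the usual approximation for the false-positive rate of a filter with m bits and k hash functions after i insertions is (1 - e^(-k*i/m))^k. Summing it over the 900,000 lookups in the loop gives the expected number of false hits. The sketch below assumes m = 14,377,588 bits and k = 10, the values the textbook sizing formulas give for 1,000,000 expected insertions at a rate of 0.001; Guava's actual parameters and hashing may differ, so this is an estimate under idealized assumptions, not a reproduction of the test.

```java
/** Idealized estimate of false hits while a Bloom filter fills up (hypothetical helper). */
final class ExpectedFalseHits {
    /** Approximate false-positive rate after `inserted` elements, with m bits and k hashes. */
    static double falsePositiveRate(long inserted, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * inserted / m), k);
    }

    /** Expected false hits when each element is looked up just before it is inserted. */
    static double expectedFalseHits(long inserts, long m, int k) {
        double sum = 0.0;
        for (long i = 0; i < inserts; i++) {
            sum += falsePositiveRate(i, m, k);
        }
        return sum;
    }

    public static void main(String[] args) {
        long m = 14_377_588L; // assumed bit count for n = 1,000,000 and p = 0.001
        int k = 10;           // assumed number of hash functions
        System.out.printf("rate at 1,000,000 elements: %.6f%n", falsePositiveRate(1_000_000, m, k));
        System.out.printf("expected false hits over 900,000 inserts: %.1f%n",
                expectedFalseHits(900_000, m, k));
    }
}
```

Under these assumptions the estimate lands at a few dozen false hits, the same order of magnitude as the 54 observed.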

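Summary:

If the application can tolerate an occasional false positive (for example, a few recipes being skipped because their URLs are wrongly treated as already crawled), consider using BloomFilter in place of HashSet.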

Addendum:

For computing object sizes, see this blog post:

http://blog.csdn.net/xieyuooo/article/details/7068216

Guava Maven coordinates:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>15.0</version>
</dependency>
