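Java duplicate-check utilities: high-performance approximate membership filters in Java (Bloom filters and other duplicate-detection filters)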

Fast Approximate Membership Filters in Java


The following filter types are currently implemented:

Bloom filter: the 'standard' algorithm

Blocked Bloom filter: faster than regular Bloom filters, but needs a bit more space

Counting Bloom filter: allows removing entries, but needs 4 times more space

Succinct counting Bloom filter: about half the space of regular counting Bloom filters; faster lookup but slower add / remove

Succinct counting blocked Bloom filter: same lookup speed as blocked Bloom filter

Cuckoo filter: 8 and 16 bit variants; uses cuckoo hashing to store fingerprints

Cuckoo+ filter: 8 and 16 bit variants; needs a bit less space than regular cuckoo filters

Golomb Compressed Set (GCS): needs less space than cuckoo filters, but lookup is slow

Minimal Perfect Hash filter: needs less space than cuckoo filters, but lookup is slow

Xor filter: 8 and 16 bit variants; needs less space than cuckoo filters, with faster lookup

Xor+ filter: 8 and 16 bit variants; compressed xor filter

Reference: Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, Journal of Experimental Algorithmics (to appear).
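To make the idea of an approximate membership filter concrete, here is a minimal sketch of the 'standard' Bloom filter from the list above. It is not the library's implementation: the class name SimpleBloomSketch, the use of double hashing, and the choice of mixer are assumptions made for illustration only. It shows the two properties all of these filters share: an added key is always reported, and a key that was never added is occasionally reported as well.

```java
import java.util.BitSet;

/** Illustrative Bloom filter over 64-bit keys (hypothetical class, not the library's code). */
final class SimpleBloomSketch {
    private final BitSet bits;
    private final int bitCount;
    private final int hashCount;

    SimpleBloomSketch(int expectedKeys, int bitsPerKey) {
        this.bitCount = Math.max(64, Math.multiplyExact(expectedKeys, bitsPerKey));
        this.bits = new BitSet(bitCount);
        // k = bitsPerKey * ln(2) probes minimizes the false positive rate
        this.hashCount = Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
    }

    /** 64-bit mixer (the MurmurHash3 finalizer), used to spread the key bits. */
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    /** Double hashing: the i-th probe position is (h1 + i * h2) mod bitCount. */
    private int index(long h1, long h2, int i) {
        return (int) Long.remainderUnsigned(h1 + i * h2, bitCount);
    }

    void add(long key) {
        long h1 = mix(key);
        long h2 = mix(h1) | 1;
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    /** Returns false only if the key was definitely never added. */
    boolean mayContain(long key) {
        long h1 = mix(key);
        long h2 = mix(h1) | 1;
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

With 10 bits per key this sketch uses 7 probes per lookup, which in theory gives a false positive rate of roughly 1%. The other filter types in the list trade space, lookup speed, and support for removal differently, but answer the same mayContain-style question.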

Other Xor Filter Implementations

Password Lookup Tool

Included are a tool to build a filter from a list of known passwords (hashes) and a tool to do lookups. That way, the password list can be queried locally without keeping the large original file around. The filter is only 650 MB, compared with 11 GB for the original file, at the cost of some false positives (unknown passwords reported as known, with about 1% probability).

Generate the Password Filter File

Download the latest SHA-1 password file that is ordered by hash, for example the file pwned-passwords-sha1-ordered-by-hash-v4.7z (~10 GB) from https://haveibeenpwned.com/passwords with about 550 million passwords.

If you have enough disk space, you can extract the hash file (~25 GB), and convert it as follows:

mvn clean install

cat hash.txt | java -cp target/fastfilter*.jar org.fastfilter.tools.BuildFilterFile filter.bin

Converting takes about 2-3 minutes (depending on hardware). To save disk space, you can extract the file on the fly (Mac OS X using Keka):

/Applications/Keka.app/Contents/Resources/keka7z e -so pass.7z | java -cp target/fastfilter*.jar org.fastfilter.tools.BuildFilterFile filter.bin

Both will generate a file named filter.bin (~630 MB).

Check Passwords

java -cp target/fastfilter*.jar org.fastfilter.tools.PasswordLookup filter.bin

Enter a password to see if it's in the list. A password that is in the list is always reported, either as "Found" or as "Found; common"; the latter means it was seen 10 or more times. Passwords not in the list show "Not found" with more than 99% probability, and "Found" or "Found; common" with less than 1% probability.
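Because the list contains SHA-1 hashes rather than plain passwords, a lookup has to hash the entered password first. The sketch below shows that step with the standard MessageDigest API. The class and method names are hypothetical, and taking the first 8 bytes of the digest as the 64-bit key is an assumption made for illustration; the source does not say how the tool derives its key.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: hashes a password the way the downloaded list does (SHA-1). */
final class PasswordHashSketch {

    /** Returns the 20-byte SHA-1 digest of the UTF-8 encoded password. */
    static byte[] sha1(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(password.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Upper-case hex form of the digest. */
    static String sha1Hex(String password) {
        StringBuilder sb = new StringBuilder(40);
        for (byte b : sha1(password)) {
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    /** Assumption: use the first 8 digest bytes (big-endian) as the 64-bit filter key. */
    static long key64(String password) {
        return ByteBuffer.wrap(sha1(password)).getLong();
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("password"));
        System.out.println(Long.toHexString(key64("password")));
    }
}
```

Reducing a 160-bit hash to 64 bits adds a tiny chance of collisions between different hashes, which is negligible next to the roughly 1% false positive rate of the filter itself.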

Internally, the tool uses an xor+ filter (see above) with 8 bits per fingerprint. More precisely, 1024 smaller filters (segments) are built, with the segment id being the highest 10 bits of the key. The lowest bit of the key is set to either 0 (regular) or 1 (common), so two lookups are made per password. Because of that, the false positive rate is twice what it would be with just one lookup (0.0078 instead of 0.0039). A regular Bloom filter with the same guarantees would be ~760 MB. For each lookup, one filter segment (less than 1 MB) is read from the file.
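The key layout described above can be sketched as follows. This is not the tool's code: the class PasswordKeySketch, the Segment interface, and the order in which the two lookups are made are assumptions for illustration. Only the bit layout (highest 10 bits select the segment, lowest bit marks "common") and the false positive figures come from the description.

```java
/** Illustrative key layout for the segmented password filter (hypothetical names). */
final class PasswordKeySketch {
    static final int SEGMENT_BITS = 10; // 2^10 = 1024 segments

    /** Stand-in for one per-segment filter; the real tool uses xor+ filters here. */
    interface Segment {
        boolean mayContain(long key);
    }

    /** Segment id: the highest 10 bits of the key, so a value in 0..1023. */
    static int segmentId(long key) {
        return (int) (key >>> (64 - SEGMENT_BITS));
    }

    /** Overwrites the lowest bit of the key: 0 = regular, 1 = common. */
    static long withCommonFlag(long key, boolean common) {
        return (key & ~1L) | (common ? 1L : 0L);
    }

    /** Two lookups in the same segment; checking "common" first is an assumption. */
    static String lookup(long key, Segment[] segments) {
        Segment segment = segments[segmentId(key)];
        if (segment.mayContain(withCommonFlag(key, true))) {
            return "Found; common";
        }
        if (segment.mayContain(withCommonFlag(key, false))) {
            return "Found";
        }
        return "Not found";
    }

    /** Upper bound on the false positive rate: each lookup errs with probability 2^-bits. */
    static double falsePositiveRate(int fingerprintBits, int lookups) {
        return lookups * Math.pow(2, -fingerprintBits);
    }

    public static void main(String[] args) {
        System.out.println(falsePositiveRate(8, 1)); // one lookup, 8-bit fingerprints
        System.out.println(falsePositiveRate(8, 2)); // two lookups per password
    }
}
```

With 8-bit fingerprints a single lookup errs with probability 1/256 ≈ 0.0039, and two lookups give at most 2/256 ≈ 0.0078, matching the figures above. Splitting by the highest key bits also means a file sorted by hash can be processed segment by segment, and a lookup only needs the one segment its key falls into.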
