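The simhash algorithm: deduplicating massive datasets (tens of millions of records)
For the simhash algorithm and how it works, see:
Python implementation:
Using the Python simhash library
(1) Viewing a simhash value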
>>> from simhash import Simhash
>>> print('%x' % Simhash(u'I am very happy'.split()).value)
9f8fd7efdb1ded7f
Simhash() takes a sequence of tokens, also called a feature sequence.
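To make the role of the token sequence concrete, here is a minimal sketch of how a simhash fingerprint can be derived from tokens. It is an illustration only, not the library's code: the function name `simhash_sketch`, the choice of MD5 as the per-token hash, and the equal weight given to every token are all assumptions made for this example.

```python
import hashlib


def simhash_sketch(tokens, bits=64):
    """Illustrative simhash fingerprint (hypothetical helper, not the library API)."""
    counts = [0] * bits
    mask = (1 << bits) - 1
    for token in tokens:
        # Hash the token and keep only the low `bits` bits (assumption: MD5).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & mask
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a cleared bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        # A bit is set in the fingerprint when the votes for it are positive.
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


print('%x' % simhash_sketch(u'I am very happy'.split()))
```

Because each bit is decided by a sum of votes, the result does not depend on token order, and texts that share most of their tokens tend to agree on most bits.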
(2) Computing the distance between two simhash values
>>> hash1 = Simhash(u'I am very happy'.split())
>>> hash2 = Simhash(u'I am very sad'.split())
>>> print(hash1.distance(hash2))
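The distance between two simhash values is normally taken to be the Hamming distance, that is, the number of bit positions in which the two fingerprints differ. The sketch below shows that computation on plain integers; `hamming_distance` is a hypothetical helper written for this example, not a function exported by the library.

```python
def hamming_distance(a, b):
    """Count the bit positions in which integers a and b differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


# 0b1011 and 0b0010 differ in the lowest and the highest of their four bits.
print(hamming_distance(0b1011, 0b0010))
```

A small distance means the two token sequences are probably near-duplicates. A distance close to half the fingerprint width is what unrelated texts would be expected to produce.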
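(3) Building an index
simhash is used for deduplication. Computing the distance between every pair of simhash values does not scale once the dataset is large, so a dedicated data structure is used instead; see: http://www.cnblogs.com/maybe2030/p/5203186.html#_label4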
from simhash import Simhash, SimhashIndex

# Build the index
data = {
    u'1': u'How are you I Am fine . blar blar blar blar blar Thanks .'.lower().split(),
    u'2': u'How are you i am fine .'.lower().split(),
    u'3': u'This is simhash test .'.lower().split(),
}
objs = [(doc_id, Simhash(sent)) for doc_id, sent in data.items()]
# k is the tolerance: the larger k is, the more similar texts are retrieved
index = SimhashIndex(objs, k=10)

# Query for near-duplicates
s1 = Simhash(u'How are you . blar blar blar blar blar Thanks'.lower().split())
print(index.get_near_dups(s1))

# Add a new entry to the index
index.add(u'4', s1)
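The reason an index avoids pairwise comparison can be shown with a simplified sketch of the pigeonhole idea: if two fingerprints differ in at most k bits and the bits are split into k + 1 blocks, at least one block must be identical in both. The class below is an assumption-laden illustration on plain integers, not the library's implementation; the name `BlockIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class BlockIndex:
    """Toy near-duplicate index over integer fingerprints (hypothetical)."""

    def __init__(self, k=3, bits=64):
        self.k = k
        n = k + 1
        # Boundaries of k + 1 contiguous bit blocks covering all `bits` bits.
        self.offsets = [bits * i // n for i in range(n + 1)]
        self.buckets = defaultdict(set)

    def _keys(self, fp):
        # One key per block: (block number, value of that block).
        for i in range(self.k + 1):
            lo, hi = self.offsets[i], self.offsets[i + 1]
            yield (i, (fp >> lo) & ((1 << (hi - lo)) - 1))

    def add(self, obj_id, fp):
        for key in self._keys(fp):
            self.buckets[key].add((obj_id, fp))

    def near_dups(self, fp):
        found = set()
        for key in self._keys(fp):
            # Only fingerprints sharing at least one block are compared.
            for obj_id, other in self.buckets[key]:
                if bin(fp ^ other).count("1") <= self.k:
                    found.add(obj_id)
        return sorted(found)


idx = BlockIndex(k=3)
idx.add("a", 0b1111)
idx.add("b", 0b1110)
idx.add("c", (1 << 64) - 1)
print(idx.near_dups(0b1111))
```

A query only inspects the buckets that share a block with it, so the number of full distance checks depends on bucket sizes rather than on the total number of stored fingerprints. The trade-off is that each fingerprint is stored k + 1 times, and a larger k means smaller blocks and therefore fuller buckets.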