JAVA利用海明距离测算文本相似度

算法介绍:

SimHash也即相似hash,是一类特殊的信息指纹,常用来比较文章的相似度,与传统hash相比,传统hash只负责将原始内容尽量随机的映射为一个特征值,并保证相同的内容一定具有相同的特征值。而且如果两个hash值是相等的,则说明原始数据在一定概率下也是相等的。但通过传统hash来判断文章的内容是否相似是非常困难的,原因在于传统hash只唯一标明了其特殊性,并不能作为相似度比较的依据。

参考文献:

《Similarity estimation techniques from rounding algorithms》 作者:Moses S. Charikar 

文献摘要:

(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, Pr hεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A &Pgr; B|}{|A &Pgr B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for: A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity. A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance ( EMD

  • 1
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值