概述
application
there are many data-mining problems which can be expressed as finding similar sets, such as:
- pages with similar words, e.g., for classification by topic
- recommendation systems, classification by people or movie
- entity resolution, different informs all point to one person
essential parts
shingling
minhashing
jaccard similarity measure
matrix represent
minihash
总体步骤
例子
性质
签名的相似性
implementation
将对矩阵的转置操作用不同的hash函数代替。
LSH
LSH means locality-sensitive hashing.
概述
- general idea: generate from the collection of all elements (signatures in our examples) a small list of candidates pairs : pairs of elements whose similarity must be evaluated.
- for signature matrices: hush columns to many buckets, and make elements of the same bucket candidate pairs.
候选pairs的规则
LSH具体阐述
例子
概率分析
总结
得到更多的signature(但是会有更多的空间占用与计算),可以有更大的b和r,能够获得更step的函数。