Method 1: use the shingling algorithm
Reference:
http://liangqingyu.com/blog/2014/12/03/%E7%BB%86%E8%AF%B4%E5%9E%82%E7%9B%B4%E5%9E%8B%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%EF%BC%88%E5%8D%81%EF%BC%89%E3%80%90%E5%8E%BB%E9%87%8D%E6%A8%A1%E5%9D%97%E4%B9%8Bshingling%E3%80%91.html
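
As a rough illustration of how shingling can be used for near-duplicate detection, the sketch below breaks each text into overlapping character k-grams ("shingles") and compares the two sets with Jaccard similarity. It is not taken from the referenced post: the function names `shingles` and `jaccard`, the character-level granularity, and the default `k=3` are all assumptions made for this example.

```python
def shingles(text, k=3):
    """Return the set of overlapping character k-grams in text.

    Hypothetical helper: character-level shingles with k=3 are an
    assumption for this sketch; word-level shingles are equally common.
    """
    if len(text) < k:
        # Too short to form a full shingle: treat the whole text as one.
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        # Two empty documents are treated as identical here (a convention).
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumped over the lazy dog"
    print(jaccard(shingles(doc1), shingles(doc2)))
```

Two pages would then be treated as duplicates when the similarity exceeds a chosen threshold; the threshold value depends on the data and is not fixed by the technique.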
Method 2: use the simhash algorithm
Reference:
http://liangqingyu.com/blog/2014/12/04/%E7%BB%86%E8%AF%B4%E5%9E%82%E7%9B%B4%E5%9E%8B%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%EF%BC%88%E5%8D%81%E4%B8%80%EF%BC%89%E3%80%90%E5%8E%BB%E9%87%8D%E6%A8%A1%E5%9D%97%E4%B9%8BsimHash%E3%80%91.html