Simhash in Practice

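In a search engine, one important task is detecting similar web pages. This post introduces Simhash, a method for detecting duplicate content that is based on word representations.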

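When comparing pages for similarity, the key concern is efficiency: comparing large numbers of pages against each other tends to put a heavy load on the server.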

  • Simhash combines the advantages of the word-based similarity measures with the efficiency of fingerprints based on hashing.
  • Similarity of two pages as measured by the cosine correlation measure is proportional to the number of bits that are the same in the simhash fingerprints.
More details: Charikar, M. S. Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing (STOC '02), 2002.
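Since similarity is read off the number of matching bits, comparing two fingerprints comes down to an XOR followed by a bit count. A minimal sketch, assuming fingerprints are stored as ulong values; the class and method names are hypothetical and not part of the original implementation:

```csharp
using System;

static class FingerprintCompare
{
    // Number of bit positions in which the two fingerprints differ.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong diff = a ^ b;      // 1 bits mark the positions that differ
        int count = 0;
        while (diff != 0)
        {
            diff &= diff - 1;    // clear the lowest set bit
            count++;
        }
        return count;
    }

    // Number of matching bits among the low `bits` positions.
    // Assumes both fingerprints are zero above that width.
    public static int MatchingBits(ulong a, ulong b, int bits = 64)
    {
        return bits - HammingDistance(a, b);
    }
}
```

With 8-bit fingerprints, for instance, FingerprintCompare.MatchingBits(0xE3, 0xE1, 8) counts how many of the 8 positions agree; the more bits match, the more similar the two pages are taken to be.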

Steps:
  1. Process the document into a set of features with associated weights. We will assume the simple case where the features are words weighted by their frequency.
  2. Generate a hash value with b bits (the desired size of fingerprint) for each word. The hash value should be unique for each word.
  3. In a b-dimensional vector V, update the components of the vector by adding the weight for a word to every component for which the corresponding bit in the word's hash value is 1, and subtracting the weight if the value is 0.
  4. After all words have been processed, generate a b-bit fingerprint by setting the ith bit to 1 if the ith component of V is positive, or 0 otherwise.
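The four steps can be sketched compactly for b = 64. This is only an illustration under stated assumptions: the names SimhashSketch, Fnv1a64 and Fingerprint are hypothetical, words are split on whitespace and lower-cased, and FNV-1a is an assumed choice of hash function. A 64-bit hash makes collisions between words unlikely but does not guarantee the uniqueness asked for in step 2.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SimhashSketch
{
    // Step 2 (assumed hash): 64-bit FNV-1a over the UTF-8 bytes of the word.
    public static ulong Fnv1a64(string word)
    {
        ulong h = 14695981039346656037UL;           // FNV offset basis
        foreach (byte b in Encoding.UTF8.GetBytes(word))
        {
            h ^= b;
            unchecked { h *= 1099511628211UL; }     // FNV prime, wrapping multiply
        }
        return h;
    }

    public static ulong Fingerprint(string text)
    {
        // Step 1: features are words, weighted by their frequency.
        // Punctuation is not stripped here; it stays attached to the words.
        var weights = new Dictionary<string, int>();
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] words = text.ToLowerInvariant()
                             .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            weights.TryGetValue(word, out int count);   // count is 0 if absent
            weights[word] = count + 1;
        }

        // Step 3: add the weight where the hash bit is 1, subtract it where it is 0.
        long[] v = new long[64];
        foreach (KeyValuePair<string, int> entry in weights)
        {
            ulong hash = Fnv1a64(entry.Key);
            for (int i = 0; i < 64; i++)
            {
                bool bitSet = ((hash >> i) & 1UL) == 1UL;
                v[i] += bitSet ? entry.Value : -entry.Value;
            }
        }

        // Step 4: bit i of the fingerprint is 1 only if component i is positive.
        ulong fingerprint = 0;
        for (int i = 0; i < 64; i++)
        {
            if (v[i] > 0) fingerprint |= 1UL << i;
        }
        return fingerprint;
    }
}
```

One consequence that is easy to check by hand: for a text consisting of a single word, every component of V is either +weight or -weight, so the fingerprint equals that word's hash value.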
Example:
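A small worked instance of steps 3 and 4 with b = 8. The three hash values below are made up for illustration (they are not the output of any particular hash function), and the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;

static class SimhashWorkedExample
{
    // Step 3: build the 8-dimensional vector V; v[0] corresponds to the leftmost bit.
    public static int[] Accumulate(IEnumerable<(byte Hash, int Weight)> features)
    {
        int[] v = new int[8];
        foreach (var (hash, weight) in features)
        {
            for (int i = 0; i < 8; i++)
            {
                bool bitSet = ((hash >> (7 - i)) & 1) == 1;
                v[i] += bitSet ? weight : -weight;
            }
        }
        return v;
    }

    // Step 4: a bit is 1 only where the corresponding component of V is positive.
    public static byte ToFingerprint(int[] v)
    {
        byte fingerprint = 0;
        for (int i = 0; i < 8; i++)
        {
            if (v[i] > 0) fingerprint |= (byte)(1 << (7 - i));
        }
        return fingerprint;
    }
}
```

Take "tropical" with weight 2 and assumed hash 01100001, "fish" with weight 2 and assumed hash 10101011, and "include" with weight 1 and assumed hash 11100110. Accumulate then yields V = [1, 1, 5, -5, -1, -3, 1, 3], and ToFingerprint turns the signs into the fingerprint 11100011.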

C# implementation:
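        // Fields used by the methods below; the declarations are inferred from usage.
        // Requires: using System.Collections.Generic;
        private string text;
        private Dictionary<string, int> termFreq = new Dictionary<string, int>();
        private Dictionary<byte, string> hash = new Dictionary<byte, string>();

        // Step 1: split the preprocessed text on spaces and count term frequencies.
        private void calculateTF()
        {
            text = TextProcess.preproccess(text);
            string[] cf = text.Split(' ');
            for (int i = 0; i < cf.Length; i++)
            {
                if ("".Equals(cf[i])) continue;
                if (!termFreq.ContainsKey(cf[i]))
                    termFreq.Add(cf[i], 1);
                else
                    termFreq[cf[i]]++;
            }
        }

        // Step 2: toy 8-bit hash. The character codes are summed modulo 256, then
        // linear probing moves to the next free value so that every word gets a
        // unique hash. This only works for up to 256 distinct words.
        private void hashFunction(string key)
        {
            if (hash.Count >= 256)
                throw new System.InvalidOperationException("An 8-bit hash can only hold 256 distinct words.");

            byte hashkey = 0;
            for (int i = 0; i < key.Length; i++)
            {
                hashkey += (byte)key[i];    // wraps around on overflow
            }
            // Skip values already taken by a different word.
            while (hash.ContainsKey(hashkey) && !hash[hashkey].Equals(key))
            {
                hashkey++;
            }
            // Adding a word that is already registered is a no-op.
            if (!hash.ContainsKey(hashkey))
            {
                hash.Add(hashkey, key);
            }
        }

        // Returns the bit of _Resource at position _Mask (0 = least significant).
        private int getIntegerSomeBit(byte _Resource, int _Mask)
        {
            return (_Resource >> _Mask) & 1;
        }

        // Assigns a hash value to every distinct term.
        private void testHashFunction()
        {
            foreach (string key in termFreq.Keys)
            {
                hashFunction(key);
            }
        }

        // Steps 3 and 4 for b = 8: for each bit position (most significant first),
        // add a word's frequency if its hash has a 1 there and subtract it otherwise,
        // then keep 1 if the total is positive and 0 otherwise.
        private int[] calcSum()
        {
            int[] sums = new int[8];
            for (int i = 0; i < sums.Length; i++)
            {
                int sum = 0;
                foreach (byte hashkey in hash.Keys)
                {
                    int bitval = getIntegerSomeBit(hashkey, 7 - i);
                    if (bitval == 1)
                        sum += termFreq[hash[hashkey]];
                    else
                        sum -= termFreq[hash[hashkey]];
                }

                // Step 4 sets the bit only when the component is strictly positive.
                sums[i] = sum > 0 ? 1 : 0;
            }
            return sums;    // the 8 fingerprint bits, leftmost first
        }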
Input:
Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Tropical fish are popular as aquarium fish, due to their often bright coloration. In freshwater fish, this coloration typically derives from iridescence, while salt water fish are generally pigmented.
Fishkeepers often use the term tropical fish to refer particularly to those requiring fresh water, with saltwater tropical fish referred to as marine fish. Tropical fish kept for home aquaria include the following:
Wild-caught specimens.
Single-species individuals born in captivity. The latter category includes lines selectively bred for special physical features, such as long fins, or particular colorations, such as albino.
Hybrids of more than one species.
Recreational SCUBA divers are often enthusiasts of tropical fish as well. Some keep lists of fish species they have observed while diving, especially in tropical marine environments.
Output:
Term frequencies:
Vector V with weights and V