Deduplicating Articles by Content

Crazy__Hope

Category: Crawlers. Tags: deduplication, crawler, hash

Article link: https://blog.csdn.net/Crazy__Hope/article/details/79053213

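A project at work has recently been scraping the news sections of several sites, and the same story shows up on many of them (most news sites republish or scrape one another's content), so I came up with a deduplication method that works on the content itself.
The rough idea: take the 3 longest sentences in the article (the number is arbitrary and can be changed), hash them, and store the hash in Redis. The function accepts only a list of sentences. Identical content always produces the same hash, so the principle is much like the fingerprint-based deduplication in scrapy-redis.
The implementation is as follows: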

import hashlib


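def sim_hash(content, title=None):
    if not isinstance(content, list):
        # Raising a bare string is a TypeError in Python 3; raise a real exception.
        raise ValueError("Please pass a list object")
    if len(content) <= 3:
        # Too few sentences to choose from: fall back to hashing the title.
        if title is None:
            raise ValueError("list is too short!")
        return hashlib.md5("".join(title).encode("utf-8")).hexdigest()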
    # Drop repeated sentences, keeping first occurrences in their original order.
    unique = list(dict.fromkeys(str(s) for s in content))
    # Take the three longest sentences. sorted() is stable, so sentences of
    # equal length stay in their original order.
    data = sorted(unique, key=len, reverse=True)[:3]
    return hashlib.md5("".join(data).encode("utf-8")).hexdigest()
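
The post describes storing the hash in Redis but does not show that step, so here is a minimal sketch of how the check could look. Everything named here is an assumption for illustration: `InMemorySetStore` is a hypothetical stand-in for a Redis client, `fingerprint` and `is_new_article` are hypothetical helpers, and `news:fingerprints` is a made-up key. The only Redis behavior relied on is that `SADD` returns the number of members actually added.

```python
import hashlib


class InMemorySetStore:
    """Hypothetical stand-in for a Redis client; mimics SADD's return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        members = self._sets.setdefault(name, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added


def fingerprint(sentences, top_n=3):
    # Same idea as sim_hash: MD5 over the top_n longest distinct sentences.
    unique = list(dict.fromkeys(str(s) for s in sentences))
    longest = sorted(unique, key=len, reverse=True)[:top_n]
    return hashlib.md5("".join(longest).encode("utf-8")).hexdigest()


def is_new_article(store, sentences, key="news:fingerprints"):
    # SADD reports how many members were newly added, so 1 means "not seen before".
    return store.sadd(key, fingerprint(sentences)) == 1


store = InMemorySetStore()
article = [
    "Short.",
    "A much longer sentence about the event.",
    "Another fairly long sentence here.",
    "Mid-length sentence.",
    "Tiny.",
]
first_seen = is_new_article(store, article)
seen_again = is_new_article(store, list(reversed(article)))
```

With a real deployment, a redis-py client could presumably be passed in place of `InMemorySetStore`, since it exposes an `sadd` method with the same return convention. Note that this is exact matching: changing a single character in one of the longest sentences yields a different hash, so lightly edited reprints would not be caught.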