Downloading News with a Python Crawler and Aggregating It: Build Your Own Toutiao (Part 1)

Jagua

Last modified 2024-11-11 11:15:17

Tags: python, web crawler, Chinese word segmentation

First published 2024-11-11 11:11:46

Article link: https://blog.csdn.net/Jagua/article/details/143671357

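To keep up with the news, I installed a pile of news apps on my phone.

That turned out to be inconvenient. Some apps I opened less and less; other times a headline or a picture caught my eye, I forgot to read the real news, and the app then kept pushing more of the same kind of story.

In the wake of Toutiao, news apps have all moved to algorithmic feeds that push stories based on the user's interests. One day I got fed up and wondered whether I could build something myself that lets me catch up on the news quickly.

So I got started. The idea is not complicated, although it is not trivial to get right either. It breaks down into three steps: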

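Prepare some seed links: the common news sites;
Crawl those links and collect the title, summary, and full text of each story;
Aggregate the results and display them page by page.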
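To make the third step concrete, here is a minimal sketch of aggregating crawled items and slicing them into pages. The `NewsItem` fields, the de-duplication by URL, and the function names `aggregate` and `paginate` are my own assumptions for illustration, not code from the actual project.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class NewsItem:
    """One aggregated article (field names are illustrative)."""
    title: str
    summary: str
    url: str


def aggregate(batches):
    """Merge per-site result lists, dropping items whose URL was already seen."""
    seen, merged = set(), []
    for batch in batches:
        for item in batch:
            if item.url not in seen:
                seen.add(item.url)
                merged.append(item)
    return merged


def paginate(items, page, per_page=10):
    """Return (items on the 1-based page, total number of pages)."""
    total_pages = max(1, ceil(len(items) / per_page))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * per_page
    return items[start:start + per_page], total_pages
```

A web view would call `paginate(all_items, page)` with the page number taken from the request and render the returned slice together with the page count.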

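First, have a look at the result: compact version (link), regular version (link). If it looks OK to you, read on; if not, feel free to skip the rest.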

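Crawler techniques

1. Download the page source. For simple sites use eTree; for everything else use Selenium + Firefox. There are plenty of Selenium/Firefox installation tutorials on CSDN, so I will skip that here. Once everything is installed, fetch the page source: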

import datetime

from selenium import webdriver
from selenium.webdriver.firefox.service import Service


def fun_initFirefox(url):
    """Open url in headless Firefox; return the driver, or None if loading fails."""
    start = datetime.datetime.now()
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')

    ### Memory optimizations ###
    # Preferences are set on the options object so they are applied to the browser.
    options.set_preference("permissions.default.image", 2)  # block image downloads; use as needed
    # Disable browser caches
    options.set_preference("network.http.use-cache", False)
    options.set_preference("browser.cache.memory.enable", False)
    options.set_preference("browser.cache.disk.enable", False)
    options.set_preference("browser.sessionhistory.max_total_viewers", 3)
    options.set_preference("network.dns.disableIPv6", True)
    options.set_preference("content.notify.interval", 750000)
    options.set_preference("content.notify.backoffcount", 3)

    # Selenium 4 style; on Selenium 3, pass executable_path=... to webdriver.Firefox instead.
    service = Service(executable_path='/usr/local/bin/geckodriver')
    browser = webdriver.Firefox(service=service, options=options)
    browser.implicitly_wait(10)
    try:
        browser.get(url)
    except Exception:
        # print("wrong url", url)
        browser.quit()
        return None

    end = datetime.datetime.now()
    print('\n', "Firefox cost time:", (end - start).seconds, "s")

    return browser

browser = fun_initFirefox(url)

page_source = browser.page_source

browser.quit()  # Note: Selenium uses a lot of memory, so quit as soon as you are done.
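Since forgetting `quit()` leaks a whole browser process, one option is to wrap the driver in a context manager so that it is always closed, even when parsing raises. This is only a sketch: `managed_browser` and `fetch_page_source` are hypothetical helper names, and I assume the opener (such as `fun_initFirefox` above) returns a falsy value when the page fails to load.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(open_browser, url):
    """Yield a driver from open_browser(url) and always quit it afterwards."""
    browser = open_browser(url)
    if not browser:
        # Assumption: the opener signals failure with a falsy return value.
        raise RuntimeError(f"could not load {url}")
    try:
        yield browser
    finally:
        browser.quit()  # release the browser process even if the body raised


def fetch_page_source(open_browser, url):
    """Return the rendered HTML of url, closing the browser before returning."""
    with managed_browser(open_browser, url) as browser:
        return browser.page_source
```

With this in place, the three lines above collapse to `page_source = fetch_page_source(fun_initFirefox, url)`.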

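2. Scrape the news

Every site is structured differently, so you have to study the page source carefully and find the pattern before you can extract the stories. For well-structured sites, browser.find_element_by_xpath works (that is the Selenium 3 name; in Selenium 4 the equivalent is browser.find_element(By.XPATH, ...)); for messier ones, use regular expressions.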
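As a rough illustration of the regular-expression route, the sketch below pulls headline-like links out of a page source string. The pattern, the `extract_headlines` name, and the `min_title_len` cutoff used to filter out navigation links are assumptions of mine; a real site will usually need its own, tighter pattern.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Matches <a ... href="...">text</a>; the anchor text may contain nested tags.
LINK_RE = re.compile(
    r'<a\s[^>]*?href=["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r'<[^>]+>')


def extract_headlines(page_source, base_url, min_title_len=8):
    """Return (title, absolute_url) pairs for links whose text looks like a headline."""
    results, seen = [], set()
    for href, inner in LINK_RE.findall(page_source):
        # Strip nested tags, decode entities, and collapse whitespace.
        title = unescape(TAG_RE.sub('', inner)).strip()
        title = re.sub(r'\s+', ' ', title)
        url = urljoin(base_url, href)
        # Short anchor texts are most likely menu items, not headlines.
        if len(title) >= min_title_len and url not in seen:
            seen.add(url)
            results.append((title, url))
    return results
```

Calling `extract_headlines(page_source, url)` on the source fetched in step 1 gives a list of candidate stories to download in full.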

This part is fairly involved, so I will cover it in detail in a later post. Remember to follow.

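3. Generate summaries

My goal is to skim the news quickly and open the original only when something interests me. Many stories come without a summary, so I have to generate one myself. The steps:

Word segmentation: use jieba;
Sentence splitting: split on sentence-ending punctuation (periods, semicolons, question marks, exclamation marks), falling back to commas;
Count word frequencies and score each sentence by the frequencies of the words it contains;
Rank the sentences by score and build the summary from the highest-scoring ones (the target is about 250 characters; the code keeps the sentences with the five highest scores). The code is as follows: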

# Rank the sentences in a {sentence: score} dict
def key_sens(dicts, count):
    """Return sentences whose score is among the `count` highest distinct scores, best first."""
    top_scores = sorted(set(dicts.values()), reverse=True)[:count]

    sens = []
    for score in top_scores:
        # Sentences that tie on score come out in reverse lexicographic order
        for sentence in sorted(dicts, reverse=True):
            if dicts[sentence] == score and len(sentence) > 5:
                sens.append(sentence)

    return sens

# Input: text, the full article (str). Output: summary sentences (list).
def gen_summ2(text):

    import re
    import jieba

    # Short articles are used as their own summary
    if len(text) < 250:return [text]

    # Split on sentence-ending punctuation; fall back to commas if nothing was split
    ss = re.split(r'[。！!？?；]', str(text))
    if len(ss) < 2: ss = re.split('，', str(text))
    sentences = [i for i in ss if len(i) > 5]

    # get_stop is a helper not shown in this article; it should return the stop words
    stopwords = get_stop(2)
    words = jieba.cut(text)
    # Word frequencies
    word2count = {}
    for word in words:
        if word in stopwords:continue
        word2count[word] = word2count.get(word, 0) + 1

    # Normalize by the highest frequency, computed once before any value changes
    max_count = max(word2count.values(), default=1)
    for key in word2count:
        word2count[key] = word2count[key] / max_count

    # Score each sentence as the sum of its words' normalized frequencies
    sent2score = {}
    for sentence in sentences:
        if len(sentence) >= 300:continue  # skip overlong "sentences"
        for word in jieba.cut(sentence):
            if word in word2count:
                sent2score[sentence] = sent2score.get(sentence, 0) + word2count[word]

    # Rank the sentences and keep those with the five highest scores
    summ = key_sens(sent2score,5)

    return summ
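To see the scoring idea in isolation, here is a dependency-free sketch of the same frequency-based approach. It is not the code above: `summarize` and `simple_tokenize` are hypothetical names, the default tokenizer only works for space-delimited text (for Chinese you would pass something like `jieba.lcut`), and unlike `gen_summ2` it returns the top sentences in their original reading order.

```python
import re
from collections import Counter


def simple_tokenize(sentence):
    """Lowercased word tokens; a stand-in for jieba on space-delimited text."""
    return re.findall(r"\w+", sentence.lower())


def summarize(text, tokenize=simple_tokenize, stopwords=frozenset(), top_n=2):
    """Score sentences by normalized word frequency; return the top_n in reading order."""
    parts = [s.strip() for s in re.split(r"[。！!？?；;.]", text)]
    sentences = list(dict.fromkeys(s for s in parts if s))  # drop blanks and duplicates
    tokens = {s: [w for w in tokenize(s) if w not in stopwords] for s in sentences}
    counts = Counter(w for words in tokens.values() for w in words)
    if not counts:
        return sentences[:top_n]
    max_count = max(counts.values())  # computed once, before normalizing
    scores = {s: sum(counts[w] / max_count for w in words) for s, words in tokens.items()}
    best = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return [s for s in sentences if s in best]
```

Because a sentence's score is a sum over its words, longer sentences tend to win; the 300-character cap in `gen_summ2` is one way to keep that in check.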

OK, we now have the news titles, summaries, and full text. The next post covers displaying the news with Flask, so stay tuned.