News Text Mining (Text4NLPKG): NLP Text Topic Mining

Article link: https://blog.csdn.net/qq_44503997/article/details/128682033

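1. Introduction

In the image crawler chapter, we discussed how to collect image data from several major search engines; we recommend reading and working through that earlier post on image crawling and web page parsing first. This time, we discuss how to extract text data as a real-world data source for NLP (natural language processing) and KG (knowledge graphs).

2. Code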

import requests
import json
import time
from lxml import etree
from pprint import pprint
from newspaper import Article

class NanFangWangSpider(object):
    def __init__(self, choosed_theme):
        """
        Sections of the site and how each one is organised:
            要闻 (headlines): 100 pages
            党建 (Party building): several columns, each paginated
            广东 (Guangdong): 50 pages
            中国 (China): 50 pages
            国际 (international): 50 pages
            经济 (economy): several columns, each paginated
            直播 (live): several columns, each paginated
        :param choosed_theme: one of the section names listed above
        """
        self.choice = choosed_theme
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
            "cookie": "Hm_lvt_fcda14e8d9fc166be9cf6caef393ad0e=1673622568; wdcid=5e0e9eb472d9d315; wdses=2d7d89c6cfc148e5; southcncms_session=LhP9G6wO6V4ArF4gQX2i0gVOzmpscMPcI91yPyqB; southcncmssite_session=qBFC6YYtnxgMkZEuh2hxaGB6x5Funi9SXEcYP3Rj; wdlast=1673624787; Hm_lpvt_fcda14e8d9fc166be9cf6caef393ad0e=1673624788",
            "referer": "https://news.southcn.com/"
        }

        # Landing page of each section
        base_urls = {
            "要闻": "https://www.southcn.com/node_b5769d65fb",
            "党建": "https://gddj.southcn.com/",
            "广东": "https://news.southcn.com/node_54a44f01a2",
            "中国": "https://news.southcn.com/node_179d29f1ce",
            "国际": "https://news.southcn.com/node_742c1a40f1",
            "经济": "https://economy.southcn.com/",
            "直播": "https://live.southcn.com/",
        }
        if self.choice not in base_urls:
            # Fail early: without a base URL no later request can work.
            valid = "、".join(base_urls)
            raise ValueError(f"{self.choice} is not correct, please choose one of {valid}")
        self.base_url = base_urls[self.choice]


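    def getTopicUrl(self):
        """Collect the sub-module URLs from the section landing page."""
        response = requests.get(self.base_url, headers=self.headers)
        response.encoding = 'utf-8'
        html_text = response.text
        html = etree.HTML(html_text)
        # Each sub-module entry is a <div> holding a PNG <img> and an <a> link.
        url_list = list(set(html.xpath('//div[img[contains(@src,"png")]]/a/@href')))
        # getTextUrl() iterates over self.topic_url, so store the result on the instance.
        self.topic_url = url_list
        return url_list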

    def getTextUrl(self):
        """Walk every sub-module and collect one {title: url} dict per list page."""
        topic = {}
        news = []
        for url in self.topic_url:
            # First page of the sub-module
            # time.sleep(10)
            response = requests.get(url, headers=self.headers)
            response.encoding = 'utf-8'
            print(f"Fetching first page {url} ...")
            html_text = response.text
            # print(html_text)
            html = etree.HTML(html_text)
            topic_name = html.xpath('/html//title/text()')[0]
            page_num = html.xpath('/html//a[contains(@href,"page") and contains(@class,"page")]/text()')
            # The code assumes the second-to-last pagination link is the last page
            # number; fall back to one page when there is no pagination bar.
            page = int(page_num[-2]) if len(page_num) >= 2 else 1
            # Note: "topic" and "page" are overwritten on every iteration, so they
            # end up describing the last sub-module visited.
            topic["topic"] = topic_name
            print(topic)
            topic["page"] = page
            url_li = html.xpath('//div[contains(@class,"pw")]/h3/a/@href')
            names = html.xpath('//div[contains(@class,"pw")]/h3/a/text()')
            news.append(dict(zip(names, url_li)))
            # Fetch the remaining pages
            for i in range(2, page + 1):
                # time.sleep(10)
                total_url = url + f"?cms_node_post_list_page={i}"
                print(f"Fetching {total_url} ...")
                response2 = requests.get(total_url, headers=self.headers)
                response2.encoding = 'utf-8'
                html_text2 = response2.text
                # print(html_text2)
                html2 = etree.HTML(html_text2)
                # Query the page just downloaded (html2), not the first page (html).
                url_li = html2.xpath('//div[contains(@class,"pw")]/h3/a/@href')
                names = html2.xpath('//div[contains(@class,"pw")]/h3/a/text()')
                news.append(dict(zip(names, url_li)))
        topic["news"] = news
        return topic

    def Save2Json(self, content):
        json_str = json.dumps(content, ensure_ascii=False)
        # Write to the same path that getTextContent() reads, encoded as UTF-8.
        with open("./data/content.json", "w", encoding="utf-8") as json_file:
            json_file.write(json_str)
            print("Save json file successfully")

    def Save2Txt(self,filename,data):
        with open(f"./data/{filename}.txt","a+",encoding="utf-8") as f:
            f.write(data)

    def getTextContent(self):
        with open("./data/content.json", "r", encoding="utf-8") as load_f:
            content = json.load(load_f)
        pprint(content)

        # Structure: {"news": [{10 entries}, {...}]}, one dict per list page
        text = content["news"]
        for dicts in text:
            print(dicts)
            print(len(dicts))
            for key in dicts:
                url = dicts[key]
                news = Article(url, language='zh')
                news.download()  # download the page
                news.parse()  # parse the page
                print(f"Fetching content of {url} ...")
                print('Title:\n', news.title)
                # print('Body:\n', news.text)
                write_data = f"{news.title}" + "\n"
                total_data = write_data + f"{news.text}"
                file_name = self.choice + "_" + content["topic"] + "_" + key
                self.Save2Txt(file_name, total_data)
                print("Saved as a txt file")
                



if __name__ == '__main__':
    choosed_theme="经济"
    newSpider=NanFangWangSpider(choosed_theme)
    res=newSpider.getTopicUrl()
    print(res)
    print(len(res))
    re1=newSpider.getTextUrl()
    print(re1)
    newSpider.Save2Json(re1)
    newSpider.getTextContent()

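The code above extracts economic news text from the 经济 (economy) section of 南方网; the other sections work in much the same way. In brief, newSpider is an instance of the NanFangWangSpider class, which has five methods.

The site is laid out as follows:

On 南方网, choose the 经济 (economy) module;
The economy module has five sub-modules: 经济新闻 (economic news), 粤港澳大湾区 (Guangdong-Hong Kong-Macao Greater Bay Area), 产业动态 (industry updates), 科创专区 (sci-tech innovation zone), and 融媒报到 (converged-media reports);
Clicking a sub-module brings up its news list, which gives the sub-module name, the news titles, and the news URLs;
Clicking a news item brings up the text of that article;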

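We design the crawler logic around this layout:

Collect the list of news URLs for each sub-module;
Save the list as a JSON file;
Read the URLs back from the JSON file and fetch the text of each news article.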
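
The three steps above hinge on the shape of the intermediate JSON file. As a minimal sketch, assuming the same layout that getTextUrl returns (a "news" list holding one {title: url} dict per list page), the save and reload steps could look like this. The helper names save_index and load_article_links, the example.com URLs, and the titles are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the crawl index to disk as UTF-8 JSON, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles readable in the file
        json.dump(index, f, ensure_ascii=False, indent=2)


def load_article_links(path):
    """Read the index back and flatten it into (title, url) pairs."""
    with open(path, "r", encoding="utf-8") as f:
        index = json.load(f)
    return [(title, url) for page in index["news"] for title, url in page.items()]


if __name__ == "__main__":
    # Made-up index with two list pages, for illustration only
    sample_index = {
        "topic": "sample topic",
        "page": 2,
        "news": [
            {"first title": "https://example.com/a.shtml"},
            {"second title": "https://example.com/b.shtml"},
        ],
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        json_path = os.path.join(tmp_dir, "data", "content.json")
        save_index(sample_index, json_path)
        for title, url in load_article_links(json_path):
            print(title, url)
```

Keeping the URL list on disk means the slow article download step can be rerun without crawling the list pages again.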

3. Walking through the parsing

The previous post described how to parse a website in detail. Here is a brief recap.

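3.1 Parsing the landing page

[Figure: landing page parsing]

Open Google Chrome and click the 经济 module to see the page above. Right-click and choose Inspect to view the page source. As in the previous post, inspect the button after 经济新闻 to find the following code.
[Figure: page response]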

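Clicking through the other sub-modules reveals a pattern: each div tag has two child tags, an img and an a.
[Figures: markup of sub-modules 1 to 5]

Given this pattern, we use XPath to match these elements and extract the list of sub-module URLs: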

url_list = list(set(html.xpath('//div[img[contains(@src,"png")]]/a/@href')))

The resulting URLs for the five sub-modules:

['https://economy.southcn.com/node_71505a4d28', 'https://economy.southcn.com/node_52de6dd46e', 'https://economy.southcn.com/node_b416e11004', 'https://economy.southcn.com/node_d5a62a3867', 'https://economy.southcn.com/node_17f55ea8c0']
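
To see why this XPath picks only the sub-module links, here is a small sketch that runs the same expression on an inline HTML snippet. The snippet and its example.com URLs are made up for illustration and only mimic the div/img/a pattern described above; the helper name extract_submodule_urls is hypothetical.

```python
from lxml import etree

# Made-up markup: two PNG-icon entries, one JPG banner, one div without an image
SAMPLE_HTML = """
<html><body>
  <div><img src="/static/icon1.png"/><a href="https://example.com/node_a">more</a></div>
  <div><img src="/static/icon2.png"/><a href="https://example.com/node_b">more</a></div>
  <div><img src="/static/banner.jpg"/><a href="https://example.com/ad">ad</a></div>
  <div><a href="https://example.com/plain">no image</a></div>
</body></html>
"""


def extract_submodule_urls(html_text):
    """Return the href of every <a> whose parent <div> also has a PNG <img> child."""
    tree = etree.HTML(html_text)
    hrefs = tree.xpath('//div[img[contains(@src,"png")]]/a/@href')
    # dict.fromkeys removes duplicates like set() does, but keeps page order
    return list(dict.fromkeys(hrefs))


if __name__ == "__main__":
    print(extract_submodule_urls(SAMPLE_HTML))
```

The predicate div[img[...]] filters on the parent div, so links in divs without a PNG image are skipped. Unlike list(set(...)), this variant returns the URLs in a stable order from run to run.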

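3.2 Parsing a sub-module

Open any sub-module, for example the news under 产业动态 (industry updates).
[Figure: sub-module page]

This page lists the industry news but does not yet show the text of any article, so we first need the URL of every news item. Clicking "next page" at the bottom reveals the pattern: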

# First page
https://economy.southcn.com/node_b416e11004
# Second page
https://economy.southcn.com/node_b416e11004?cms_node_post_list_page=2
# Third page
https://economy.southcn.com/node_b416e11004?cms_node_post_list_page=3
...
# Tenth page
https://economy.southcn.com/node_b416e11004?cms_node_post_list_page=10

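The pattern is as follows: the first page is https://economy.southcn.com/node_b416e11004, and each later page appends ?cms_node_post_list_page=N to that address, where N is the page number. This gives the URL of every page in the sub-module.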
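
As a minimal sketch of this rule, the function below builds the full list of page URLs from the first-page address and a page count. The name build_page_urls is hypothetical; the query parameter is the one observed above.

```python
def build_page_urls(first_page_url, page_count):
    """Return the URLs of pages 1..page_count of one sub-module."""
    urls = [first_page_url]  # page 1 has no query string
    for page in range(2, page_count + 1):
        urls.append(f"{first_page_url}?cms_node_post_list_page={page}")
    return urls


if __name__ == "__main__":
    for page_url in build_page_urls("https://economy.southcn.com/node_b416e11004", 3):
        print(page_url)
```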

To do this, we need the number of pages in the sub-module. The page shows 10. To read the page count dynamically, we again use XPath.

The XPath match is:

page_num=html1.xpath('/html//a[contains(@href,"page") and contains(@class,"page")]/text()')
print(page_num)
page=int(page_num[-2])
print(page)

The result:
[Figure: page count obtained]
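
The page_num[-2] lookup relies on the second-to-last pagination link being the last page number, and it raises IndexError when a sub-module has no pagination bar. A more defensive sketch, assuming the link texts are either page numbers or labels such as 下一页 (the sample lists in the usage lines are made up), takes the largest number found and falls back to one page. The name parse_page_count is hypothetical.

```python
def parse_page_count(link_texts):
    """Return the largest page number among pagination link texts, or 1 if none."""
    numbers = []
    for text in link_texts:
        text = text.strip()
        if text.isdecimal():  # skip labels such as "next page"
            numbers.append(int(text))
    return max(numbers, default=1)


if __name__ == "__main__":
    print(parse_page_count(["1", "2", "3", "10", "下一页"]))
    print(parse_page_count([]))
```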

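3.3 Getting the news URLs on one page of a sub-module

Inspecting the source of several news entries shows the following pattern:
[Figure: URL and title of each news entry]
[Figure: a specific URL and title]
The XPath matching code is: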

url=html1.xpath('//div[contains(@class,"pw")]/h3/a/@href')
name=html1.xpath('//div[contains(@class,"pw")]/h3/a/text()')
print(url)
print(name)
print(len(name),len(url))
print(dict(zip(name,url)))

[Figure: resulting URLs and titles]
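
One caveat with dict(zip(name, url)): if two entries share the same title, only the last URL survives. A sketch that keeps every entry as its own record is shown below. It also resolves links against the page URL with urljoin, which only matters if the site returns relative hrefs; that is an assumption, not something confirmed here. The name pair_titles_with_urls and the second sample href are hypothetical.

```python
from urllib.parse import urljoin


def pair_titles_with_urls(page_url, titles, hrefs):
    """Return one {"title", "url"} record per entry, with absolute URLs."""
    records = []
    for title, href in zip(titles, hrefs):
        records.append({
            "title": title.strip(),
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            "url": urljoin(page_url, href),
        })
    return records


if __name__ == "__main__":
    page = "https://economy.southcn.com/node_b416e11004"
    sample = pair_titles_with_urls(
        page,
        ["same title", "same title"],
        ["https://economy.southcn.com/node_b416e11004/c8922776ff.shtml", "/node_x/abc.shtml"],
    )
    for record in sample:
        print(record)
```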

3.4 Getting the text content of a single URL

Section 3.3 gave us every URL under 南方网 > 经济 > 产业动态. Open one of them, for example the following page:

'深圳MiniLED团体标准上榜工信部“2022年团体标准应用示范项目名单”': 'https://economy.southcn.com/node_b416e11004/c8922776ff.shtml'

[Figure: content at the URL]
Fetch the text content:

news = Article(url, language='zh')
news.download()  # download the page
news.parse()  # parse the page
print(f"Fetching content of {url} ...")
print('Title:\n', news.title)
print('Body:\n', news.text)

Save the text content:

def Save2Txt(self,filename,data):
    with open(f"./data/{filename}.txt","a+",encoding="utf-8") as f:
        f.write(data)

[Figure: saved files]
[Figure: text of a saved news article]
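
Because Save2Txt builds the file path from the news title, a title containing characters such as / or : can point to a missing folder or be rejected by the file system. A minimal sketch of a clean-up step, using a hypothetical helper safe_filename that is not part of the original code:

```python
import re


def safe_filename(name, max_length=100):
    """Replace path separators, reserved characters and whitespace with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    # Fall back to a fixed name if nothing usable is left
    return cleaned[:max_length] or "untitled"


if __name__ == "__main__":
    print(safe_filename('经济_产业动态_A/B: "C"?'))
```

The cleaned name could then be passed to Save2Txt in place of the raw file_name.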

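4. What's next

Writing these posts takes effort, so thank you for your support. In follow-up posts, we will run information extraction and build a knowledge graph on this text data.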