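【Extracting the Main Content of News, Part 2】Fetching the news list over a period of time and linking it to the news content (using regular expressions)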

随笔备忘录

Category: Python

Article link: https://blog.csdn.net/weixin_42199542/article/details/105518161

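Extracting the content of an individual news page was covered in the previous article: 【Extracting the Main Content of News, Part 1】Extracting the title and author information from a specific news page

【Main program】

【Possible problems】

Right, let's get to the program itself!

【Main program】

Step 1: As before, import the required modules. Nothing changes from the previous part: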

from requests import request
from bs4 import BeautifulSoup
import time
import re

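Step 2: Fetching the HTML of a page from its href goes into a separate function, because it is needed again later. So first define the function acquirehref(href):

# Fetch the HTML source of the page at href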
def acquirehref(href):
    return request('GET', href).text

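Step 3: The text extracted from a news page almost always needs its whitespace cleaned up. To make later use easier, first define the clean-up function format(text):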

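# Note: this name shadows Python's built-in format() within this script.
def format(text):
    # Collapse runs of spaces into a single space.
    while "  " in text:
        text = text.replace("  ", " ")
    # Collapse runs of blank lines into a single newline.
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text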
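
Since re is already imported in Step 1, the same clean-up could also be written with regular expressions. The following is only a minimal sketch, not the code used in this article: the name normalize_whitespace is hypothetical, and it assumes that runs of spaces and runs of newlines are the only things to collapse, exactly as format(text) does.

```python
import re

def normalize_whitespace(text):
    # Collapse every run of two or more spaces into a single space.
    text = re.sub(r' {2,}', ' ', text)
    # Collapse every run of two or more newlines into a single newline.
    text = re.sub(r'\n{2,}', '\n', text)
    return text

print(repr(normalize_whitespace("a    b\n\n\nc")))
```

Each call to re.sub handles a whole run in one pass, so no while loop is needed.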

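Step 4: The following is the main part of the program:

1. The code opens the file hrefs.txt, which holds the links collected so far, in read-only mode;

2. It fetches the HTML of the news site's list page;

3. For every news item on the list page, it takes the link to the individual article and joins it onto the site's root URL so that it becomes a link that can actually be opened. The program can then enter each article page and, using the code from the previous article, extract the title, author and summary.

4. It writes the extracted content to the corresponding files and finally closes them.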
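
The link handling in points 1, 3 and 4 can be sketched on its own, without any network access. This is a rough illustration under assumptions, not the program below: the helper names absolutize, load_seen and record_if_new are hypothetical, and urljoin from the standard library is swapped in for plain string concatenation.

```python
import os
from urllib.parse import urljoin

BASE_URL = 'https://russian.rt.com/'  # site root used in this article

def absolutize(link, base=BASE_URL):
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the base, so no doubled slash appears.
    return urljoin(base, link)

def load_seen(path):
    # Return the set of links already recorded; empty if the file is missing.
    if not os.path.exists(path):
        return set()
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def record_if_new(path, link, seen):
    # Append the link to the file only the first time it is seen.
    if link in seen:
        return False
    with open(path, 'a', encoding='utf-8') as f:
        f.write(link + '\n')
    seen.add(link)
    return True
```

Keeping the recorded links in a set makes the "already processed?" check cheap, and record_if_new returning True signals that the article page still has to be fetched.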

for j in range(0,24*60*60):
    # Load the links recorded on earlier passes (none on the very first run).
    try:
        h=open('./hrefs.txt','r')
        hrefs=h.read().splitlines()
        h.close()
    except FileNotFoundError:
        hrefs=[]
    htmltext=acquirehref('https://russian.rt.com/trend/335110-ssha')
    BeautS=BeautifulSoup(htmltext,'lxml')
    itemsnews=BeautS.find_all('div',{'class':'card__heading card__heading_all-new'})
    #print(itemsnews)
    h=open('./hrefs.txt','a+')
    res=open('./newsinfo.txt','a+', encoding='utf-8')
    total=len(hrefs)
    for itemnews in itemsnews:
        href=itemnews.find('a')
        # Skip cards that carry no usable link.
        if href is None or not href.get('href'):
            continue
        link=href.get('href')
        # Relative links are joined onto the site root.
        if(link.find('https')==-1):
            link='https://russian.rt.com/'+link.lstrip('/')
        # Only handle links that have not been recorded before.
        if link not in hrefs:
            h.write(link+'\n')
            hrefs.append(link)
            if (link.find('https://russian.rt.com/'))>=0:
                page=BeautifulSoup(acquirehref(link),'lxml')
                # Default to empty strings so a missing element does not raise.
                title=abstract=author=''
                for tag in page.find_all('div', class_='article article_article-page'):
                    found=tag.find('h1',class_="article__heading article__heading_article-page")
                    if found is not None:
                        title=format(found.get_text().lower())
                    found=tag.find('div',class_="article__summary article__summary_article-page js-mediator-article")
                    if found is not None:
                        abstract=format(found.get_text().lower())
                for tag in page.find_all('div', class_='article__date-autor-shortcode article__date-author-shortcode_article-page'):
                    found=tag.find('div',class_="article__author article__author_article-page article__author_with-label")
                    if found is not None:
                        author=format(found.get_text().lower())
                total+=1
                line='\n'+'This is '+str(total)+' news. The title is: '+title+'. The author is: '+author+'. The abstract is: '+abstract
                print(line)
                res.write(line)
    res.close()
    h.close()
    # Wait 20 seconds before checking the list page again.
    time.sleep(20)
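
The loop above runs a fixed number of passes with a 20-second pause between them, so its total running time is set by the pass count rather than by a clock. If the aim is to collect news for a fixed period, one option is to loop until a deadline instead. The following is a minimal sketch under that assumption; poll_for, task and the parameter names are hypothetical and are not part of the program above.

```python
import time

def poll_for(duration, interval, task, clock=time.monotonic, sleep=time.sleep):
    # Call task() repeatedly until `duration` seconds have elapsed,
    # pausing `interval` seconds after each call.
    # Returns the number of times task() was called.
    deadline = clock() + duration
    runs = 0
    while clock() < deadline:
        task()
        runs += 1
        sleep(interval)
    return runs
```

With one pass of the scraping code wrapped in a function, say a hypothetical scrape_once(), a call such as poll_for(24*60*60, 20, scrape_once) would keep polling for roughly one day. The clock and sleep parameters exist only so the loop can be tried out without actually waiting.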