Python Web Crawler: Scraping "Study" Articles

zebra°

Category: Personal crawler study notes. Tags: python, crawler

Article link: https://blog.csdn.net/weixin_43895522/article/details/121082754


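First, import the required modules: os creates folders, time pauses the crawler, datetime handles dates, requests fetches web pages, BeautifulSoup (bs4) parses them, and docx (python-docx) creates Word documents so the scraped articles can be written to the local disk.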

# Import the required libraries
import os
import time
import datetime
import requests
from bs4 import BeautifulSoup
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH, WD_LINE_SPACING  # paragraph alignment and line spacing
from docx.shared import Inches, Pt  # length and font-size units
from docx.oxml.ns import qn  # qualified XML names, used to set the East Asian font

Set the request headers so that the script presents itself as a regular browser:

# Request headers for the crawler
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}

Set the start and end dates to crawl. They could also be read from the command line with input(); here they are hard-coded in the script.

# Date range to crawl
begin = datetime.date(2021, 9, 25)  # start date
end = datetime.date(2021, 10, 31)  # end date
delta = datetime.timedelta(days=1)  # one-day step
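
The date stepping and URL construction can also be tried on their own, without any network access. The sketch below is only an illustration: the helper names iter_dates and index_url are hypothetical, and the URL layout is assumed to be the one used for the index pages in this article.

```python
import datetime

# Assumption: same URL layout as the index pages used in this article.
BASE_URL = 'http://www.81.cn/jfjbmap/content/'


def iter_dates(begin, end):
    """Yield every date from begin to end, inclusive."""
    current = begin
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)


def index_url(day):
    """Build the index-page URL for one date (hypothetical helper)."""
    return f'{BASE_URL}{day:%Y-%m}/{day:%d}/node_2.htm'


dates = list(iter_dates(datetime.date(2021, 9, 25), datetime.date(2021, 10, 31)))
print(len(dates))           # number of days in the range
print(index_url(dates[0]))  # URL for the first day
```

Using a generator keeps the start date unchanged, whereas the script in this article advances begin itself as the loop counter.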

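Next comes the main loop of the crawler. The site's URLs follow a date-based pattern, so the index page for each day can be built from the date and fetched in turn. For every article linked from the index page, the script opens the article page and reads its title; if the title contains the desired keywords, it collects all of the article's paragraphs, formats them with the docx module, and saves the result to the local disk.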

while begin <= end:
    # Zero-padded year, month and day of the current date
    year, month, day = begin.strftime("%Y-%m-%d").split('-')

    monthdir = 'F:\\81xi\\' + year + '-' + month  # output folder, one per year-month
    os.makedirs(monthdir, exist_ok=True)  # create the folder if it does not exist

    furl = 'http://www.81.cn/jfjbmap/content/' + year + '-' + month + '/' + day + '/node_2.htm'  # PLA Daily index page for this date
    fre = requests.get(furl, headers=HEADERS)  # fetch the index page
    fre.encoding = 'utf-8'  # set the page encoding
    fsoup = BeautifulSoup(fre.text, 'lxml')  # parse the page with BeautifulSoup
    lilist = fsoup.find('ul', id='APP-NewsList').find_all('li')  # list of article <li> tags

    for li in lilist:  # loop over the article entries
        newurl = 'http://www.81.cn/jfjbmap/content/' + year + '-' + month + '/' + day + '/' + li.a.get('href')  # article URL
        newre = requests.get(newurl, headers=HEADERS)  # fetch the article page
        newre.encoding = 'utf-8'  # set the page encoding
        newsoup = BeautifulSoup(newre.text, 'lxml')  # parse the page with BeautifulSoup

        newtitle = newsoup.find('div', class_='article-header').children  # child nodes of the title block
        title = ''
        for child in newtitle:  # join the title parts, one per line
            if child.string is not None:
                title = title + child.string + '\n'
                title = title.replace('\n\n稿件信息\n\n\n', '')  # drop the "稿件信息" (article info) label

        if 'xixixi' in title or 'xixixi' in title:  # title keywords to keep
            if '致电' in title or '致贺电' in title or '致贺信' in title:  # skip telegrams and congratulatory messages
                continue

            file = docx.Document()  # create the document
            head = file.add_heading(title + '\n', 1)  # write the title as a level-1 heading
            head.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title

            article = newsoup.find('div', class_='article-content')  # article body tag
            plist = article.find_all('p')  # list of paragraph tags
            for p in plist:  # loop over the paragraphs
                # The original used p.string, which is None when <p> has nested tags; get_text() keeps that text
                para = p.get_text()
                pbody = file.add_paragraph()  # create a paragraph
                pbody.paragraph_format.first_line_indent = 406400  # first-line indent in EMU (32 pt), used as a two-character indent
                text = pbody.add_run(para)  # add the paragraph text
                text.font.name = 'Times New Roman'  # font for Western text
                text.element.rPr.rFonts.set(qn('w:eastAsia'), '黑体')  # font for Chinese text (SimHei)
            # Build the file name; the original referenced an undefined name d, the current date is assumed here
            savetitle = '[' + str(begin) + ']' + title.replace('\n', '').replace('■本报评论员', '').strip().replace(' ', '-') + '.docx'
            # Save the document
            file.save(monthdir + '\\' + savetitle)

    print(str(begin) + '-over!')  # report each finished day
    time.sleep(0.3)  # pause 0.3 s after each day
    begin += delta  # move on to the next day
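
The keyword check and the file-name construction in the loop can be pulled out into small pure functions, which makes them easy to try without fetching any pages. This is only a sketch under assumptions: wanted and safe_filename are hypothetical names, and replacing characters that Windows does not allow in file names is an addition that the script above does not perform.

```python
import re


def wanted(title, include, exclude):
    """True if title contains any include keyword and no exclude keyword."""
    has_include = any(word in title for word in include)
    has_exclude = any(word in title for word in exclude)
    return has_include and not has_exclude


def safe_filename(date_text, title):
    """Build a .docx file name from a date string and an article title."""
    cleaned = re.sub(r'\s+', '-', title.strip())     # whitespace runs become one hyphen
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', cleaned)  # characters Windows forbids in file names
    return f'[{date_text}]{cleaned}.docx'


print(wanted('Annual report', ['report'], ['telegram']))
print(safe_filename('2021-10-01', ' A/B: test\n'))
```

Unlike the script above, which deletes line breaks and turns spaces into hyphens, this variant collapses every run of whitespace into a single hyphen; either choice works as long as it is applied consistently.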

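While it runs, the crawler prints a progress line for each finished day:
[Screenshot: console output while the crawler runs]
The articles are saved on the local disk in one folder per month.

Screenshots of the article contents are not included: they contain politically sensitive terms and would not pass the platform's review.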
