Scrapy: resuming a crawl that was interrupted midway

Latest recommended article published 2021-10-26 20:49:48

一小小辣椒


Categories: crawlers, scrapy. Tags: python, big data

Article link: https://blog.csdn.net/weixin_40018318/article/details/116009219


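Reference code: scraping government tender announcements
Government tenders
Approach: while the spider runs, it records its crawl progress in a local file. On the next start it reads that record and skips whatever has already been recorded.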

import datetime


def get_erveday():
    """Return every date from 2021-04-01 to 2021-04-17 (inclusive) as "YYYY-MM-DD" strings."""
    begin_date = datetime.date(2021, 4, 1)
    end_date = datetime.date(2021, 4, 17)
    # To crawl up to today instead: end_date = datetime.date.today()
    date_list = []
    while begin_date <= end_date:
        date_list.append(begin_date.strftime("%Y-%m-%d"))
        begin_date += datetime.timedelta(days=1)
    return date_list

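Put this code in the spider file. It builds the list of dates in the chosen range, and the spider then crawls everything published within that range.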
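The same list can be produced for any range by passing the start and end dates as arguments. The following is only a minimal sketch, not the author's code: the name `date_strings` and its parameters are assumptions introduced here for illustration.

```python
import datetime


def date_strings(begin, end):
    """Return every date from begin to end (inclusive) as "YYYY-MM-DD" strings.

    Hypothetical helper: the name and parameters are not from the original post.
    """
    days = (end - begin).days
    # range() is empty when end is before begin, so the result is then an empty list.
    return [
        (begin + datetime.timedelta(days=offset)).strftime("%Y-%m-%d")
        for offset in range(days + 1)
    ]


dates = date_strings(datetime.date(2021, 4, 1), datetime.date(2021, 4, 17))
print(len(dates), dates[0], dates[-1])
```

Using plain `datetime.date` values avoids the round trip through `strftime` and `strptime`, since dates can be compared and incremented directly.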

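def get_over_data():
    # Read the local progress record; on the very first run the file does not exist yet.
    try:
        with open("data.txt", "r", encoding="utf-8") as ft:
            return ft.read()
    except FileNotFoundError:
        return ""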

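This reads the recorded information from the local file; the returned text is then used in a conditional check to decide what to skip.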
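One way to turn the recorded text into a condition is to treat each line of the record as one finished date and keep only the dates that are not in it. This is a sketch under assumptions: the post does not show the format of data.txt, so the one-date-per-line layout and the helper names `parse_done` and `pending_dates` are hypothetical.

```python
def parse_done(record_text):
    """Split the record text into a set of finished dates (assumed: one date per line)."""
    return {line.strip() for line in record_text.splitlines() if line.strip()}


def pending_dates(all_dates, record_text):
    """Keep the dates that are not recorded yet, preserving their order."""
    done = parse_done(record_text)
    return [d for d in all_dates if d not in done]


# Example: two of four dates were already recorded by an earlier run.
record = "2021-04-01\n2021-04-02\n"
todo = pending_dates(["2021-04-01", "2021-04-02", "2021-04-03", "2021-04-04"], record)
print(todo)
```

In the spider, `record_text` would be the return value of `get_over_data()` and `all_dates` the list returned by `get_erveday()`. Comparing against a set of whole lines is stricter than a substring test on the raw text, which could match by accident.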
After each chunk of content has been crawled, write one record.
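Writing the record can be sketched as appending one line per finished date, so that an interruption loses at most the date that was in progress. The function name `mark_done`, its `path` parameter, and the one-date-per-line format are assumptions; the post only shows that the record is read from data.txt.

```python
import os
import tempfile


def mark_done(date_str, path="data.txt"):
    """Append one finished date to the record file (hypothetical helper)."""
    # Mode "a" creates the file if it is missing and never overwrites earlier records.
    with open(path, "a", encoding="utf-8") as f:
        f.write(date_str + "\n")


# Demonstration in a temporary directory so that no real data.txt is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "data.txt")
for finished in ["2021-04-01", "2021-04-02"]:
    # ... crawl everything for this date first, then record it ...
    mark_done(finished, demo_path)

with open(demo_path, encoding="utf-8") as f:
    recorded = f.read()
print(recorded)
```

Because Scrapy schedules requests asynchronously, deciding when a date is truly finished depends on how the spider is structured, and the post does not show that part.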

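That is the basic idea. A better approach is to combine it with a Redis database, and there are of course other approaches as well. Discussion is welcome!
Full code: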
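The Redis variant could keep the finished dates in a Redis set instead of a text file. The sketch below is an assumption about how that might look, not the author's code: the key name `crawl:done_dates` and the helper names are hypothetical. It relies only on the `sadd` and `sismember` commands, which the redis-py client exposes as methods of the same name; a tiny in-memory stand-in is used so the example runs without a server.

```python
class InMemorySet:
    """Stand-in with the two set methods used below, for running without a Redis server."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before  # number of newly added members

    def sismember(self, key, value):
        return value in self._sets.get(key, set())


DONE_KEY = "crawl:done_dates"  # hypothetical key name


def is_done(client, date_str):
    """True if this date was recorded as finished."""
    return bool(client.sismember(DONE_KEY, date_str))


def mark_done(client, date_str):
    """Record a finished date; adding the same date twice is harmless."""
    client.sadd(DONE_KEY, date_str)


# With a real server the client would be created with redis-py, for example:
#   import redis
#   client = redis.Redis(host="localhost", port=6379, decode_responses=True)
client = InMemorySet()
mark_done(client, "2021-04-01")
todo = [d for d in ["2021-04-01", "2021-04-02"] if not is_done(client, d)]
print(todo)
```

A set fits this use because membership checks are cheap and duplicate writes are ignored. Among the other approaches, Scrapy itself documents pausing and resuming a crawl through the `JOBDIR` setting, which persists the scheduler queue and the duplicate-request filter between runs.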

import scrapy
import json
from urllib import parse
from bs4 import BeautifulSoup
import copy
HEA = {"Content-Type": "application/json"}
data_dict = {
    "采购公告": "ZcyAnnouncement1", "结果公告": "ZcyAnnouncement2", "合同公告": "ZcyAnnouncement3",
    "更正公告": "ZcyAnnouncement4", "招标文件预公示":
