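Reference code: crawling government tender announcements
Government tenders
Approach: while the spider runs, it records what it has crawled in a local file. On the next start it reads that record and skips anything already listed there.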
import datetime

def get_erveday():
    # Build the list of days to crawl as "YYYY-MM-DD" strings, both ends included.
    begin_date = datetime.date(2021, 4, 1).strftime("%Y-%m-%d")
    date_list = []
    begin_date = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
    end_date = datetime.date(2021, 4, 17).strftime("%Y-%m-%d")
    # To stop at today's date instead of a fixed end date, use:
    # end_date = datetime.datetime.now().strftime("%Y-%m-%d")
    end_date = datetime.datetime.strptime(end_date, "%Y-%m-%d")
    while begin_date <= end_date:
        date_str = begin_date.strftime("%Y-%m-%d")
        date_list.append(date_str)
        begin_date += datetime.timedelta(days=1)
    return date_list
Put this code in the spider file. It builds a list of dates, and the spider crawls everything published within that range.
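As a side note, the same list can be built without converting dates to strings and back. The sketch below is only an illustration: the function name date_range and its start/end parameters are my own and do not appear in the spider above.

```python
import datetime


def date_range(start, end):
    """Return every day from start to end (both inclusive) as "YYYY-MM-DD" strings."""
    days = (end - start).days
    # range() is empty when end is earlier than start, so the result is [] in that case.
    return [
        (start + datetime.timedelta(days=offset)).strftime("%Y-%m-%d")
        for offset in range(days + 1)
    ]


if __name__ == "__main__":
    dates = date_range(datetime.date(2021, 4, 1), datetime.date(2021, 4, 17))
    print(len(dates), dates[0], dates[-1])
```

Passing the range in as arguments makes it easier to rerun the spider for a different period without editing the function body.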
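def get_over_data():
    # Read the local record of what has already been crawled.
    # data.txt must already exist, otherwise open() raises FileNotFoundError.
    with open("data.txt", "r", encoding="utf-8") as ft:
        result = ft.read()
    return result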
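This reads the recorded information from the local file; the result is used as the condition for deciding what to skip.
After each batch of content is crawled, one record is written.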
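The record-and-skip cycle can be sketched roughly as follows. This is a minimal illustration rather than the spider's actual code: it assumes one key (for example a date string) per line in the record file, and the helper names load_done, mark_done and pending are hypothetical. Unlike get_over_data, it returns an empty set when the file does not exist yet, so the first run needs no special setup.

```python
import os
import tempfile


def load_done(path="data.txt"):
    """Return the set of keys already recorded; empty if the file does not exist yet."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def mark_done(key, path="data.txt"):
    """Append one key to the record file once its batch has been crawled."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(key + "\n")


def pending(keys, done):
    """Keep only the keys that are not in the record, preserving their order."""
    return [key for key in keys if key not in done]


if __name__ == "__main__":
    # A temporary directory stands in for the spider's working directory.
    record = os.path.join(tempfile.mkdtemp(), "data.txt")
    for day in pending(["2021-04-01", "2021-04-02"], load_done(record)):
        # ... crawl everything published on `day` here ...
        mark_done(day, record)
    print(pending(["2021-04-01", "2021-04-02", "2021-04-03"], load_done(record)))
```

Writing the record only after a batch finishes means an interrupted batch is crawled again on the next run instead of being silently skipped.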
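That is the basic idea. A better approach would be to combine it with a Redis database. There are of course other ways to do it as well, and I'd be glad to hear them!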
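One way the Redis variant might look is sketched below, under several assumptions: the redis-py client is used, crawled keys are stored in a Redis set, and the set name crawled_dates as well as the helper names are invented for this illustration. The helpers accept any client object that provides sismember and sadd, so the Redis connection itself appears only in the commented usage at the end.

```python
def is_crawled(client, key, set_name="crawled_dates"):
    """Return True if key is already a member of the Redis set set_name."""
    return bool(client.sismember(set_name, key))


def mark_crawled(client, key, set_name="crawled_dates"):
    """Add key to the Redis set once its batch has been crawled."""
    client.sadd(set_name, key)


def pending_keys(client, keys, set_name="crawled_dates"):
    """Keep only the keys that have not been recorded yet, preserving order."""
    return [key for key in keys if not is_crawled(client, key, set_name)]


# Usage with redis-py (assumes the package is installed and a server
# is listening on localhost:6379):
#
# import redis
# client = redis.Redis(host="localhost", port=6379, decode_responses=True)
# for day in pending_keys(client, ["2021-04-01", "2021-04-02"]):
#     ...  # crawl everything published on `day`
#     mark_crawled(client, day)
```

Compared with a local text file, a Redis set gives constant-time membership checks and can be shared by several spider processes.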
Full code:
import scrapy
import json
from urllib import parse
from bs4 import BeautifulSoup
import copy
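# Request headers: the request body is sent as JSON.
HEA = {
    "Content-Type": "application/json"}
# Announcement category name -> category code.
data_dict = {
    "采购公告": 'ZcyAnnouncement1', "结果公告": "ZcyAnnouncement2", "合同公告": "ZcyAnnouncement3",
    "更正公告": "ZcyAnnouncement4", "招标文件预公示":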