Most crawlers collect text data. But what if the targets are a large number of files that need to be downloaded in batch? Today we use cninfo (巨潮资讯网, http://www.cninfo.com.cn) as the example.
Before the hands-on part, a quick summary of how crawlers send requests:
Roughly 90% of crawling tasks use requests.get; the remaining 10% use requests.post.
To determine which method a URL expects, open the browser developer tools, find the URL in the Network panel, and check its Request Method.
The method used in this tutorial is POST, so we use the requests.post function.
requests.post in detail
For this requests.post call we need the params argument, i.e.
requests.post(url, params=data)
Here url is the POST endpoint, and params takes a dict that requests encodes into the URL's query string, like this:
import requests

url = 'http://www.cninfo.com.cn/new/disclosure'
data = {'column': 'szse_latest',
        'pageNum': 4,
        'pageSize': 20,
        'sortName': '',
        'sortType': '',
        'clusterFlag': 'true'}
resp = requests.post(url, params=data)
resp.url
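A quick way to see what params does, without hitting the site: requests URL-encodes the dict and appends it to the URL after a '?', which is the same string the standard library's urllib.parse.urlencode produces. A minimal sketch (the empty sortName/sortType keys are omitted here just to keep the output short):

```python
from urllib.parse import urlencode

# the same query parameters we pass to requests via params=
data = {'column': 'szse_latest',
        'pageNum': 4,
        'pageSize': 20,
        'clusterFlag': 'true'}

# requests builds the final URL like this: base URL + '?' + encoded query string
full_url = 'http://www.cninfo.com.cn/new/disclosure?' + urlencode(data)
print(full_url)
# → http://www.cninfo.com.cn/new/disclosure?column=szse_latest&pageNum=4&pageSize=20&clusterFlag=true
```

This is exactly the value you see in resp.url after the request above.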
Video tutorial
I have uploaded the video to Bilibili in the series 【python网络爬虫快速入门】.
Video link: https://www.bilibili.com/video/av72010301?p=10
You can also click "阅读原文" (Read the original) at the end of this post to jump to the video.
Code
import requests
import csv

#function that downloads one pdf announcement
def downloadpdf(url, file):
    resp = requests.get(url)
    f = open(file, 'wb')
    f.write(resp.content)
    f.close()

#create a csv file to store the announcement details
csvf = open('data/巨潮资讯/深圳证券市场公告.csv', 'a+', encoding='gbk', newline='')
writer = csv.writer(csvf)
writer.writerow(('公司名', '股票代码', '发布时间', '公告标题', '公告pdf下载地址', '公告类型'))

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
cookies = {'Cookie': 'noticeTabClicks=%7B%22szse%22%3A1%2C%22sse%22%3A0%2C%22hot%22%3A0%2C%22myNotice%22%3A0%7D; tradeTabClicks=%7B%22financing%20%22%3A0%2C%22restricted%20%22%3A0%2C%22blocktrade%22%3A0%2C%22myMarket%22%3A0%2C%22financing%22%3Anull%7D; JSESSIONID=183467B85157E00A626B77D1E16CC580; insert_cookie=45380249; UC-JSESSIONID=A75421EA72188528B984B4166A86CAAA; _sp_ses.2141=*; _sp_id.2141=9063055c-7fc7-4b0c-a0ed-089886082fbd.1579084693.2.1579318110.1579084713.3f22b0aa-580d-4f25-9127-9734dd5647dc'}

#url for the POST request
url = 'http://www.cninfo.com.cn/new/disclosure'
for page in range(39):
    try:
        #query parameters for the POST request
        data = {'column': 'szse_latest',
                'pageNum': page,
                'pageSize': 20,
                'sortName': '',
                'sortType': '',
                'clusterFlag': 'true'}
        #send the request and collect the data
        resp = requests.post(url, params=data, headers=headers, cookies=cookies)
        pdfss = resp.json()['classifiedAnnouncements']
        print(page)
        for pdfs in pdfss:
            for pdf in pdfs:
                secName = pdf['secName']
                secCode = 'SZ' + str(pdf['secCode'])
                announcementTitle = pdf['announcementTitle']
                adjunctUrl = 'http://static.cninfo.com.cn/' + pdf['adjunctUrl']
                pdffile = 'data/巨潮资讯/pdf/' + announcementTitle + '.pdf'
                downloadpdf(url=adjunctUrl, file=pdffile)
                announcementTypeName = pdf['announcementTypeName']
                announcementTime = pdf['announcementTime']
                #print(secName, secCode, announcementTime, announcementTitle, adjunctUrl, announcementTypeName)
                writer.writerow((secName, secCode, announcementTime, announcementTitle, adjunctUrl, announcementTypeName))
    except Exception:
        #resp may not exist if the request itself failed, so report the page number instead
        print('problem on page', page)
csvf.close()
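One fragility in the loop above: the pdf file name is built straight from announcementTitle, and announcement titles sometimes contain characters that are illegal in file paths (such as / or *), which makes downloadpdf fail for those items. A minimal sketch of a fix, assuming we only need to keep titles readable (safe_filename is a hypothetical helper, not part of the original code):

```python
import re

def safe_filename(title):
    # replace characters that are illegal in file names on common
    # systems (/ \ : * ? " < > |) with an underscore
    return re.sub(r'[\\/:*?"<>|]', '_', title)

# a title containing '/' would otherwise be treated as a directory separator
print(safe_filename('关于变更1/2股份用途的公告*更正后'))
# → 关于变更1_2股份用途的公告_更正后
```

In the main loop, pdffile would then be built as 'data/巨潮资讯/pdf/' + safe_filename(announcementTitle) + '.pdf'.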
Recent posts
To get the jupyter notebook code, reply with the keyword "20200120" in the official account backend.