Notes
Several packages are needed, such as pdfminer, pdfminer3k, and pdfplumber.
pdfminer cannot parse SSE inquiry letters; the more capable pdfplumber can, but the extracted text may occasionally contain duplicated characters.
pdfminer3k and pdfplumber may conflict with each other and keep the scripts from running. Parsing the SSE letters uses pdfplumber; if it will not run, check the error message to see whether pdfminer is installed, or try uninstalling pdfminer3k and reinstalling pdfplumber.
Parsing the SZSE letters uses pdfminer3k; if it still will not run after installation, try uninstalling pdfplumber and reinstalling pdfminer3k.
To scrape all inquiry letters, change the loop bounds in the code to match the actual number of result pages.
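Before running either script, it can help to check which of the conflicting PDF libraries is currently importable in your environment. A minimal sketch (the helper name `available` is my own, not from the original post):

```python
import importlib.util

def available(pkg):
    """Return True if the named package can be imported (i.e. is installed)."""
    return importlib.util.find_spec(pkg) is not None

for pkg in ("pdfminer", "pdfminer3k", "pdfplumber"):
    print(pkg, "installed" if available(pkg) else "missing")
```

If both pdfminer3k and a conflicting pdfminer show up, uninstall one and reinstall the library the script you are about to run depends on, as described above.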
SSE (Shanghai Stock Exchange)
Step 1: scrape the SSE inquiry letter links
# -*- coding: utf-8 -*-
# @Time : 2020/8/6 18:16
# @Author : 马拉小龙虾
# @FileName: 上交所一条龙.py
# @Software: PyCharm Community Edition
# @Blog :https://blog.csdn.net/weixin_43636302
import requests
import csv
import re

def downlourl(currentpage):
    # Build the paginated query URL for the SSE inquiry-letter API
    url = ("http://query.sse.com.cn/commonSoaQuery.do?siteId=28&sqlId=BS_GGLL&extGGLX="
           "&stockcode=&channelId=10743%2C10744%2C10012&extGGDL="
           "&order=createTime%7Cdesc%2Cstockcode%7Casc&isPagination=true"
           "&pageHelp.pageSize=15&pageHelp.pageNo=" + str(currentpage)
           + "&pageHelp.beginPage=" + str(currentpage) + "&pageHelp.cacheSize=1")
    return url

headers = {
    'Referer': 'http://www.sse.com.cn/disclosure/credibility/supervision/inquiries/',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

with open('sh.csv', "w", newline='') as f:
    writer = csv.writer(f, delimiter=',')
    # Column headers are kept in Chinese: step 2 looks the columns up by these names
    title = ['时间2', '标题', '公司代码', '函件类别', '公司简称', '函件类型', '时间1', '网址', '函件编码']
    writer.writerow(title)
    for page in range(1, 101):  # adjust the upper bound to the actual page count
        r = requests.get(downlourl(page), headers=headers)
        for i in r.json()['result']:
            # The letter code is the run of digits in the PDF filename, e.g. .../c/123456.pdf
            result = re.search(r'c/(\d+)\.pdf', i['docURL'])
            writer.writerow([i['cmsOpDate'], i['docTitle'], i['stockcode'], i['extWTFL'],
                             i['extGSJC'], i['docType'], i['createTime'], i['docURL'],
                             result.group(1)])
        print('Finished scraping page %d' % page)
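The letter code written to the last CSV column is simply the digit run from the PDF filename in `docURL`. A minimal sketch of that extraction in isolation (the URL below is a made-up example, not a real SSE document):

```python
import re

# Hypothetical example of a docURL value returned by the SSE API
docurl = "http://static.sse.com.cn/inquiry/c/5312770286.pdf"

# Capture the digits between "c/" and ".pdf" -- this is the letter code
m = re.search(r'c/(\d+)\.pdf', docurl)
print(m.group(1))  # -> 5312770286
```

Note that `re.search` can return `None` if a docURL ever deviates from this pattern, so checking the match before calling `.group(1)` would make the scraper more robust.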
Step 2: download the inquiry letters, parse them, and save each one to its own txt file
import pandas as pd
import time
from urllib.request import urlopen
from urllib.request import Request
from urllib.request import quote
import requests
import pdfplumber
import re

# Read the link table produced in step 1 (written with GBK encoding)
data = pd.read_csv("sh.csv", encoding='GBK')
函件编码 = data.loc[:, '函件编码']  # letter code
网址 = data.loc[:, '网址']          # document URL
函件类型 = data.loc[:, '函件类型']  # letter type
headers = {'content-type': 'application/json',
           'Accept-Encoding': 'gzip, deflate',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'}
baseurl = "http://reportdocs.static.szse.cn/UpFiles/fxklwxhj/"

def parse(docucode):