Downloading SSE and SZSE Annual Reports with a Crawler and Analyzing Entrusted Loans


Tomato uncle 彭煜方


Tags: Python crawler

Article link: https://blog.csdn.net/great_cricus/article/details/102463138


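Recently, at Boss Liu's request, I wrote a program to help her analyze entrusted loans. Part one covers how to download the annual reports of the Shanghai Stock Exchange (SSE) and Shenzhen Stock Exchange (SZSE) with a Python crawler; part two covers how to process the downloaded annual-report PDFs.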

Crawling the Annual Reports

Shanghai Stock Exchange

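First, the SSE announcement page is at the link below; it is easy to find online.
SSE listed-company information
The page looks like this:
[Screenshot]
My strategy is to first get all the stock codes, and then, for each code, simulate a query for the annual reports published within a given range of years. The hard part is simulating the query. We can press F12 in Chrome to open the developer tools.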

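We first enter the query conditions in the query form (600000, main board, annual report) and then click Query. Clicking through the entries in the panel on the right one by one, we find that the last one (usually it is the last) contains our query information.
[Screenshot]
As the captured request shows, the site's request is a GET, so we can simply fill in the params dict. After some experimenting, we found that only some of the parameters need to be filled in: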

URL_PARAM = {
    # Whether to paginate the results
    'isPagination': 'false',
    # Security (stock) code
    'productId': '600000',
    # Keyword
    'keyWord': '',
    # Main board or STAR Market; the main board is 0101
    'securityType': "0101",
    # The two reportType fields select the report type; we want annual reports
    'reportType2': 'DQBG',
    'reportType': 'YEARLY',
    # Query date range
    'beginDate': '2016-07-17',
    'endDate': '2019-09-25',
}
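To make it concrete how such a dict turns into a GET request, here is a minimal sketch that builds the query URL by hand with the standard library's urlencode. It only approximates what requests does with its params argument, and the helper name build_query_url is my own, not part of the original program. The endpoint is the query URL used in the full listing in this article.

```python
from urllib.parse import urlencode

# Query endpoint, as used in the full listing in this article
URL_QUERY_COMPANY = "http://query.sse.com.cn/security/stock/queryCompanyBulletin.do"

URL_PARAM = {
    'isPagination': 'false',
    'productId': '600000',
    'keyWord': '',
    'securityType': '0101',
    'reportType2': 'DQBG',
    'reportType': 'YEARLY',
    'beginDate': '2016-07-17',
    'endDate': '2019-09-25',
}


def build_query_url(base_url, params):
    """Hypothetical helper: append the params to base_url as a query string."""
    return base_url + "?" + urlencode(params)


query_url = build_query_url(URL_QUERY_COMPANY, URL_PARAM)
print(query_url)
```

Printing the URL is a quick way to compare what the crawler would send against the request shown in the browser's developer tools.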

We also noticed that the site has anti-crawler measures. The Referer of the request is
http://www.sse.com.cn/disclosure/listedinfo/announcement/
so we set:

# URL_SSE must be defined before HEADER, which refers to it
URL_SSE = "http://www.sse.com.cn/disclosure/listedinfo/announcement/"
HEADER = {
    'Referer': URL_SSE,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
}
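As a quick offline check that the headers are attached the way we expect, the sketch below builds a request object without sending it and reads the Referer back. It uses the standard library's urllib.request.Request purely for inspection, which is a substitution on my part; the program in this article sends the real request with requests.get(..., headers=HEADER). The helper name build_request is hypothetical.

```python
from urllib.request import Request

URL_SSE = "http://www.sse.com.cn/disclosure/listedinfo/announcement/"
URL_QUERY_COMPANY = "http://query.sse.com.cn/security/stock/queryCompanyBulletin.do"

HEADER = {
    'Referer': URL_SSE,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
}


def build_request(url, headers):
    """Hypothetical helper: build a request carrying the headers (nothing is sent)."""
    return Request(url, headers=headers)


req = build_request(URL_QUERY_COMPANY, HEADER)
sent_referer = req.get_header("Referer")
print(sent_referer)
```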

The rest is routine; here is my code:

import os
import time
import requests
from copy import deepcopy

URL_SSE = "http://www.sse.com.cn/disclosure/listedinfo/announcement/"
# Stock code list
URL_SSE_STOCK = "http://www.sse.com.cn/js/common/ssesuggestdata.js"
# Announcement query endpoint
URL_QUERY_COMPANY = "http://query.sse.com.cn/security/stock/queryCompanyBulletin.do"

URL_PDF = "http://static.sse.com.cn"

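# Report types: display name -> (reportType, reportType2)
REPORT_TYPE = {
    '全部': ('ALL', ''),               # all
    '定期公告': ('ALL', 'DQBG'),        # periodic reports
    '年报': ('YEARLY', 'DQBG'),        # annual report
    '第一季度季报': ('QUATER1', 'DQBG'),  # first-quarter report
    '半年报': ('QUATER2', 'DQBG'),      # semi-annual report
    '第三季度季报': ('QUATER3', 'DQBG'),  # third-quarter report
    '临时公告': ('ALL', 'LSGG'),        # ad hoc announcements
    '上市公司章程': ('SHGSZC', 'LSGG'),   # articles of association
    '发行上市公告': ('FXSSGG', 'LSGG'),   # issuance and listing announcements
    '公司治理': ('GSZL', 'LSGG'),       # corporate governance
    '股东大会会议': ('GDDH', 'LSGG'),     # shareholders' meetings
    'IPO公司公告': ('IPOGG', 'LSGG'),    # IPO company announcements
    '其他': ('QT', 'LSGG'),            # other
}

# Security types: display name -> securityType value
SECURITY_TYPE = {
    '全部': '0101,120100,020100,020200,120200',  # all
    '主板': '0101',                              # main board
    '科创板': '120100,020100,020200,120200',      # STAR Market
}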

HEADER = {
    'Referer': URL_SSE,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
}

URL_PARAM = {
    # Whether to paginate the results
    'isPagination': 'false',
    'productId': '600000',
    # Keyword
    'keyWord': '',
    'securityType': SECURITY_TYPE['主板'],
    'reportType2': 'DQBG',
    'reportType': 'YEARLY',
    'beginDate': '2016-07-17',
    'endDate': '2019-09-25',
}


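def get_all_codes(url):
    """Download the suggestion JS file and return lists of codes, names and pinyin."""
    res = requests.get(url)
    content = res.content.decode()
    # Each entry starts with '_t.push({val:"'; the quoted fields that follow
    # are, in order, the stock code, the company name and the pinyin.
    tmp = content.split('_t.push({val:"')
    code, name, pinyin = [], [], []
    for i in tmp[1:]:
        item = i.split('"')
        code.append(item[0])
        name.append(item[2])
        pinyin.append(item[4])
    return code, name, pinyin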


def get_pdf_url(code, begin_date, end_date, security_type='全部', report_type='年报'):
    """Query the announcements of one stock and return (url, type, year, date) tuples."""
    url_param = deepcopy(URL_PARAM)
    url_param['productId'] = code
    url_param['securityType'] = SECURITY_TYPE[security_type]
    url_param['reportType2'] = REPORT_TYPE[report_type][1]
    url_param['reportType'] = REPORT_TYPE[report_type][0]
    url_param['beginDate'] = begin_date
    url_param['endDate'] = end_date
    result = requests.get(URL_QUERY_COMPANY, params=url_param, headers=HEADER).json()['result']
    return_list = []
    for i in result:
        # Skip report summaries (titles containing "摘要"); keep only full reports
        if "摘要" not in i["TITLE"]:
            return_list.append((URL_PDF + i['URL'], i['BULLETIN_TYPE'], i['BULLETIN_YEAR'], i['SSEDATE']))
    return return_list

def save_pdf(code, pdf_title_urls, path='./SH/'):
    """Download each PDF into <path>/<code>/ as code_type_year_date.pdf."""
    file_path = os.path.join(path, code)
    os.makedirs(file_path, exist_ok=True)
    for url, r_type, year, date in pdf_title_urls:
        # '2019-04-30' -> '20190430'
        date = date.replace('-', '')
        file_name = '_'.join([code, r_type, year, date]) + '.pdf'
        file_full_name = os.path.join(file_path, file_name)
        rs = requests.get(url, stream=True)
        # Raise on HTTP errors so that the caller's retry loop sees the failure
        rs.raise_for_status()
        with open(file_full_name, "wb") as fp:
            for chunk in rs.iter_content(chunk_size=10240):
                if chunk:
                    fp.write(chunk)


def download_report(code):
    """Download the reports of one stock published in the last five years."""
    month_day = time.strftime('-%m-%d', time.localtime())
    year = int(time.strftime('%Y', time.localtime()))
    begin_date = str(year - 5) + month_day
    end_date = str(year) + month_day
    pdf_urls = get_pdf_url(code, begin_date, end_date)
    if not pdf_urls:
        print('[%s] done' % code)
        return
    # Retry the download up to five times
    for i in range(1, 6):
        try:
            save_pdf(code, pdf_urls)
            break
        except Exception:
            print('[%s] download attempt %d failed' % (code, i))
    else:
        # The loop finished without a break, so every attempt failed
        print('[%s] download failed' % code)


def main():
    stock_codes, _, _ = get_all_codes(URL_SSE_STOCK)
    len_stock_codes = len(stock_codes)
    # Create the output directory on the first run so that listdir does not fail
    os.makedirs('./SH', exist_ok=True)
    # A stock that already has a folder is treated as downloaded
    already_list = os.listdir('./SH')
    for index, code in enumerate(stock_codes):
        print('Total stocks: %d, completed: %d  ' % (len_stock_codes, index), end='')
        if code not in already_list:
            download_report(code)
    print('All done')


if __name__ == '__main__':
    main()
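
The string splitting in get_all_codes can also be written as a single regular expression, which makes the expected shape of each entry explicit. This is only a sketch: the sample payload below is made up for illustration, and the field names val2 and val3 are my assumption, inferred from the positions that the split-based code reads; the real ssesuggestdata.js may differ. The names SAMPLE_JS, ENTRY_RE and parse_codes are hypothetical.

```python
import re

# Hypothetical sample of the JS payload; the val2/val3 field names are assumed
SAMPLE_JS = (
    '_t.push({val:"600000",val2:"浦发银行",val3:"pfyh"});'
    '_t.push({val:"600004",val2:"白云机场",val3:"byjc"});'
)

# Capture the first three double-quoted values after '_t.push({val:'
ENTRY_RE = re.compile(r'_t\.push\(\{val:"([^"]*)"[^"]*"([^"]*)"[^"]*"([^"]*)"')


def parse_codes(js_text):
    """Return a list of (code, name, pinyin) tuples found in js_text."""
    return ENTRY_RE.findall(js_text)


rows = parse_codes(SAMPLE_JS)
codes = [code for code, _, _ in rows]
print(codes)
```

Compared with indexing into the result of split('"'), an entry that does not match the pattern is simply skipped rather than raising an IndexError, which may or may not be what you want when the site changes its format.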
