Scraping CIRC Disclosure Information with a Python Crawler

Void-:)

Article link: https://blog.csdn.net/weixin_42874091/article/details/105023501

Scraping Insurance Association of China Disclosure Information with a Python Crawler

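At my advisor's request, I needed to download the PDFs that insurance companies publish under the relevant categories of the Information Disclosure section on the official website of the Insurance Association of China.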

Insurance Association of China (中国保险行业协会)

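Clicking different categories, such as the insurance companies' annual information disclosure, does not change the page URL. This is a good reason to suspect that the page loads its content asynchronously (XHR/JS).
Open the browser's developer tools, select All in the Network tab, clear the log, and then click the category you want, for example the consolidated related-party transaction disclosure (关联交易合并披露).
The request that appears there (marked with a red box in the original screenshot) is the real URL we need:
http://icid.iachina.cn/ICID/front/leafColComType.do?columnid=2016072012158397
We then open this URL.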

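Next we fetch the PDFs of all the insurance companies listed on that page. This means iterating over every company, and each company discloses data for several years.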

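Start by clicking the first company, 中国人寿资产管理有限公司 (China Life Asset Management). Again the URL does not change. Inspecting its XHR requests in the same way shows that the real URL becomes
http://icid.iachina.cn/ICID/front/getCompanyInfos.do?columnid=2016072012158397&comCode=GSZC&attr=01
So we only need to set comCode to each insurance company's short code in turn.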
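As a minimal sketch of that substitution, the query string can be assembled with the standard library instead of being formatted by hand. The helper name build_company_url is made up for illustration; the endpoint and parameter values are the ones shown in the URL above.

```python
from urllib.parse import urlencode

# Endpoint and column id copied from the URL observed in the Network tab.
BASE = "http://icid.iachina.cn/ICID/front/getCompanyInfos.do"
COLUMN_ID = "2016072012158397"


def build_company_url(com_code, column_id=COLUMN_ID):
    """Return the getCompanyInfos.do URL for one company short code.

    build_company_url is a hypothetical helper, not part of the site's API.
    """
    query = urlencode({"columnid": column_id, "comCode": com_code, "attr": "01"})
    return "{}?{}".format(BASE, query)


print(build_company_url("GSZC"))
```

Using urlencode keeps the URL valid even if a short code ever contains characters that need escaping.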

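The next step is to get the short codes of all the insurance companies. On the previous page, i.e. http://icid.iachina.cn/ICID/front/leafColComType.do?columnid=2016072012158397, use the "select an element" tool to inspect each company name: the short code is stored in the id attribute of the <a> element. We can therefore loop over these elements to collect every short code and substitute it into comCode.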
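The idea of reading short codes from the id attribute of each <a> element can be sketched as follows. To keep the example free of third-party packages it swaps in the standard library's HTMLParser for BeautifulSoup, and the sample markup is invented for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser


class AnchorIdCollector(HTMLParser):
    """Collect the id attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased.
        if tag == "a":
            anchor_id = dict(attrs).get("id")
            if anchor_id:
                self.ids.append(anchor_id)


def extract_anchor_ids(html):
    """Return the ids of all <a> tags in document order."""
    parser = AnchorIdCollector()
    parser.feed(html)
    return parser.ids


# Hypothetical markup: "GSZC" comes from the article, "XXXX" is a placeholder.
sample = """
<ul>
  <li><a id="GSZC" href="#">Company A</a></li>
  <li><a href="/about">About</a></li>
  <li><a id="XXXX" href="#">Company B</a></li>
</ul>
"""
print(extract_anchor_ids(sample))
```

Links without an id, such as navigation links, are skipped, which is the same effect the try/except in the full script is after.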
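Taking 国寿资产 (GSZC) as an example, we get the list of its announcements (shown as a screenshot in the original post).
Each PDF in that list is what we ultimately want. Open one announcement and look at its XHR request in the same way:
http://icid.iachina.cn/front/infoDetail.do?informationno=2020012109398975
Next we need the informationno, which is stored in the id attribute of the <a> elements on the page we just looked at.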

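The last step is to get the PDF of this annual announcement from 国寿资产.
Opening the announcement shows the URL http://icid.iachina.cn/ICID/files/piluxinxi/pdf/viewer.html?file=8f993c5a-1c1c-4f91-a8a5-7fad85a14616.PDF
The file value also happens to be stored in the id of an <a> element on the previous page.
Note that this URL goes through a viewer. We only need the raw PDF, so change it to http://icid.iachina.cn/ICID/files/piluxinxi/pdf/8f993c5a-1c1c-4f91-a8a5-7fad85a14616.PDF.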
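The viewer-to-raw rewrite can also be done programmatically. The sketch below assumes, based only on the single example above, that the raw PDF always sits in the same directory as viewer.html; raw_pdf_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def raw_pdf_url(viewer_url):
    """Turn a viewer.html?file=NAME URL into the URL of NAME in the same directory."""
    parts = urlsplit(viewer_url)
    # The file parameter carries the PDF's file name.
    file_name = parse_qs(parts.query)["file"][0]
    # Drop the last path segment (viewer.html) to get the directory.
    directory = parts.path.rsplit("/", 1)[0]
    return urlunsplit((parts.scheme, parts.netloc, directory + "/" + file_name, "", ""))


viewer = ("http://icid.iachina.cn/ICID/files/piluxinxi/pdf/viewer.html"
          "?file=8f993c5a-1c1c-4f91-a8a5-7fad85a14616.PDF")
print(raw_pdf_url(viewer))
```

In the full script this step is not needed, because the file name is read directly from the <a> id and appended to the PDF directory.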

The final code is as follows:

import os
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0"}
# Output folder for the PDFs; adjust it to your own machine.
save_dir = r"C:\Users\admin\Desktop\关联\auto"
os.makedirs(save_dir, exist_ok=True)


def get_soup(url):
    # The script decodes every page as GBK before parsing it.
    response = requests.get(url, headers=header)
    response.encoding = 'GBK'
    return BeautifulSoup(response.text, 'lxml')


# Note: the URLs in this script omit the /ICID prefix that appears in the
# URLs quoted earlier; they are kept exactly as in the original script.

# Step 1: collect the company short codes from the id attribute of each <a>.
url = "http://icid.iachina.cn/front/leafColComType.do?columnid=2016072012158397"
n = [a.attrs['id'] for a in get_soup(url).select('a') if 'id' in a.attrs]

for z in tqdm(n):
    # Step 2: list this company's announcements; each <a> id is an informationno.
    url = "http://icid.iachina.cn/front/getCompanyInfos.do?columnid=2016072012158397&comCode={}&attr=01#".format(z)
    data = [a for a in get_soup(url).select('a') if 'id' in a.attrs]
    # As in the original script, the last <a> with an id is dropped.
    l = [a.attrs['id'] for a in data][:-1]
    # Strip surrounding whitespace so the title can be used as a file name.
    name = [a.text.strip() for a in data][:-1]
    for j in range(len(l)):
        # Step 3: the second <a> on the detail page holds the PDF file name in its id.
        url = "http://icid.iachina.cn/front/infoDetail.do?informationno={}".format(l[j])
        link = get_soup(url).select('a')[1].attrs['id']

        # Step 4: download the raw PDF; retry up to 5 times, 3 seconds apart,
        # if the server returns an empty body.
        url = "http://icid.iachina.cn/files/piluxinxi/pdf/{}".format(link)
        path = os.path.join(save_dir, "{}.pdf".format(name[j]))
        for attempt in range(6):
            pdf = requests.get(url, headers=header).content
            if pdf:
                break
            time.sleep(3)
        # Write the PDF to disk
        with open(path, 'wb') as f:
            f.write(pdf)
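
Two parts of the script are fragile: announcement titles are used directly as file names, and a download counts as successful as soon as the file is non-empty. The sketch below shows one possible hardening, not the original author's method. safe_filename and fetch_pdf are hypothetical helpers; the check relies on the fact that PDF files begin with the bytes %PDF, and the download function is passed in as a callable so the example runs without network access.

```python
import re
import time


def safe_filename(title, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title.strip())
    return cleaned or "untitled"


def fetch_pdf(fetch, attempts=6, delay=3.0):
    """Call fetch() until it returns bytes that start with b"%PDF".

    fetch is any zero-argument callable returning bytes.
    Returns the PDF bytes, or None if every attempt failed.
    """
    for attempt in range(attempts):
        body = fetch()
        if body.startswith(b"%PDF"):
            return body
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Demo with a fake download that fails twice before succeeding.
responses = [b"", b"<html>busy</html>", b"%PDF-1.4 fake body"]
result = fetch_pdf(lambda: responses.pop(0), attempts=5, delay=0)
print(result)
print(safe_filename(" 2019/Q4: report?\n"))
```

In the script above, the callable could be something like lambda: requests.get(url, headers=header).content, and the file would only be written when fetch_pdf returns something other than None.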