Python pandas: formatting qcc data (saved to a CSV file at the end)

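Data source: https://top.tianyancha.com/companies/bj
Because the HTML of the 工商信息 (business registration information) section is captured as-is, the HTML document has to be parsed locally.
Modules used for the data processing here: pandas, requests_html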


1. HTML source:

<div class="data-header" id="nav-main-baseInfo"><span class="data-title">工商信息</span><span class="tips-block-data" style="margin-left: 0px;">
        <!-- 标签:发生变更时通知我 -->
        <div class="listening-company-tag" tyc-event-click="true" tyc-event-ch="Company_Detail_Businessinfo_Change_CallMe">
            <i class="tic tic-tixing"></i><span>发生变更时通知我</span>
        </div>
    </span>
    <div class="data-logo"><svg viewBox="0 0 90 20">
            <use xlink:href="#svg-tyc-logo"></use>
        </svg></div><a class="view-official-photo" href="https://www.tianyancha.com/snapshot/1341875964" style="" target="_blank" tyc-event-click="" tyc-event-ch="Company_Detail_Businessinfo_Official_Photo"><i class="tic tic-kuaizhao"></i>查看工商快照</a>
    <!-- <span class="link-click  hong-kong-query  hidden" id="hongKongQuery"
                    tyc-event-click tyc-event-ch="Businfo_Depth_Information_Analysis_PC"
                    onclick="comHongKong.hongKongQueryModal()"><span class="entrance-content">香港企业信息分析报告</span><i class="tic tic-bread-right-icon"></i></span>-->
</div>
<div class="data-content" id="_container_baseInfo">
    <!--entityType  ==1   公司 ,2香港,3社会组织,4律所 5事业单位 6基金会 8台湾-->
    <table class="table -striped-col -breakall">
        <tbody>
            <tr>
                <td rowspan="4" width="148px">法定代表人</td>
                <td rowspan="4" class="left-col  shadow" width="" tyc-event-click="" tyc-event-ch="CompangyDetail.faren">
                    <div class="legal-representative -new" style="min-height: 157px;" onclick="common.openUrl('https://www.tianyancha.com/human/2342810371-c1341875964')">
                        <div>
                            <div class="lazy-img  indetity-logo-w56   -alias -bg2">
                                <div class="logo-text -l1 -w56 -bg2"><span class="text"></span></div>
                                <div class="logo -w56" data-index=""><img class="img" data-src="" alt="张*" err-src=""></div>
                            </div>
                            <div class="humancompany">
                                <div class="name"><a class="link-click" target="_blank" title="张*" href="https://www.tianyancha.com/human/2342810371-c1341875964" onclick="common.stopPropagation(event)">张***</a></div>
                            </div>
                        </div>
                        <div class="merge -new"></div>
                    </div>
                </td>
                <td width="148px">经营状态</td>
                <td width="289px">存续</td>
                <td rowspan="3" width="147px">天眼评分</td>
                <td rowspan="3" width="168px" class="sort-bg -no-align -hover shadow -new">
                    <div class="sort-score  -new"><span class="sort-score-desc">评分</span><span class="sort-score-value">99</span></div>
                    <div class="sort-chart-container -new"><img class="sort-chart lazy-img -image" style="width: 139px;" alt="评分99" data-src="https://cdn.tianyancha.com/web-require-js/themes/18blue/images/score/score_99.png" src="https://cdn.tianyancha.com/web-require-js/themes/18blue/images/score/score_99.png"></div>
                    <div class="score-claim-hover -hover -new" onclick="claimGuide.goPackagePage(1341875964)" tyc-event-click="" tyc-event-ch="CompanySearch.grade.Jiafen">
                        <div class="angle -new"></div>
                        <div class="score-claim-content  unclaimed  "><span class="score-claim-text">99<i class="score-claim-text-add unclaimed">+1</i></span></div>
                        <div class="claim-score-contrast">加分后可超过约&nbsp;<em>205904219</em>&nbsp;家企业</div>
                        <div class="score-claim-click">天眼评分是客户了解企业实力直观的方式!<br>认证后可为自己的企业加分</div>
                        <div class="button button-claim -sm" tyc-event-click="" tyc-event-ch="CompanyDetail.Score.Renzheng">去认证</div>
                    </div>
                </td>
            </tr>
            <tr>
                <td>成立日期</td>
                <td title=" ">2015-08-25</td>
            </tr>
            <tr>
                <td width="">注册资本<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="width:304px;
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">注册资本,是指合营企业在登记管理机构登记的资本总额,是合营各方已经缴纳的或合营者承诺一定要缴纳的出资额的总和。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">注册资本大小体现企业的综合实力,注册资本越多说明企业实力越雄厚。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">为了扩大企业经营范围或拓宽业务等,公司会增加注册资本,增强企业实力,提高信用。<a class="link-click-more" href="https://www.tianyanqifu.com/goods-list?q=注册资本增资" target="_blank" sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册资本提示&amp;button=我要办理注册资本增资"><span class="link-text">我要办理注册资本增资</span><i class="tic tic-bread-right-icon"></i></a></div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">当出现市场不景气、企业资本过剩或严重亏损等问题时,企业会减少注册资本进行应对。<a class="link-click-more" href="https://www.tianyanqifu.com/goods-list?q=注册资本减资" target="_blank" sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册资本提示&amp;button=我要办理注册资本减资"><span class="link-text">我要办理注册资本减资</span><i class="tic tic-bread-right-icon"></i></a></div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册资本提示&amp;button=没有帮助" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册资本提示&amp;button=有帮助" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td width="">
                    <div title="5000000万人民币">5000000万人民币</div>
                </td>
            </tr>
            <tr>
                <td>实缴资本</td>
                <td width="">4320000万人民币</td>
                <td>工商注册号</td>
                <td>1100000*******</td>
            </tr>
            <tr>
                <td>统一社会信用代码<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">一般指法人和其他组织统一社会信用代码,相当于让法人和其他组织拥有了一个全国统一的“身份证号”。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">标准规定统一社会信用代码用18位阿拉伯数字或大写英文字母表示。</div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.CreditCodeExplainPopup.Unhelpful" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.CreditCodeExplainPopup.Helpful" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td>911100003*****</td>
                <td>纳税人识别号<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">纳税人识别号是税务登记证上的号码,通常简称为“税号”,每个企业的纳税人识别号都是唯一的。由15位、17位、18或者20位码(字符型)组成。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">这个属于每个人自己且终身不变的数字代码很可能成为我们的第二张“身份证”。</div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.TaxpayerNumberExplainPopup.Unhelpful" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.TaxpayerNumberExplainPopup.Helpful" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td>9111000035*******</td>
                <td>组织机构代码<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">组织机构代码是组织机构在社会经济活动中统一赋予的“单位身份证”,是对国内依法注册、登记的机关、企事业单位、社会团体,以及其他组织机构颁发的唯一的、始终不变的代码标识。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">由8位数字(或大写字母)本体代码和1位数字(或大写字母)校验码组成。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">三证合一、五证合一之后,组织机构代码已经被统一社会信用代码取代。</div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.OrganizationCodeExplainPopup.Unhelpful" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="CompanyDetail.GongShang.OrganizationCodeExplainPopup.Helpful" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td colspan="2">35522*****</td>
            </tr>
            <tr>
                <td>营业期限</td>
                <td><span>2015-08-25&nbsp;&nbsp;2040-08-24</span></td>
                <td>纳税人资质</td>
                <td>一般纳税人</td>
                <td>核准日期</td>
                <td>2020-12-31</td>
            </tr>
            <tr>
                <td>公司类型</td>
                <td>有限责任公司(法人独资)</td>
                <td>行业</td>
                <td>商务服务业</td>
                <td>人员规模</td>
                <td>-</td>
            </tr>
            <tr>
                <td>参保人数</td>
                <td>0</td>
                <td>登记机关</td>
                <td colspan="3">北京市市场监督管理局</td>
            </tr>
            <tr>
                <td>曾用名</td>
                <td>-</td>
                <td>英文名称</td>
                <td colspan="3">-</td>
            </tr>
            <tr>
                <td>注册地址<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="width:310px;
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">企业注册地址是指在营业执照上登记的“住址”,一般情况下为主要办事机构所在地,不同的城市对注册地址的要求不一样。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">因公司搬迁、业务规模扩大等原因,企业需要进行地址变更。<a class="link-click-more" href="https://www.tianyanqifu.com/goods-list?q=注册地址变更" target="_blank" sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册地址提示&amp;button=我要办理注册地址变更"><span class="link-text">我要办理注册地址变更</span><i class="tic tic-bread-right-icon"></i></a></div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册地址提示&amp;button=没有帮助" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=注册地址提示&amp;button=有帮助" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td colspan="5">北京市西城区复兴门内大街*****
                    <!--<span class="tic tic-fujin c9"></span>--><a class="link-click link-spacing" href="https://www.tianyancha.com/map/1341875964" tyc-event-click="" tyc-event-ch="CompangyDetail.Gongshang.NearCompany" target="_blank">附近公司</a></td>
            </tr>
            <tr>
                <td>经营范围<div class="data-describe " style=""><i class="tic icon tic-circle-question-o" style="color: #ACCCE6;"></i>
                        <div class="warp  -normal  -middle  right-center" id="search_query" sensors-observe="" style="width:360px;
      ; z-index: 503;">
                            <div class="triangle" style=""></div>
                            <div style="">
                                <!--当不展示底部回馈按钮的时候只展示一个name字段-->
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">经营范围是指企业可以从事的生产经营与服务项目,是进行公司注册申请时的必填项。</div>
                                </div>
                                <div class="item" style=""><span class="border"></span>
                                    <div class="content">随着公司业务的调整、扩大,需要对营业执照上的经营范围增加或更改。<a class="link-click-more" href="https://www.tianyanqifu.com/goods-list?q=经营范围变更" target="_blank" sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=经营范围提示&amp;button=我要办理经营范围变更"><span class="link-text">我要办理经营范围变更</span><i class="tic tic-bread-right-icon"></i></a></div>
                                </div>
                                <div class="feedback-button"><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=经营范围提示&amp;button=没有帮助" href="javascript:void(0);" onclick="window.clickInconducive(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useless.png">没有帮助</a><a sensors-event-click="" sensors-event-ch="page_button_click?page=公司详情页&amp;module=经营范围提示&amp;button=有帮助" href="javascript:void(0);" onclick="window.clickHelpful(event)"><img width="32" height="32" src="https://cdn.tianyancha.com/resources/images/icon_useful.png">有帮助</a></div>
                            </div>
                        </div>
                    </div>
                </td>
                <td colspan="5"><span class="">非证券业务的投资、投资管理、咨询。(市场主体依法自主选择经营项目,开展经营活动;依法须经批准的项目,经相关部门批准后依批准的内容开展经营活动;不得从事国家和本市产业政策禁止和限制类项目的经营活动。)</span></td>
            </tr>
        </tbody>
    </table>
</div>

2. pandas processing code:

import pandas as pd
from requests_html import HTML

pd.set_option('display.width',None)
pd.set_option("display.max_rows", 1000)  # show up to 1000 rows
pd.set_option("display.max_columns", 1000)  # show up to 1000 columns
data = pd.read_csv("tianyancha.csv")


# Parse the HTML document
def parse_html(doc):
    infos = []
    html = HTML(html=doc)
    for items in html.find('tr td'):  # iterate over every td under a tr
        name = items.text
        # Value cell for the Tianyan score (天眼评分)
        if items.find('.sort-bg .sort-score-value'):
            name = name.split('\n')[0].replace("评分", "")
        # Value cell for the legal representative (法定代表人)
        elif items.find('.shadow'):
            name = name.split('\n')[1]
        # All other td cells
        else:
            name = name.split('\n')[0].replace("\xa0", "").replace("附近公司", "")
        infos.append(name)
    # Pair each label cell with the value cell that follows it
    return dict(zip(infos[0::2], infos[1::2]))


# data['content'].map(parse_html)  # all rows
data.head(2)['content'].map(parse_html)  # first two rows
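
The return statement of parse_html depends on the cells alternating between a label and its value. Below is a minimal sketch of just that pairing step, run on a hand-written list instead of parsed HTML. The sample strings are copied from the HTML above; `pair_cells` is a hypothetical helper name used only for illustration and is not part of the original code.

```python
def pair_cells(cells):
    """Pair alternating label/value cells into a dict.

    Even positions (0, 2, 4, ...) are treated as labels and odd
    positions (1, 3, 5, ...) as the values that follow them.
    """
    labels = cells[0::2]
    values = cells[1::2]
    # zip stops at the shorter sequence, so a trailing label that has
    # no value after it is silently dropped.
    return dict(zip(labels, values))


cells = ["经营状态", "存续", "成立日期", "2015-08-25", "行业", "商务服务业"]
info = pair_cells(cells)
print(info)
```

If a row ever yields an odd number of cells, every later pair shifts by one, so it may be worth checking the resulting keys against the labels you expect.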





def reset_data(series):
    params = {}
    params['公司名称'] = series['company']
    params['链接'] = series['company-href']
    content_dict = parse_html(series['content'])
    params.update(content_dict)
    return params
    
    
params = data.head(2).apply(reset_data,axis=1)
params
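
To make the shape of `params` easier to see, here is a small sketch of `DataFrame.apply` with `axis=1` and a function that returns a dict. The two-row frame is made up for illustration. Only the column names `company` and `company-href` come from the code above, and `to_record` is a hypothetical stand-in for reset_data without the HTML parsing.

```python
import pandas as pd

# Hypothetical two-row frame standing in for tianyancha.csv.
data = pd.DataFrame(
    {
        "company": ["Company A", "Company B"],
        "company-href": ["https://example.com/a", "https://example.com/b"],
    }
)


def to_record(row):
    # With axis=1 each `row` is a Series indexed by column name.
    return {"公司名称": row["company"], "链接": row["company-href"]}


# The function returns a dict per row, so the result is a Series
# whose values are those dicts, indexed like the original frame.
records = data.apply(to_record, axis=1)
print(type(records).__name__)
print(records[0])
```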







params = data.apply(reset_data,axis=1)
df = pd.DataFrame.from_dict(dict(params), orient='index')
df.to_csv("./data.csv")
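
The conversion from a Series of dicts to a table can be tried on its own with made-up records, as in the sketch below. The records are hypothetical, and an in-memory buffer is used instead of ./data.csv so that nothing is written to disk.

```python
import io

import pandas as pd

# Hypothetical records; the real ones would come from reset_data.
params = pd.Series(
    [
        {"公司名称": "Company A", "经营状态": "存续"},
        {"公司名称": "Company B", "行业": "商务服务业"},
    ]
)

# dict(params) maps each row label to its record; orient='index'
# turns every record into one row and fills missing keys with NaN.
df = pd.DataFrame.from_dict(dict(params), orient="index")

buffer = io.StringIO()
df.to_csv(buffer)
print(buffer.getvalue())
```

If the saved file is meant to be opened in Excel, passing encoding="utf-8-sig" to to_csv is a common way to keep the Chinese text readable.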