Python Web Scraping (Selenium + lxml)

pylduck

Article link: https://blog.csdn.net/pylduck/article/details/103349711

Scraping Eastmoney (东方财富网)

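We start from the page itself and its Network panel. The goal is to fetch the balance sheet data shown on this page, and to switch companies by passing a different code.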

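The page relies on JavaScript execution, so we first install Selenium. Selenium started out as a browser automation testing tool; scrapers use it mainly because requests cannot execute JavaScript. Install it with: conda install selenium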

Next, inspect the page source and locate the balance sheet we need. I extracted the relevant part; the structure is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>

    <meta charset="utf-8"/>
    ...
    <title>东方财富(300059.SZ)新财务分析-PC_HSF10资料</title>
    ...
</head>
<body>
...
<div class="main">
    ...
    <div id="divBody">
        ...
        <div class="section">
            ...
            <div class="content" style="text-align:center;">
                ...
                <i class="prev" id="zcfzb_prev" style="display: none;"></i>
                <i class="next" id="zcfzb_next" style="display: inline;"></i>
                <div class="tab tips-fontsize">
                    <ul id="zcfzb_ul">
                        <li class="first current" reportdatetype="0" reporttype="1">
                            <span>按报告期</span>
                        </li>
                        ...
                    </ul>
                </div>
                <table id="report_zcfzb" style="table-layout: fixed;">
                    <tr>
                        <td>
                            <img src="/Content/SoftImages/loading.gif" alt="">
                        </td>
                    </tr>
                </table>
            </div>
        </div>
        ...
    </div>
    ...
</div>
</body>
</html>

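Inspecting the HTML shows that the balance sheet table we want is the element with id report_zcfzb. There is one catch: the table's data is generated by JavaScript in the page, and the static HTML only contains a loading image as a placeholder. So we need Selenium's wait mechanism, and we fetch and parse the table only after its rows and columns have been generated. We build the scraper step by step, following the same stages as in the previous post.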

1. Sending the request

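We use Selenium's webdriver for this. Early Selenium setups paired it with PhantomJS to handle JavaScript, but newer Selenium releases no longer support PhantomJS and expect headless Chrome or Firefox instead. You can either downgrade Selenium or simply use headless mode. The request code looks like this: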

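from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument('--disable-gpu')  # Google's docs suggest this flag to work around a bug
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some unusual pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # skip images to load faster
chrome_options.add_argument('--headless')  # run the browser without a visible window
# Raw string so the backslashes in the Windows path are not treated as escapes.
# executable_path is the Selenium 3 API; Selenium 4 passes the driver path via a Service object.
driver = webdriver.Chrome(executable_path=r'D:\software\chromedriver2.3.2\chromedriver.exe', options=chrome_options)
driver.get(url + "?code=" + code)
wait = WebDriverWait(driver, 15)
wait.until(
    EC.presence_of_element_located(
        (By.XPATH, '//table[@id="report_zcfzb"]//th[@class="tips-colname-Left"]')))  # wait for the th under the table to appear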

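The code above uses Chrome's headless mode, which drives a real Chrome browser without showing a window. Without these options, running the code opens a visible Chrome window (useful when you need to interact with the UI, for example to log in).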

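Pay attention to executable_path, which tells webdriver where chromedriver is. You need to download the driver that matches your installed Chrome version from http://chromedriver.storage.googleapis.com/index.html. Each Chrome version pairs with particular driver versions; I looked the mapping up in a chart found via Baidu.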

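My Chrome is version 60. The 2.33 driver I downloaded first raised an error; 2.32 worked. The driver usually downloads as a compressed archive: extract it to a folder of your choice and point executable_path in the code at chromedriver.exe.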

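The call driver.get(url) issues the request. WebDriverWait, imported with from selenium.webdriver.support.wait import WebDriverWait, then handles the problem mentioned earlier: the JavaScript has not finished running, so the table data has not loaded yet. Once loaded, the balance sheet table has this structure: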

<table id="report_zcfzb" style="table-layout: fixed;">
    <tbody>
    <tr>
        <th class="tips-colname-Left" style="width: 366px;"><span>资产负债表</span></th>

        <th><span>2019-09-30</span></th>

        <th><span>2019-06-30</span></th>

        <th><span>2019-03-31</span></th>

        <th><span>2018-12-31</span></th>

        <th><span>2018-09-30</span></th>

    </tr>
    <tr>
        <td class="tips-fieldname-Left" style="font-weight:bold;"><span>流动资产</span></td>

        <td class="tips-data-Right"><span></span></td>

        <td class="tips-data-Right"><span></span></td>

        <td class="tips-data-Right"><span></span></td>

        <td class="tips-data-Right"><span></span></td>

        <td class="tips-data-Right"><span></span></td>

    </tr>
    <tr>
        <td class="tips-fieldname-Left"><span>    货币资金</span></td>

        <td class="tips-data-Right"><span>191.9亿</span></td>

        <td class="tips-data-Right"><span>218.6亿</span></td>

        <td class="tips-data-Right"><span>290.6亿</span></td>

        <td class="tips-data-Right"><span>113.3亿</span></td>

        <td class="tips-data-Right"><span>107.5亿</span></td>

    </tr>
    ...
    </tbody>
</table>

wait = WebDriverWait(driver, 15)
wait.until(
    EC.presence_of_element_located(
        (By.XPATH, '//table[@id="report_zcfzb"]//th[@class="tips-colname-Left"]')))

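These two statements wait up to 15 seconds for a th element with class tips-colname-Left to appear under the table with id report_zcfzb; if it has not appeared by then, a TimeoutException is raised. EC is imported with from selenium.webdriver.support import expected_conditions as EC. The module offers other conditions too, such as visibility_of_element_located, which checks whether an element is visible; explore them as needed.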
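To make the waiting behaviour concrete, here is a minimal pure-Python sketch of the polling idea behind an explicit wait: call a condition repeatedly until it returns something truthy or the timeout expires. It is an illustration only, not Selenium's implementation; wait_until, its parameters, and the fake header_present condition are hypothetical names invented for this sketch.

```python
import time


def wait_until(condition, timeout=15.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Hypothetical helper that illustrates polling; it is not part of Selenium.
    Raises TimeoutError if the condition never succeeds within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Fake "page" whose table header only shows up on the third check,
# standing in for a table that JavaScript fills in after the page loads.
checks = {"count": 0}


def header_present():
    checks["count"] += 1
    return "th.tips-colname-Left" if checks["count"] >= 3 else None


found = wait_until(header_present, timeout=2.0, poll_interval=0.01)
print(found, "after", checks["count"], "checks")
```

The design point is that the wait returns as soon as the condition holds, so the 15 seconds in the scraper is an upper bound rather than a fixed delay.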

At this point the page contains the table with the structure shown above. To inspect it, write driver.page_source to a local HTML file with open().

2. Getting the response content

Nothing special here: once the request has succeeded, driver.page_source returns the rendered page. If you can, add your own timeout and network checks.

3. Parsing the content

Here we use lxml; re regular expressions, BeautifulSoup, PyQuery and others would work too. lxml with XPath tends to parse somewhat faster.

Add the import from lxml import etree, then build the DOM tree: html = etree.HTML(driver.page_source)

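Then, based on the structure of the table and the surrounding DOM shown above, write the XPath expressions and loop over the matches to extract the values.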

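Common XPath rules

(1) nodename: selects the child nodes of the current node that have this name
(2) /: selects from the root node (absolute path)
(3) //: selects matching nodes at any depth below the current position in the expression, not only direct children
(4) .: selects the current node
(5) ..: selects the parent of the current node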
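The rules above can be tried on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; ElementTree supports only a subset of XPath (for example, no absolute paths on an element), while lxml's xpath() accepts the full expressions used in this post. The SAMPLE markup is a trimmed, hypothetical copy of the loaded table.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the loaded table, wrapped in a parent div
SAMPLE = """<div class="content">
  <table id="report_zcfzb">
    <tr>
      <th class="tips-colname-Left"><span>资产负债表</span></th>
      <th><span>2019-09-30</span></th>
    </tr>
    <tr>
      <td class="tips-fieldname-Left"><span>货币资金</span></td>
      <td class="tips-data-Right"><span>191.9亿</span></td>
    </tr>
  </table>
</div>"""

root = ET.fromstring(SAMPLE)

# './/' searches at any depth below the current node; [@id='...'] filters by attribute
table = root.find(".//table[@id='report_zcfzb']")

# './tr' selects only direct children named tr
rows = table.findall("./tr")

# './th' then './span': walk down one level at a time from the current node
header_texts = [th.find("./span").text for th in rows[0].findall("./th")]

# '..' steps from a matched cell back up to its parent row
data_cell = table.find(".//td[@class='tips-data-Right']")
parent_row = table.find(".//td[@class='tips-data-Right']/..")

print(header_texts, data_cell.find("./span").text, parent_row.tag)
```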

The result combines a list with dictionaries; the output should look like this:

[{"name": "货币资金", "2019-09-30": "191.9亿", "2019-06-30": "218.6亿", "2019-03-31": "290.6亿", ..., "group": "流动资产"}, ...]

Now we parse. Start with the header row; the XPath should be //table[condition]//tr/th

table = html.xpath("//table[@id='report_zcfzb']")[0]  # xpath() returns a list
trlist = table.xpath(".//tr")  # all rows among the table's descendants
tdheads = trlist[0].xpath("./th")  # all header cells: tr/th

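Loop over tdheads and append each header text to a list, then replace the first entry with name; it labels the first column of every following row.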

Then loop over the rows after the header. A category row is identified by its style attribute, and its label is stored in the variable group:

tdspecial = tr.xpath("./td[@style='font-weight:bold;']")
group = tdspecial[0].xpath("./span/text()")[0].strip()

For each data row under a category, store the cells in a dictionary.

ths = tr.xpath("./td")
data[item] = ths[i].xpath("./span/text()")[0].strip()

strip removes leading and trailing whitespace.

In the end we have a list of dictionaries.
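The parsing steps above can be condensed into one small function. This is a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs as-is, the TABLE string is a shortened copy of the loaded table, and cell_text and parse_table are hypothetical helper names. With lxml the same logic applies, using xpath() in place of find()/findall().

```python
import xml.etree.ElementTree as ET

# Shortened copy of the loaded table: header row, one category row, one data row
TABLE = """
<table id="report_zcfzb">
  <tr>
    <th class="tips-colname-Left"><span>资产负债表</span></th>
    <th><span>2019-09-30</span></th>
    <th><span>2019-06-30</span></th>
  </tr>
  <tr>
    <td class="tips-fieldname-Left" style="font-weight:bold;"><span>流动资产</span></td>
    <td class="tips-data-Right"><span></span></td>
    <td class="tips-data-Right"><span></span></td>
  </tr>
  <tr>
    <td class="tips-fieldname-Left"><span>    货币资金</span></td>
    <td class="tips-data-Right"><span>191.9亿</span></td>
    <td class="tips-data-Right"><span>218.6亿</span></td>
  </tr>
</table>
"""


def cell_text(cell):
    """Return the stripped text of the cell's <span>, or '' if it is empty."""
    span = cell.find("./span")
    return (span.text or "").strip() if span is not None else ""


def parse_table(table):
    """Turn the table into (headers, list of row dictionaries)."""
    trs = table.findall(".//tr")
    headers = [cell_text(th) for th in trs[0].findall("./th")]
    headers[0] = "name"  # the first column holds the item name
    records, group = [], ""
    for tr in trs[1:]:  # skip the header row
        bold = tr.find("./td[@style='font-weight:bold;']")
        if bold is not None:  # category row: remember its label and move on
            group = cell_text(bold)
            continue
        record = {h: cell_text(td) for h, td in zip(headers, tr.findall("./td"))}
        record["group"] = group
        records.append(record)
    return headers, records


headers, records = parse_table(ET.fromstring(TABLE))
print(records)
```

Pairing headers and cells with zip avoids a manual index counter, and cell_text tolerates empty spans, which would otherwise fail when indexing the first text node.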

4. Saving the data

Tabular data is usually saved to a database or as a JSON string. We leave the database aside for now and write CSV and JSON files.

Import csv and json; both ship with Python, so nothing needs to be installed.

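Saving a CSV file: headarr is the list of column names and rows is the list of dictionaries from step 3; every dictionary key must appear in the header list.

with open('zcfz.csv', 'w', encoding='utf-8', newline='') as f:
    f_csv = csv.DictWriter(f, headarr)
    f_csv.writeheader()
    f_csv.writerows(rows)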
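As a quick check of the header and dictionary correspondence, the following sketch writes to an in-memory buffer instead of a file and reads the result back. The headers and rows values are made-up sample data; only csv.DictWriter and csv.DictReader from the standard library are used.

```python
import csv
import io

# Made-up sample data in the same shape as the scraper's output
headers = ["name", "2019-09-30", "group"]
rows = [{"name": "货币资金", "2019-09-30": "191.9亿", "group": "流动资产"}]

# Write to an in-memory buffer so the example leaves no file behind
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(rows)

# Read the CSV text back into dictionaries to confirm nothing was lost
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)

# A dictionary key that is missing from `headers` is rejected with ValueError
try:
    writer.writerow({"name": "x", "unexpected": "y"})
    rejected = False
except ValueError:
    rejected = True
print("extra key rejected:", rejected)
```

This is why the scraper appends 'group' to the header list before writing: each row dictionary carries a group key.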

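Saving a JSON file: rows is the list of dictionaries from step 3. Pass it to json.dump directly rather than a string already produced by json.dumps, which would be encoded a second time.

with open('zcfz.json', 'w', encoding='utf-8') as fp:
    json.dump(rows, fp, ensure_ascii=False)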
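One pitfall worth checking when saving JSON: if the data has already been turned into a string with json.dumps, dumping that string again wraps it in a second layer of encoding. The sketch below, on made-up sample data, compares the two outcomes.

```python
import json

# Made-up sample data in the same shape as the scraper's output
rows = [{"name": "货币资金", "2019-09-30": "191.9亿", "group": "流动资产"}]

# Serialize the list once: the result is JSON for a list of objects
direct = json.dumps(rows, ensure_ascii=False)

# Serialize an already-serialized string: the result is JSON for one big string
twice = json.dumps(json.dumps(rows), ensure_ascii=False)

loaded_direct = json.loads(direct)
loaded_twice = json.loads(twice)
print(type(loaded_direct).__name__, type(loaded_twice).__name__)
```

Reading the double-encoded file back yields a str that has to be parsed a second time, and the Chinese text is stored as escape sequences because the inner json.dumps call used the default ensure_ascii=True.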

That completes the scraper for Eastmoney's balance sheet; to switch companies, just take the code as input.
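Since the company is chosen purely through the code query parameter, the request URL can be built in one small helper. This is a sketch: build_url is a hypothetical name, and "EXAMPLE1" is a placeholder rather than a real code. urlencode from the standard library takes care of escaping.

```python
from urllib.parse import urlencode

BASE_URL = "http://emweb.securities.eastmoney.com/NewFinanceAnalysis/Index"


def build_url(base, code):
    """Append the code as a query parameter (hypothetical helper)."""
    code = code.strip()  # tolerate stray spaces from input()
    if not code:
        raise ValueError("code must not be empty")
    return base + "?" + urlencode({"code": code})


# "EXAMPLE1" is a placeholder, not a real company code
url = build_url(BASE_URL, " EXAMPLE1 ")
print(url)
```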

Full code

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from lxml import etree
import csv  # CSV output
import json  # JSON output


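# The table data on this page is generated by JS. Two options: 1. wait for the JS to finish, then scrape the page elements (this example); 2. call the JS directly to get the data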

def spider(url, code, save):
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')  # Google's docs suggest this flag to work around a bug
    chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some unusual pages
    chrome_options.add_argument('blink-settings=imagesEnabled=false')  # skip images to load faster
    chrome_options.add_argument('--headless')  # run the browser without a visible window
    # Raw string so the backslashes in the Windows path are not treated as escapes.
    # executable_path is the Selenium 3 API; Selenium 4 passes the driver path via a Service object.
    driver = webdriver.Chrome(executable_path=r'D:\software\chromedriver2.3.2\chromedriver.exe', options=chrome_options)
    try:
        driver.get(url + "?code=" + code)
        wait = WebDriverWait(driver, 15)
        wait.until(
            EC.presence_of_element_located(
                (By.XPATH, '//table[@id="report_zcfzb"]//th[@class="tips-colname-Left"]')))  # wait for the th under the table to appear
        # with open('b.html', 'w', encoding='utf-8') as f:
        #     f.write(driver.page_source)  # page text after the JS has run
        html = etree.HTML(driver.page_source)
        table = html.xpath("//table[@id='report_zcfzb']")[0]  # xpath() returns a list
        # print(table.xpath('@style'))  # ['table-layout: fixed;'] at this point we have the table node
        trlist = table.xpath(".//tr")  # all rows among the table's descendants
        tdheads = trlist[0].xpath("./th")  # all header cells: tr/th
        # tdheads = trlist[0].xpath("./th[position()>1]")  # header cells, skipping the first column
        headarr = [head.xpath("./span/text()")[0].strip() for head in tdheads]
        headarr[0] = 'name'
        print('Total columns:', len(headarr))
        # *********************** iterate over the data rows **********************************
        rows = []  # named rows to avoid shadowing the built-in list
        group = ''
        for tr in trlist[1:]:  # skip the header row
            # A bold first cell (font-weight:bold;) marks a category row
            tdspecial = tr.xpath("./td[@style='font-weight:bold;']")
            if tdspecial:
                group = tdspecial[0].xpath("./span/text()")[0].strip()
                continue
            tds = tr.xpath("./td")  # all cells in the current row: tr/td
            data = {}
            for item, td in zip(headarr, tds):
                texts = td.xpath("./span/text()")
                data[item] = texts[0].strip() if texts else ''  # an empty span has no text node
            data['group'] = group
            rows.append(data)
        print('Data rows:', len(rows))
        headarr.append('group')
        if save == 1:
            with open('zcfz.json', 'w', encoding='utf-8') as fp:
                json.dump(rows, fp, ensure_ascii=False)  # dump the list itself, not a pre-serialized string
        elif save == 2:
            with open('zcfz.csv', 'w', encoding='utf-8', newline='') as f:
                f_csv = csv.DictWriter(f, headarr)
                f_csv.writeheader()
                f_csv.writerows(rows)
        print("Saved")
    finally:
        driver.quit()  # always shut the browser down to free resources; quit() also ends the driver process


if __name__ == '__main__':
    code = input("Enter code: ")
    save = int(input("Choose output format json(1)/csv(2): "))
    print("***************** Output files are written to the current folder *************************")
    url = 'http://emweb.securities.eastmoney.com/NewFinanceAnalysis/Index'
    spider(url, code, save)
    os.system("pause")  # keep the console window open when run as a packaged exe (Windows only)

# Packaging as an exe: install pyinstaller with conda install pyinstaller
# Build with pyinstaller -F Finance.py; to run, open cmd first, then enter "xxx.exe"