Learning to Build Web Scrapers with Selenium

莫空0000

Category: Python. Tags: selenium, python

Article link: https://blog.csdn.net/weixin_42462552/article/details/103613215

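I. Introduction

Selenium is a tool for automated testing of web applications. It runs directly in a real browser and supports mainstream browsers such as Chrome and Firefox. Code can interact with the elements on a page (clicking, typing, and so on) and can read the content of chosen elements, which is why it can also be used as a web scraper.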

II. Installation

1. Selenium

pip install selenium

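2. WebDriver
Each browser needs its own WebDriver executable. Two are covered here, which is enough for most purposes.

Chrome
Download link: https://sites.google.com/a/chromium.org/chromedriver/downloads (requires access to sites outside mainland China)

http://npm.taobao.org/mirrors/chromedriver (mirror, directly reachable)

The driver version must match the browser version. Enter chrome://version in the address bar to see the Chrome version. Mine was Chrome 79, so I downloaded ChromeDriver 79.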
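The version-matching rule above can be sketched as a small check. This is only an illustration that assumes "matching" means the same major version, as in the Chrome 79 / ChromeDriver 79 case; the helper names and the version strings are made up for the example.

```python
def major_version(version):
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version, driver_version):
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Illustrative version strings; substitute the ones reported on your machine.
print(driver_matches_browser("79.0.3945.88", "79.0.3945.36"))
print(driver_matches_browser("79.0.3945.88", "78.0.3904.105"))
```

The first call compares two 79.x versions and the second compares 79.x with 78.x, so only the first pair counts as a match.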
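After downloading, unzip the archive and either put chromedriver.exe in the Python installation directory or add its location to the PATH environment variable. The Python installation directory is usually already on PATH, so placing the file there normally needs no further PATH changes.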

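Firefox
Download link:
https://github.com/mozilla/geckodriver/releases
To check the Firefox version, open the menu in the top-right corner, click [Help], then click [About Firefox].
My version was 71.0.

At the time of writing, the latest geckodriver release was v0.26.0, which supports Firefox 60 and later, so that is the one I downloaded.

The download links are further down the release page. Pick the build that matches your operating system.

After downloading, unzip the archive and either put geckodriver.exe in the Python installation directory or add its location to the PATH environment variable. The Python installation directory is usually already on PATH, so placing the file there normally needs no further PATH changes.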
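Whether the drivers really ended up on PATH can be checked from Python before any browser is started. The sketch below uses only the standard library; `find_driver` is a hypothetical helper name, not part of Selenium.

```python
import shutil


def find_driver(name):
    """Return the full path of a driver executable found on PATH, or None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT,
    # so "chromedriver" can resolve to chromedriver.exe.
    return shutil.which(name)


for driver_name in ("chromedriver", "geckodriver"):
    location = find_driver(driver_name)
    print(driver_name, "->", location or "not found on PATH")
```

If a driver is reported as not found, the file is either in a directory that is not on PATH or the terminal was opened before PATH was changed.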

III. Testing

1. Chrome

Headed mode (with a browser window)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver

driver = webdriver.Chrome()
url = 'http://www.baidu.com'
# Open the page
driver.get(url)
print(driver.title)
driver.quit()

Headless mode

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless mode for Chrome
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=chrome_options)

url = 'http://www.baidu.com'
# Open the page
driver.get(url)
print(driver.title)
driver.quit()

Both snippets print 百度一下，你就知道 (the title of the Baidu home page).

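Headless mode may produce a warning:

DeprecationWarning: use options instead of chrome_options
  driver = webdriver.Chrome(chrome_options=chrome_options)

To get rid of it, change

driver = webdriver.Chrome(chrome_options=chrome_options)

to

driver = webdriver.Chrome(options=chrome_options)

Recent Selenium 4 releases no longer accept the chrome_options keyword at all, so options= is the form to use there.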

2. Firefox

Headed mode (with a browser window)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver

driver = webdriver.Firefox()
url = 'http://www.baidu.com'
# Open the page
driver.get(url)
print(driver.title)
driver.quit()

Headless mode

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')
driver = webdriver.Firefox(options=options)

url = 'http://www.baidu.com'
# Open the page
driver.get(url)
print(driver.title)
driver.quit()

Both snippets print 百度一下，你就知道 (the title of the Baidu home page).
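The four test scripts differ only in the browser and the headless flag, so they can be folded into one helper. The following is a sketch rather than the article's code: `fetch_title` is a made-up name, and the try/finally block is an addition that makes sure the browser is closed even when loading the page fails.

```python
def fetch_title(url, browser="chrome", headless=True):
    """Open url in Chrome or Firefox and return the page title."""
    if browser not in ("chrome", "firefox"):
        raise ValueError("unsupported browser: %r" % browser)

    # Imported here so the helper can be defined before Selenium
    # and a driver are installed.
    from selenium import webdriver

    if browser == "chrome":
        options = webdriver.ChromeOptions()
        headless_flag = "--headless"
        driver_class = webdriver.Chrome
    else:
        options = webdriver.FirefoxOptions()
        headless_flag = "-headless"
        driver_class = webdriver.Firefox
    if headless:
        options.add_argument(headless_flag)

    driver = driver_class(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Runs even if get() raises, so no browser process is left behind.
        driver.quit()
```

With the drivers installed as described earlier, `fetch_title('http://www.baidu.com', browser='firefox')` should return the same title as the scripts above.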

IV. Hands-on Example

The site we are going to scrape is http://fund.eastmoney.com/fundguzhi.html
Here is the full code first.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random
import time
import mysql.connector
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Fund code
Code = []
# Fund name
Name = []
# Estimated value
Estimated_value = []
# Estimated growth rate
Estimated_growth_rate = []
# Today's net unit value
Today_net_unit_value = []
# Today's daily growth rate
Today_daily_growth_rate = []
# Estimation bias
Estimation_bias = []
# Yesterday's net unit value
Yesterday_net_unit_value = []
# Whether the fund can be purchased
Is_buy = []

# Headless mode for Chrome
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)


url = 'http://fund.eastmoney.com/fundguzhi.html'
# Open the page
driver.get(url)
# Request the page with requests as well, to read the status code and check that it opened
r = requests.get(url)
if r.status_code != 200:
    print('Connection error')
    # Close the browser, then exit the program
    driver.quit()
    exit()
# Sleep 5 seconds while the browser loads the page
time.sleep(5)
for i in range(2):
    print('Reached page %s' % str(i + 1))
    # Locate the cells with XPath and read the data.
    # find_elements_by_xpath is the Selenium 3 API; Selenium 4 uses find_elements(By.XPATH, ...)
    try:
        code = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[3]')
        name = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[4]/a[1]')
        estimated_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[5]')
        estimated_growth_rate = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[6]')
        today_net_unit_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[7]')
        today_daily_growth_rate = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[8]')
        estimation_bias = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[9]')
        yesterday_net_unit_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[10]')
        is_buy = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[11]/a')
        print('Fetched data for page %s, waiting' % str(i + 1))

        for j in range(len(code)):
            # Fund code
            Code.append(code[j].text)
            # Fund name
            Name.append(name[j].text)
            # Estimated value
            Estimated_value.append(estimated_value[j].text)
            # Estimated growth rate
            Estimated_growth_rate.append(estimated_growth_rate[j].text)
            # Today's net unit value
            Today_net_unit_value.append(today_net_unit_value[j].text)
            # Today's daily growth rate
            Today_daily_growth_rate.append(today_daily_growth_rate[j].text)
            # Estimation bias
            Estimation_bias.append(estimation_bias[j].text)
            # Yesterday's net unit value
            Yesterday_net_unit_value.append(yesterday_net_unit_value[j].text)

            # Decide whether the fund can be purchased
            isbuy = ''.join(is_buy[j].get_attribute('class'))
            # str.find returns -1 when the substring is not found
            if isbuy.find('gray') == -1:
                isbuy = '1'
            else:
                isbuy = '0'

            # Whether the fund can be purchased
            Is_buy.append(isbuy)
        time.sleep(random.choice(range(1, 5)))  # wait a random 1 to 4 seconds to avoid anti-scraping measures
        # Click through to the next page
        driver.find_element_by_xpath('//a[@class="next ttjj-iconfont"]').click()
    except Exception as e:
        print(e)
# Close the browser
driver.quit()
print('Data collection finished, browser closed')
# Print the data
# for j in range(len(Code)):
#     print(Code[j], Name[j], Estimated_value[j], Estimated_growth_rate[j], Today_net_unit_value[j],
#               Today_daily_growth_rate[j], Estimation_bias[j], Yesterday_net_unit_value[j], Is_buy[j])
# Exit the program
# exit()
print('Opening the database and saving the data')


try:
    conn = mysql.connector.connect(
                host='localhost', user='root', database='stocks', port=3306, password='123456',
                use_unicode=True)
    # Get a cursor
    cur = conn.cursor()
    for j in range(len(Code)):
        values = [Code[j], Name[j], Estimated_value[j], Estimated_growth_rate[j], Today_net_unit_value[j],
                  Today_daily_growth_rate[j], Estimation_bias[j], Yesterday_net_unit_value[j], Is_buy[j]]
        cur.execute("insert into estimation_of_net_worth_table values (%s,%s,%s,%s,%s,%s,%s,%s,%s)", values)
        conn.commit()
        if (j + 1) % 100 == 0:
            print('Saving record %s' % str(j + 1))
    # Close the cursor
    cur.close()
    # Close the connection
    conn.close()
    print('Data saved, database connection closed')
except Exception as e:
    print(e)

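Now let's walk through the code.

1. Define the fields

Before writing a scraper, decide which data you need, then start coding.
First we define several lists, one for each column of data.

# Fund code
Code = []
# Fund name
Name = []
# Estimated value
Estimated_value = []
# Estimated growth rate
Estimated_growth_rate = []
# Today's net unit value
Today_net_unit_value = []
# Today's daily growth rate
Today_daily_growth_rate = []
# Estimation bias
Estimation_bias = []
# Yesterday's net unit value
Yesterday_net_unit_value = []
# Whether the fund can be purchased
Is_buy = []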
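Nine parallel lists work, but they only stay in sync as long as every append succeeds. As an alternative sketch (not what the article's script does), one record per table row keeps the nine values together. The class name `FundEstimate` and the sample values are invented for the example.

```python
from dataclasses import astuple, dataclass


@dataclass
class FundEstimate:
    """One table row; the field order follows the nine lists above."""
    code: str
    name: str
    estimated_value: str
    estimated_growth_rate: str
    today_net_unit_value: str
    today_daily_growth_rate: str
    estimation_bias: str
    yesterday_net_unit_value: str
    is_buy: str


# Made-up values, only to show the shape of a record.
record = FundEstimate("000001", "Example Fund", "1.0000", "0.10%",
                      "1.0010", "0.20%", "0.10%", "0.9990", "1")
funds = [record]

# astuple() yields the nine values in column order, ready for an insert statement.
print(astuple(record))
```

A single list of records also removes the need to index nine lists with the same `j` when saving.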

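2. Open the browser and load the page

# Headless mode for Chrome
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

url = 'http://fund.eastmoney.com/fundguzhi.html'
# Open the page
driver.get(url)
# Request the page with requests as well, to read the status code and check that it opened
r = requests.get(url)
if r.status_code != 200:
    print('Connection error')
    # Close the browser, then exit the program
    driver.quit()
    exit()
# Sleep 5 seconds while the browser loads the page
time.sleep(5)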

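First we create a browser and open the page with get. We then need to check whether the page actually opened, otherwise the scraping that follows is pointless. I do not know of a way to read the HTTP status code through Selenium (if you know one, feel free to message me), so I request the page a second time with the requests library and check whether status_code is 200; if it is not, the program exits. After the check, the script sleeps for 5 seconds (this needs import time) to let the browser finish loading the page. If the page has not finished loading, locating elements later may fail.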
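A fixed 5-second sleep is either too long or too short depending on the network. Selenium's explicit waits are a common alternative: they poll until a condition holds or a timeout expires. The sketch below waits for the table rows used later in this article; `wait_for_rows` and `TABLE_ROWS_XPATH` are names chosen for this example, and the 10-second timeout is an arbitrary assumption.

```python
TABLE_ROWS_XPATH = '//tbody[@id="tableContent"]/tr'


def wait_for_rows(driver, timeout=10):
    """Block until the fund table has rows; raise TimeoutException otherwise."""
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the condition and returns its result (the list of rows)
    # as soon as at least one matching element is present in the DOM.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, TABLE_ROWS_XPATH))
    )
```

Calling `wait_for_rows(driver)` right after `driver.get(url)` would take the place of `time.sleep(5)`.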

The requests library must be installed first, otherwise the import fails:

pip install requests
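The status check can also be wrapped in a function that handles network errors, which the bare `requests.get` call does not. This is a sketch under assumptions: `page_is_reachable` is a hypothetical helper, the `fetch` parameter exists only so the function can be tried without network access, and the 10-second timeout is arbitrary.

```python
def page_is_reachable(url, fetch=None, timeout=10):
    """Return True when requesting url yields HTTP status 200."""
    if fetch is None:
        import requests  # third-party: pip install requests
        fetch = requests.get
    try:
        response = fetch(url, timeout=timeout)
    except Exception as exc:  # for example DNS failure, refused connection, timeout
        print("Request failed:", exc)
        return False
    return response.status_code == 200


# Stand-in response object so the example runs without a network connection.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


print(page_is_reachable("http://example.invalid",
                        fetch=lambda url, timeout: FakeResponse(404)))
```

Running this check before creating the browser avoids starting Chrome at all when the site is down.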

3. Locate the elements and read the data

for i in range(2):
    print('Reached page %s' % str(i + 1))
    # Locate the cells with XPath and read the data.
    # find_elements_by_xpath is the Selenium 3 API; Selenium 4 uses find_elements(By.XPATH, ...)
    try:
        code = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[3]')
        name = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[4]/a[1]')
        estimated_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[5]')
        estimated_growth_rate = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[6]')
        today_net_unit_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[7]')
        today_daily_growth_rate = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[8]')
        estimation_bias = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[9]')
        yesterday_net_unit_value = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[10]')
        is_buy = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[11]/a')
        print('Fetched data for page %s, waiting' % str(i + 1))

        for j in range(len(code)):
            # Fund code
            Code.append(code[j].text)
            # Fund name
            Name.append(name[j].text)
            # Estimated value
            Estimated_value.append(estimated_value[j].text)
            # Estimated growth rate
            Estimated_growth_rate.append(estimated_growth_rate[j].text)
            # Today's net unit value
            Today_net_unit_value.append(today_net_unit_value[j].text)
            # Today's daily growth rate
            Today_daily_growth_rate.append(today_daily_growth_rate[j].text)
            # Estimation bias
            Estimation_bias.append(estimation_bias[j].text)
            # Yesterday's net unit value
            Yesterday_net_unit_value.append(yesterday_net_unit_value[j].text)

            # Decide whether the fund can be purchased
            isbuy = ''.join(is_buy[j].get_attribute('class'))
            # str.find returns -1 when the substring is not found
            if isbuy.find('gray') == -1:
                isbuy = '1'
            else:
                isbuy = '0'

            # Whether the fund can be purchased
            Is_buy.append(isbuy)
        time.sleep(random.choice(range(1, 5)))  # wait a random 1 to 4 seconds to avoid anti-scraping measures
        # Click through to the next page
        driver.find_element_by_xpath('//a[@class="next ttjj-iconfont"]').click()
    except Exception as e:
        print(e)
# Close the browser
driver.quit()
print('Data collection finished, browser closed')

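The for loop runs only twice: scraping two pages is enough to demonstrate paging.
First, find_elements_by_xpath locates the elements and returns them as a list. If XPath is new to you, learn it first. Other locator strategies work as well. Note that find_elements_by_xpath is the Selenium 3 API; Selenium 4 removed it in favor of find_elements(By.XPATH, ...).
One thing to watch out for: the class attribute of the purchase button cannot be fetched directly through the XPath expression in Selenium, because the expression has to select elements, not attribute nodes. The following is wrong:

is_buy = driver.find_elements_by_xpath('//tbody[@id="tableContent"]/tr/td[11]/a/@class')

Call get_attribute('class') on the located element instead.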
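For readers on Selenium 4, the same lookup can be written with `find_elements(By.XPATH, ...)` followed by `get_attribute`. This is a sketch: `buy_button_classes` and `BUY_LINK_XPATH` are names made up for the example, and the XPath is the one used throughout this article.

```python
BUY_LINK_XPATH = '//tbody[@id="tableContent"]/tr/td[11]/a'


def buy_button_classes(driver):
    """Return the class attribute of every purchase link in the table."""
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium.webdriver.common.by import By

    # The XPath selects the <a> elements themselves; the attribute is then
    # read from each element rather than selected with /@class.
    links = driver.find_elements(By.XPATH, BUY_LINK_XPATH)
    return [link.get_attribute("class") or "" for link in links]
```

The `or ""` guards against links without a class attribute, for which `get_attribute` returns None.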
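Next, loop over the lists returned by find_elements_by_xpath and save the values.
Part of that loop checks whether the [购买] (Buy) button is gray. The page source shows that a gray button has the class ui-btn ui-btn-xs ui-btn-gray, while an orange button has ui-btn ui-btn-xs ui-btn-orange. The program checks whether the class string contains gray to decide whether the fund can be purchased.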

<a class="ui-btn ui-btn-xs ui-btn-gray" href="javascript:void(0);" target="_self">购买</a>

<a class="ui-btn ui-btn-xs ui-btn-orange" href="https://trade.1234567.com.cn/FundtradePage/default2.aspx?fc=161834" target="_blank">购买</a>


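# Decide whether the fund can be purchased
isbuy = ''.join(is_buy[j].get_attribute('class'))
# str.find returns -1 when the substring is not found
if isbuy.find('gray') == -1:
    isbuy = '1'
else:
    isbuy = '0'

# Whether the fund can be purchased
Is_buy.append(isbuy)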
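The gray check is easy to isolate into a function that can be tried without a browser. The sketch below differs slightly from the snippet above: it looks for the exact class token ui-btn-gray instead of the substring gray, so an unrelated class that happens to contain "gray" would not count. `purchasable_flag` is a name chosen for the example.

```python
def purchasable_flag(class_attr):
    """Return '0' when the class list marks the button gray, otherwise '1'."""
    classes = (class_attr or "").split()
    return "0" if "ui-btn-gray" in classes else "1"


# The two class strings quoted from the page source above.
print(purchasable_flag("ui-btn ui-btn-xs ui-btn-gray"))
print(purchasable_flag("ui-btn ui-btn-xs ui-btn-orange"))
```

The gray class string maps to '0' (not purchasable) and the orange one to '1', matching the values stored in Is_buy.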

Next comes a random delay, followed by the page turn.
The page source shows the tag of the next-page button:

<a href="javascript:;" target="_self" class="next ttjj-iconfont" data-page="2">&gt;</a>

Locate the a tag whose class equals "next ttjj-iconfont" and click it:

driver.find_element_by_xpath('//a[@class="next ttjj-iconfont"]').click()
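The paging logic (read a page, pause for a random interval, click next) can be separated from the browser calls. In this sketch `crawl_pages` takes two callables that stand in for the Selenium code; all names are hypothetical. Unlike the loop above, it does not click "next" after the last wanted page, and it stops early when there is no next page.

```python
import random
import time


def crawl_pages(read_page, click_next, max_pages=2, sleep=time.sleep):
    """Collect rows from up to max_pages pages.

    read_page() returns the rows on the current page; click_next() moves to
    the next page and returns False when there is none.
    """
    rows = []
    for page in range(1, max_pages + 1):
        rows.extend(read_page())
        if page == max_pages:
            break  # do not click past the last page we want
        sleep(random.uniform(1, 4))  # random pause between pages
        if not click_next():
            break
    return rows


# Demo with in-memory "pages" standing in for the browser.
pages = [["000001", "000002"], ["000003"], ["000004"]]
position = {"index": 0}


def read_page():
    return pages[position["index"]]


def click_next():
    if position["index"] + 1 >= len(pages):
        return False
    position["index"] += 1
    return True


print(crawl_pages(read_page, click_next, max_pages=2, sleep=lambda seconds: None))
```

In the real scraper, `read_page` would wrap the find_elements calls and `click_next` would wrap the click on the "next ttjj-iconfont" link.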

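4. Save the data to the database

At this point the data has been scraped. Uncomment the following code to print it to the console and take a look:

# Print the data
for j in range(len(Code)):
    print(Code[j], Name[j], Estimated_value[j], Estimated_growth_rate[j], Today_net_unit_value[j],
          Today_daily_growth_rate[j], Estimation_bias[j], Yesterday_net_unit_value[j], Is_buy[j])
# Exit the program
exit()

Next we connect to the MySQL database and save the data: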

try:
    conn = mysql.connector.connect(
                host='localhost', user='root', database='stocks', port=3306, password='123456',
                use_unicode=True)
    # Get a cursor
    cur = conn.cursor()
    for j in range(len(Code)):
        values = [Code[j], Name[j], Estimated_value[j], Estimated_growth_rate[j], Today_net_unit_value[j],
                  Today_daily_growth_rate[j], Estimation_bias[j], Yesterday_net_unit_value[j], Is_buy[j]]
        cur.execute("insert into estimation_of_net_worth_table values (%s,%s,%s,%s,%s,%s,%s,%s,%s)", values)
        conn.commit()
        if (j + 1) % 100 == 0:
            print('Saving record %s' % str(j + 1))
    # Close the cursor
    cur.close()
    # Close the connection
    conn.close()
    print('Data saved, database connection closed')
except Exception as e:
    print(e)

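First we connect to the database with mysql.connector.connect and create a cursor. Then execute runs the insert statement and commit commits the transaction. Finally we close the cursor and the database connection.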
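Committing after every row works but is slow for large tables. A common variation, sketched here, inserts all rows with `executemany` and commits once. To keep the example runnable without a MySQL server it is demonstrated against an in-memory SQLite table; the helper names and the two-column `demo_funds` table are assumptions for illustration, not the article's schema.

```python
import sqlite3


def build_insert_sql(table, n_columns, placeholder="%s"):
    """Build an 'insert into <table> values (...)' statement.

    The table name is interpolated directly, so it must come from trusted code.
    """
    marks = ",".join([placeholder] * n_columns)
    return "insert into %s values (%s)" % (table, marks)


def save_rows(conn, table, rows, placeholder="%s"):
    """Insert all rows in one batch and commit once; roll back on failure."""
    if not rows:
        return 0
    sql = build_insert_sql(table, len(rows[0]), placeholder)
    cur = conn.cursor()
    try:
        cur.executemany(sql, rows)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
    return len(rows)


# Demo with SQLite, which uses "?" placeholders; mysql.connector uses "%s".
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table demo_funds (code text, name text)")
saved = save_rows(demo_conn, "demo_funds",
                  [("000001", "Example Fund A"), ("000002", "Example Fund B")],
                  placeholder="?")
count = demo_conn.execute("select count(*) from demo_funds").fetchone()[0]
print(saved, count)
demo_conn.close()
```

With the connection from the script above, the call would use the default "%s" placeholder and one nine-value tuple per fund.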

That completes the whole project.
