Web Scraping Basics, Part 7: Handling Dynamically Loaded Data, with Hands-On Examples

Latest recommended article published on 2024-04-23 15:54:17

Yolanda Yan 9


Categories: web scraping, python. Tags: web scraping, python, chrome, selenium

Article link: https://blog.csdn.net/Amy9_Miss/article/details/123018196


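Basic usage of the selenium module

Introduction

selenium began as an automated testing tool. In web scraping it is used mainly because requests cannot execute JavaScript. selenium drives a real browser and simulates user actions such as navigating, typing, clicking, and scrolling, so that you can read the page as it looks after rendering. It supports multiple browsers.

Question: how does the selenium module relate to web scraping?

It makes it easy to fetch data that a site loads dynamically
It makes it easy to simulate a login

The selenium module is a module built on browser automation.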

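Environment setup

Install selenium: pip install selenium
Download a driver for your browser
The chromedriver version must match your Chrome version (at least the major version number); otherwise the driver will not start a session
1. Check your Chrome version by entering chrome://version/ in the address bar
2. Download the driver from http://chromedriver.storage.googleapis.com/index.html (this index only lists drivers up to Chrome 114; drivers for newer versions are published on the Chrome for Testing site)
I downloaded the win32 build, which also runs fine on 64-bit Windows
3. Unzip the archive and copy chromedriver.exe into the directory your scripts expect (the examples below load it from the working directory)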
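
Since a version mismatch is the most common setup problem, here is a minimal sketch of the comparison you are doing by eye in the steps above. The function names and the version strings are my own examples, and it assumes that matching the major version is enough.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


if __name__ == "__main__":
    # Both version strings are made-up examples, not real requirements.
    print(versions_match("98.0.4758.102", "98.0.4758.80"))
    print(versions_match("98.0.4758.102", "97.0.4692.71"))
```

Paste in the version shown at chrome://version/ and the version in the name of the driver folder you downloaded.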

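selenium workflow:

Instantiate a browser object
Write the browser automation code
- Send a request: get(url)
- Locate elements: the find family of methods, e.g. find_element(By.ID, 'xxx')
- Interact with an element: send_keys('xxx')
- Run JavaScript: execute_script('jsCode')
- Go forward and back: forward(), back()
- Close the browser: quit()
Handling iframes with selenium
- If the element you want lives inside an iframe, you must first call switch_to.frame(id)
Action chains (dragging): from selenium.webdriver import ActionChains
- Instantiate an action chain object: action = ActionChains(bro)
- action.click_and_hold(div): press and hold the mouse button on the element
- action.move_by_offset(xoffset=x, yoffset=y): move the mouse by the given offset
- perform(): execute the queued actions immediately
- action.release(): release the held mouse button (this is also only queued, so follow it with perform())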
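
The request, locate, interact steps above can be sketched as one small function. This is only an illustration: search() and its parameters are names I made up, and the two locator strings are the values behind selenium's By.ID and By.CSS_SELECTOR constants, written out so the function works with any object that offers get() and find_element().

```python
# Locator strategy strings; these are the values of selenium's
# By.ID and By.CSS_SELECTOR constants.
BY_ID = "id"
BY_CSS = "css selector"


def search(driver, url, box_id, button_css, keyword):
    """Open url, type keyword into the input box, click the button, return the HTML."""
    driver.get(url)                                   # send the request
    box = driver.find_element(BY_ID, box_id)          # locate the input box
    box.send_keys(keyword)                            # interact with it
    driver.find_element(BY_CSS, button_css).click()   # click the search button
    return driver.page_source                         # HTML after rendering
```

With a real browser you would pass in the object returned by webdriver.Chrome(...) and call quit() on it afterwards.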

Basic usage / demos

Headless Chrome

Headless Chrome is Chrome running without a visible window.

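from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from time import sleep

# Create an options object that makes Chrome start without a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Path to the driver
path = r"./chromedriver.exe"

# Create the browser object (Selenium 4 takes the options through options=)
bro = webdriver.Chrome(service=Service(path), options=chrome_options)

url = 'http://www.baidu.com'
bro.get(url)
sleep(3)

# Take a screenshot of the current page and save it (the ./result directory must already exist)
bro.save_screenshot('./result/baidu.png')
bro.quit()  # quit the browser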
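
One thing that can go wrong in the example above: as far as I know, save_screenshot does not create missing folders, so the screenshot is not written when ./result does not exist. A small helper like the following sketch avoids that; screenshot_path is a hypothetical name of mine, not part of selenium.

```python
from pathlib import Path


def screenshot_path(directory: str, name: str) -> str:
    """Create the directory if needed and return the full path for name."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return str(folder / name)
```

It would be used as bro.save_screenshot(screenshot_path('./result', 'baidu.png')).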

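Avoiding selenium detection

Many large sites now check for selenium. For example, when you visit a site such as Taobao in a normal browser, window.navigator.webdriver is undefined (false in recent Chrome versions), whereas under selenium it is true. How can we deal with this?

One common mitigation is to set a startup option before launching Chromedriver: turn on Chrome's experimental option excludeSwitches with the value ['enable-automation']. The complete code is as follows: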

from selenium import webdriver
from selenium.webdriver import ChromeOptions

# Anti-detection
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

bro = webdriver.Chrome(options=option)
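
To see whether the option has any effect in your setup, you can ask the page for the very flag that sites look at. A minimal sketch, where looks_automated is a name I made up and driver is any object with selenium's execute_script method:

```python
def looks_automated(driver) -> bool:
    """Return True when the page reports navigator.webdriver as true."""
    value = driver.execute_script("return navigator.webdriver")
    # undefined comes back as None, so only an explicit True counts
    return value is True
```

Call looks_automated(bro) after bro.get(...). If it still returns True, the site can see the flag and this option alone is probably not enough.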

Headless Chrome + anti-detection

from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# ChromeOptions carries both the headless flags and the anti-detection option
from selenium.webdriver import ChromeOptions

if __name__ == '__main__':
    # Use a single options object: webdriver.Chrome accepts only one
    option = ChromeOptions()

    # Run without a visible window (headless browser)
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')

    # Reduce the risk of selenium being detected
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)

    # Load the page without a visible window
    bro.get('https://www.baidu.com')

    print(bro.page_source)
    sleep(2)

    # Close the browser
    bro.quit()
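
If several scripts need the same settings, it may help to build the options in one place. The sketch below is one way to do it: chrome_arguments and build_options are hypothetical helper names, and the --window-size flag is an extra I added because headless screenshots otherwise use a default size. The selenium import sits inside build_options so the flag list can be inspected without a browser installed.

```python
def chrome_arguments(headless: bool = True, window_size=(1920, 1080)):
    """Return the command-line flags to pass to add_argument()."""
    args = []
    if headless:
        args += ["--headless", "--disable-gpu"]
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def build_options(headless: bool = True):
    """Build one ChromeOptions object that carries every setting."""
    from selenium.webdriver import ChromeOptions  # imported lazily

    option = ChromeOptions()
    for arg in chrome_arguments(headless):
        option.add_argument(arg)
    option.add_experimental_option("excludeSwitches", ["enable-automation"])
    return option
```

The result is passed as webdriver.Chrome(service=..., options=build_options()).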

Handling pages that embed a child page (iframe)

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains  # the action chain class
from time import sleep
# Anti-detection
from selenium.webdriver import ChromeOptions


if __name__ == '__main__':
    # Reduce the risk of selenium being detected
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Instantiate a browser object (passing in the browser driver)
    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)
    # Have the browser request the given URL
    bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

    # An element inside an iframe can only be located after switching into that iframe
    bro.switch_to.frame('iframeResult')  # change the scope used for locating elements
    div = bro.find_element(By.ID, 'draggable')
    # print(div)

    # Action chain
    action = ActionChains(bro)
    # Press and hold the mouse button on the element
    action.click_and_hold(div)

    for i in range(15):
        # perform(): execute the queued actions immediately
        # move_by_offset(x, y): x is horizontal, y is vertical
        action.move_by_offset(xoffset=17, yoffset=0).perform()  # move 17 pixels to the right each time
        sleep(0.3)
    # Release the mouse button; release() is only queued, so perform() is needed to run it
    action.release().perform()

    bro.quit()  # close the browser
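
The loop above hard-codes 15 moves of 17 pixels (255 pixels in total). If you know the distance instead, a helper can split it into steps. A minimal sketch, with drag_steps being a name of my own and the assumption that the drag is a straight horizontal line:

```python
def drag_steps(total: int, step: int):
    """Split a horizontal distance into moves of at most `step` pixels."""
    if total < 0 or step <= 0:
        raise ValueError("total must be >= 0 and step must be > 0")
    full, rest = divmod(total, step)
    # `full` moves of `step` pixels, plus one shorter move for the remainder
    return [step] * full + ([rest] if rest else [])
```

Inside the script you would write for dx in drag_steps(255, 17): action.move_by_offset(dx, 0).perform(). After working inside the iframe, bro.switch_to.default_content() returns to the outer page.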

Hands-on practice

Task 1: search Baidu for food pictures

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
from time import sleep
# Anti-detection
from selenium.webdriver import ChromeOptions


if __name__ == '__main__':
    # Reduce the risk of selenium being detected
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # The path to your browser driver; the leading r keeps backslashes from being treated as escapes
    driver = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)
    # Open the Baidu home page with get
    driver.get('http://www.baidu.com')
    # Find the "Settings" entry on the page and hover over it to open its menu
    element = driver.find_element(By.XPATH, '//*[@id="s-usersetting-top"]')  # the element to hover over
    ActionChains(driver).move_to_element(element).perform()  # hover the mouse over that element
    sleep(2)
    # In the menu, open "搜索设置" (Search Settings)
    driver.find_element(By.LINK_TEXT, "搜索设置").click()
    sleep(2)
    # Select 50 results per page
    driver.find_element(By.XPATH, '//*[@id="se-setting-3"]/span[3]/label').click()
    sleep(2)
    # Click "Save settings"
    driver.find_element(By.CLASS_NAME, "prefpanelgo").click()
    sleep(2)
    # Handle the alert that pops up: accept() confirms, dismiss() cancels
    driver.switch_to.alert.accept()
    sleep(2)
    # Find the Baidu search box and type 美食 (food)
    driver.find_element(By.ID, 'kw').send_keys('美食')
    sleep(2)
    # Click the search button
    driver.find_element(By.ID, 'su').click()
    sleep(2)
    # On the results page, find the link "美食 - 百度图片" (food - Baidu Images) and open it
    driver.find_element(By.LINK_TEXT, '美食 - 百度图片').click()
    sleep(3)
    # Close the browser
    driver.quit()
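
The fixed sleep(2) calls above either waste time or are too short on a slow connection. selenium ships WebDriverWait for this purpose; the sketch below is a hand-rolled stand-in that shows the idea without depending on selenium. wait_until and its parameters are names I chose for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)  # wait a little before checking again
```

For example, wait_until(lambda: driver.find_elements(By.LINK_TEXT, '搜索设置')) returns as soon as the link exists, because find_elements gives back an empty list until then.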

Task 2: use selenium to scrape company names from the National Medical Products Administration (NMPA) site

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from lxml import etree
from time import sleep
# Anti-detection
from selenium.webdriver import ChromeOptions

if __name__ == '__main__':
    # Reduce the risk of selenium being detected
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Instantiate a browser object (passing in the browser driver)
    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)
    # Have the browser request the given URL
    bro.get('http://scxk.nmpa.gov.cn:81/xk/')
    # Get the source of the page as the browser currently sees it
    page_text = bro.page_source

    # Parse the company names
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)

    sleep(5)
    bro.quit()  # close the browser
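
The parsing step can be tried without a browser by feeding it a saved copy of page_source. The sketch below does the same extraction with the standard library's html.parser instead of lxml, purely so that it runs with no extra install; CompanyNameParser and company_names are my own names. It also skips entries without a title instead of failing on [0].

```python
from html.parser import HTMLParser


class CompanyNameParser(HTMLParser):
    """Collect the title attribute of every <dl> inside <ul id="gzlist">."""

    def __init__(self):
        super().__init__()
        self.in_list = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("id") == "gzlist":
            self.in_list = True
        elif tag == "dl" and self.in_list and attrs.get("title"):
            self.names.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False  # assumes the list holds no nested <ul>


def company_names(page_text: str):
    """Return the company names found in the given HTML text."""
    parser = CompanyNameParser()
    parser.feed(page_text)
    return parser.names
```

An empty result usually means the list had not finished loading when page_source was read.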

Task 3: search for a keyword on Taobao, then go back and forward in the browser

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep
# Anti-detection
from selenium.webdriver import ChromeOptions

if __name__ == '__main__':
    # Reduce the risk of selenium being detected
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Instantiate a browser object (passing in the browser driver)
    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)
    # Have the browser request the given URL
    bro.get('https://www.taobao.com/?spm=a1z02.1.1581860521.1.lgieYS')
    # Locate the search box
    search_input = bro.find_element(By.ID, 'q')
    # Interact with it
    search_input.send_keys('Iphone')

    # Run a piece of JavaScript: scroll to the bottom of the page
    bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    sleep(2)

    # Click the search button
    btn = bro.find_element(By.CSS_SELECTOR, '.btn-search')
    btn.click()

    bro.get('https://www.baidu.com')
    sleep(2)
    bro.back()  # go back one page, like the browser's Back button
    sleep(2)
    bro.forward()  # go forward one page

    sleep(5)
    bro.quit()   # close the browser
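
A single scrollTo call only reaches the bottom of what is loaded at that moment. On pages that keep loading content as you scroll, a loop like the following sketch can be used. scroll_to_bottom is a hypothetical helper, and it assumes the page height stops growing once everything has loaded.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            return rounds
        last_height = height
    return max_rounds
```

The max_rounds limit keeps the loop from running forever on pages with endless feeds.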

Task 4: simulate logging in to QQ

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep
# Anti-detection
from selenium.webdriver import ChromeOptions

if __name__ == '__main__':
    # Reduce the risk of selenium being detected
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Instantiate a browser object (passing in the browser driver)
    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)
    # Have the browser request the given URL
    bro.get('https://qzone.qq.com/')

    # The login form lives inside an iframe, so switch into it first
    bro.switch_to.frame('login_frame')
    # Switch from QR-code login to account/password login
    a_tag = bro.find_element(By.ID, 'switcher_plogin')
    a_tag.click()

    userName_tag = bro.find_element(By.ID, 'u')
    password_tag = bro.find_element(By.ID, 'p')

    sleep(1)
    userName_tag.send_keys('your QQ account')  # placeholder
    sleep(1)
    password_tag.send_keys('your QQ password')  # placeholder
    sleep(1)
    btn = bro.find_element(By.ID, 'login_button')
    btn.click()

    sleep(3)

    bro.quit()
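
Rather than typing a real account and password into the script, you may prefer to read them from environment variables. A minimal sketch, where load_credentials and the variable names QQ_USER and QQ_PASSWORD are my own choices:

```python
import os


def load_credentials(env=None, user_key="QQ_USER", password_key="QQ_PASSWORD"):
    """Read the account and password from environment variables."""
    env = os.environ if env is None else env
    user = env.get(user_key)
    password = env.get(password_key)
    if not user or not password:
        raise RuntimeError(f"set {user_key} and {password_key} before running")
    return user, password
```

The two values then go into userName_tag.send_keys(user) and password_tag.send_keys(password).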

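Task 5: simulate logging in to the gushiwen.cn (classical Chinese poetry) site with selenium

Simulated login workflow

Open the login page with selenium
Take a screenshot of the page selenium currently has open
Crop the part of the screenshot that contains the captcha image
- Benefit: the captcha image is guaranteed to belong to this particular login attempt
Recognize the captcha image (the region found from its coordinates)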
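
The cropping step depends on turning the element's position and size into a box for the screenshot. Here is a minimal sketch of that calculation. crop_box is a name I made up, and the scale parameter is my own addition for screens where the screenshot has more pixels than the page reports (it could be read with execute_script('return window.devicePixelRatio')).

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels."""
    left = location["x"] * scale
    top = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    bottom = (location["y"] + size["height"]) * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

The location and size dictionaries have the same shape as the element's location and size attributes, and the result can be passed straight to Image.crop(). selenium elements also have a screenshot(filename) method that saves just the element, which may make the manual crop unnecessary.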

Code

The captcha recognition is wrapped in the file VerificationCode.py:

import re  # regular expressions
from PIL import Image  # opening and processing images
import pytesseract  # image-to-text (OCR)


class VerificationCode:
    """Recognize a captcha image"""
    def __init__(self, img_path):
        self.img_path = img_path

    def processing_image(self):
        """Convert the image to pure black and white"""
        image_obj = Image.open(self.img_path)   # load the captcha image
        img = image_obj.convert("L")  # convert to grayscale
        pixdata = img.load()
        w, h = img.size
        threshold = 160
        # Walk every pixel: below the threshold becomes black, everything else white
        for y in range(h):
            for x in range(w):
                if pixdata[x, y] < threshold:
                    pixdata[x, y] = 0
                else:
                    pixdata[x, y] = 255
        return img

    def delete_spot(self):
        """Remove isolated black pixels (noise) and save the cleaned image"""
        images = self.processing_image()
        data = images.getdata()
        w, h = images.size
        black_point = 0
        for x in range(1, w - 1):
            for y in range(1, h - 1):
                mid_pixel = data[w * y + x]  # value of the centre pixel
                if mid_pixel < 50:  # dark pixel: look at its four neighbours
                    top_pixel = data[w * (y - 1) + x]
                    left_pixel = data[w * y + (x - 1)]
                    down_pixel = data[w * (y + 1) + x]
                    right_pixel = data[w * y + (x + 1)]
                    # Count how many of the four neighbours are black
                    if top_pixel < 10:
                        black_point += 1
                    if left_pixel < 10:
                        black_point += 1
                    if down_pixel < 10:
                        black_point += 1
                    if right_pixel < 10:
                        black_point += 1
                    # No black neighbour at all: treat the pixel as noise and whiten it
                    if black_point < 1:
                        images.putpixel((x, y), 255)
                    black_point = 0
        # images.show()
        new_img_path = ''.join(self.img_path.split('.png')[:-1]) + '_new.png'
        images.save(new_img_path)
        return new_img_path

    def image_str(self):
        """Run OCR on the cleaned image and return the first four characters"""
        new_img_path = self.delete_spot()
        image = Image.open(new_img_path)  # load the cleaned image
        result = pytesseract.image_to_string(image)  # image to text
        # Strip everything except Chinese characters, digits, and letters
        resultj = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", result)
        result_four = resultj[0:4]  # keep only the first 4 characters
        return result_four
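
The two cleaning steps in the class (thresholding, then removing isolated dots) are easier to follow on a tiny grid of numbers. The sketch below repeats them on plain nested lists so it runs without Pillow. binarize and remove_isolated are illustrative names, and unlike the class it reads from the original grid and writes to a copy.

```python
def binarize(gray, threshold=160):
    """Turn a grid of grayscale values into 0 (black) or 255 (white)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def remove_isolated(binary):
    """Whiten every black pixel that has no black neighbour above, below, left or right."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    out = [row[:] for row in binary]  # work on a copy
    for y in range(1, h - 1):         # border pixels are left untouched
        for x in range(1, w - 1):
            if binary[y][x] == 0:
                neighbours = (
                    binary[y - 1][x], binary[y + 1][x],
                    binary[y][x - 1], binary[y][x + 1],
                )
                if all(n != 0 for n in neighbours):
                    out[y][x] = 255
    return out
```

A single dark dot surrounded by light pixels disappears, while two dark pixels next to each other are kept, which is why thin strokes of a character survive the cleaning.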

The main file is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver import ChromeOptions
from time import sleep
from PIL import Image
from VerificationCode import VerificationCode


def getCodeText(imgPath):
    """
    Wrapper around the captcha recognition
    :param imgPath: path of the captcha image
    :return: the recognized captcha text
    """
    a = VerificationCode(imgPath)
    result = a.image_str()
    return result


if __name__ == '__main__':
    # Use a single options object: webdriver.Chrome accepts only one
    option = ChromeOptions()
    # Run without a visible window
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    # Anti-detection
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Open the login page with selenium
    bro = webdriver.Chrome(service=Service(r"./chromedriver.exe"), options=option)

    bro.get('https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx')
    sleep(1)

    # save_screenshot: take a screenshot of the current page and save it
    bro.save_screenshot('./result/aa.png')

    # Work out the top-left and bottom-right corners of the captcha image (this fixes the crop region)
    code_img_ele = bro.find_element(By.XPATH, '//*[@id="imgCode"]')
    location = code_img_ele.location  # x, y of the captcha image's top-left corner
    print("location: ", location)
    size = code_img_ele.size  # width and height of the captcha image
    print("size: ", size)

    # Top-left and bottom-right coordinates
    rangle = (
        int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))

    # The captcha region is now known
    i = Image.open('./result/aa.png')
    code_img_name = './result/code.png'
    # crop cuts the image down to the given region
    frame = i.crop(rangle)
    frame.save(code_img_name)

    # Run the OCR code on the captcha image; the recognition rate is not very high
    code_text = getCodeText('./result/code.png')
    if not code_text:
        print("Recognition failed!")
    else:
        print("Recognized text:", code_text)

        # Enter the user name and password, then find the login button and click it
        bro.find_element(By.ID, 'email').send_keys('user name')  # placeholder
        sleep(2)
        bro.find_element(By.ID, 'pwd').send_keys('password')  # placeholder
        sleep(2)
        bro.find_element(By.ID, 'code').send_keys(code_text)
        sleep(2)
        bro.find_element(By.ID, 'denglu').click()
        sleep(10)

        # Check whether the login succeeded: the page heading should read 我的收藏 (My Favorites)
        try:
            if bro.find_element(By.XPATH, '//*[@id="html"]/body/div[2]/div[1]/span[1]').text == '我的收藏':
                print("Login succeeded!")
        except Exception:
            print("The captcha was recognized incorrectly, so the login failed!")
    bro.quit()
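
Because the recognition rate is low, one attempt often fails. The sketch below shows one way to retry: first_valid_code is a hypothetical helper, recognize stands for any function that fetches a fresh captcha and returns the OCR text, and the check assumes the captcha is exactly four letters or digits.

```python
import re


def first_valid_code(recognize, attempts=3, length=4):
    """Call recognize() up to `attempts` times; return the first plausible code."""
    pattern = re.compile(rf"[0-9A-Za-z]{{{length}}}")
    for _ in range(attempts):
        text = recognize() or ""
        if pattern.fullmatch(text):  # exactly `length` letters or digits
            return text
    return None  # nothing usable was recognized
```

In the script above, recognize would take a new screenshot, crop it, and call getCodeText each time, since a rejected captcha is normally replaced by a new image.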
