[Selenium] Selenium notes

Latest recommended article published on 2022-07-24 19:37:34

FeatureOverload

Categories: Crawlers, Automated testing

Article link: https://blog.csdn.net/qq_29757283/article/details/93330737

Where does Selenium come from?

To be continued.

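1 Basic usage

1.1 Drivers

Firefox: Releases · mozilla/geckodriver

Selenium fails to open Firefox/Google Chrome with the message "WebDriverException: Message: 'geckodriver' executable needs to be in PATH."

1.2 Driver location

The downloaded driver may not be on the PATH environment variable.
In that case, Selenium has to be told where to find it.

Usage: pass the executable_path argument: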

from selenium import webdriver
browser = webdriver.Chrome(
    executable_path="/<path>/<to>/chromedriver",
)

## ff_browser = webdriver.Firefox(
##     executable_path='/<path>/<to>/geckodriver')

## browser.get("https://www.example.com")
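The PATH check described above can also be scripted. The following is a minimal sketch, not part of the original notes: `resolve_driver_path` is a hypothetical helper that uses the standard library's `shutil.which` to see whether the driver is already on PATH and otherwise falls back to an explicit location. `make_chrome` assumes Selenium 4, where the driver path is given to a `Service` object instead of being passed to `webdriver.Chrome` directly as in the snippet above; the fallback path is a placeholder.

```python
import shutil


def resolve_driver_path(driver_name, fallback_path):
    """Return the driver found on PATH, or fallback_path if it is not there."""
    found = shutil.which(driver_name)
    return found if found is not None else fallback_path


def make_chrome(fallback_path="/<path>/<to>/chromedriver"):
    """Start Chrome with an explicit driver path (assumes Selenium 4)."""
    # Imported here so resolve_driver_path stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver_path = resolve_driver_path("chromedriver", fallback_path)
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Calling `resolve_driver_path("geckodriver", "/<path>/<to>/geckodriver")` works the same way for Firefox.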

2. Launch configuration

2.1 Headless mode

Firefox:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')
browser = webdriver.Firefox(options=options)

# Test: load a page and save a screenshot to verify that it rendered.
browser.get("https://www.baidu.com")
browser.save_screenshot("/tmp/baidu.png")

browser.quit()

Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

cr_options = Options()
cr_options.headless = True

browser = webdriver.Chrome(
    executable_path="/<path>/<to>/chromedriver",
    options=cr_options,  # `chrome_options=` is the deprecated name of this keyword
)

2.2 Changing request headers

user-agent:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

## Chrome
cr_options = Options()
cr_options.add_argument(('user-agent='
                         'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/70.0.3538.102 Safari/537.36'))
# cr_options.headless = True  # headless mode

browser = webdriver.Chrome(
    executable_path="/<path>/<to>/chromedriver",
    options=cr_options, )
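To confirm that the override took effect, the browser can be asked which user agent it reports. The sketch below is only an illustration: both helper names are hypothetical, and `effective_user_agent` assumes `browser` is an already started WebDriver instance. It relies on `execute_script`, which runs JavaScript in the current page.

```python
def build_user_agent_argument(user_agent):
    """Format a user-agent string as a Chrome command-line argument."""
    return "user-agent=" + user_agent


def effective_user_agent(browser):
    """Ask the page's JavaScript which user agent the browser reports."""
    return browser.execute_script("return navigator.userAgent")
```

After `cr_options.add_argument(build_user_agent_argument(ua))` and launching the browser, `effective_user_agent(browser)` should return the same `ua` string if the option was applied.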

2.3 Resource-saving mode: do not load videos or images

python3 crawler notes 11: Selenium and some browser settings
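As a rough sketch of the image part only (video is not covered here), Chrome can be told not to download images through a profile preference. The preference key below is a widely used convention rather than something taken from the linked article, so treat it as an assumption; the helper names are hypothetical.

```python
def chrome_image_blocking_prefs():
    """Chrome profile preference that blocks image loading (2 means 'block')."""
    return {"profile.managed_default_content_settings.images": 2}


def make_chrome_options_without_images():
    """Build Chrome options that skip image downloads."""
    # Imported here so the prefs helper stays usable without Selenium.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option("prefs", chrome_image_blocking_prefs())
    return options
```

The returned object is passed to `webdriver.Chrome` in the same way as `cr_options` in the earlier sections.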

3. Usage collection

3.1 Switching iframes

Selenium - iframe switching - cjeric's blog - CSDN Blog
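A common pitfall with iframes is forgetting to switch back to the top-level document. The sketch below wraps the two `switch_to` calls in a context manager so the switch back always happens; `inside_frame` is a hypothetical helper and is not taken from the linked post.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(browser, frame_reference):
    """Switch into an iframe for the duration of a with-block.

    frame_reference may be a name or id string, an index, or a WebElement.
    """
    browser.switch_to.frame(frame_reference)
    try:
        yield browser
    finally:
        # Always return to the top-level document, even after an error.
        browser.switch_to.default_content()
```

Inside `with inside_frame(browser, "<iframe-id>"):` element lookups resolve against the iframe's document; after the block they resolve against the main page again.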

3.2 Multiple tabs

python Selenium chromedriver automation timeout error: you need a multi-tab shield to protect you | 麋鹿君
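For reference, a minimal sketch of basic tab handling follows. It is not necessarily the approach of the linked article: the helper names are hypothetical, and opening the tab through `window.open` is one common technique among several. It uses `window_handles`, `switch_to.window`, and `close`.

```python
def open_in_new_tab(browser, url):
    """Open url in a new tab, switch to it, and return its window handle."""
    known = set(browser.window_handles)
    browser.execute_script("window.open('about:blank');")
    # The new tab is the only handle that was not present before.
    new_handle = [h for h in browser.window_handles if h not in known][0]
    browser.switch_to.window(new_handle)
    browser.get(url)
    return new_handle


def close_tab_and_return(browser, return_handle):
    """Close the current tab and switch back to return_handle."""
    browser.close()
    browser.switch_to.window(return_handle)
```

A typical flow keeps the original tab's handle from `browser.current_window_handle`, does the risky work in the new tab, and then calls `close_tab_and_return` with the saved handle.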

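3.3 Waiting for elements to load - EC

To run the remaining code as soon as the target element has loaded (without waiting for all of the page's resource requests and JavaScript to finish), a few parameters must be set when the browser is launched.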

from selenium import webdriver
from selenium.webdriver.common import desired_capabilities

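## Chrome
# Copy the shared dict so the class-level defaults are not mutated.
capa = desired_capabilities.DesiredCapabilities.CHROME.copy()
# "none" makes get() return without waiting for the page load;
# "normal" (the default) waits for the full load, "eager" for DOMContentLoaded.
capa["pageLoadStrategy"] = "none"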

browser = webdriver.Chrome(
    executable_path="/<path>/<to>/chromedriver",
    desired_capabilities=capa,
)

Example of waiting for a button to become clickable:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser.get("<the-link>")
try:
    # Poll for up to 10 seconds until the button is visible and enabled.
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.ID, '<ID-of-the-button>')))
except TimeoutException:
    print("[Warning] Timed out waiting for the page to load; check the network connection.")

browser.quit()
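Conceptually, an explicit wait is a polling loop: evaluate a condition, return its result once it is truthy, and give up after a deadline. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` is a hypothetical name; `WebDriverWait` additionally passes the driver to the condition and raises Selenium's own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)
```

The same shape explains why a short timeout plus a cheap condition is preferable to a fixed `time.sleep`: the call returns as soon as the condition holds.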

3.4 Element operations

3.4.1 Getting an `element`

3.4.1.1 Via XPath

from selenium.webdriver.common.by import By
elem = browser.find_element(By.XPATH, "//div[@id='content_views']")

3.4.1.2 Other methods - n/a
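Besides XPath, Selenium's `By` class offers other locator strategies such as `By.ID`, `By.NAME`, `By.CLASS_NAME`, `By.TAG_NAME`, `By.CSS_SELECTOR`, and `By.LINK_TEXT`. The sketch below tries several locators in order and returns the first match; `find_first` is a hypothetical helper, and the selectors in the usage comment are made-up examples.

```python
def find_first(browser, locators):
    """Return the first element matched by any (strategy, value) locator.

    Returns None when no locator matches.
    """
    for strategy, value in locators:
        # find_elements returns an empty list instead of raising on no match.
        matches = browser.find_elements(strategy, value)
        if matches:
            return matches[0]
    return None


# Usage (assumes a started `browser`; the selectors are illustrative only):
# from selenium.webdriver.common.by import By
# elem = find_first(browser, [(By.ID, "content_views"),
#                             (By.CSS_SELECTOR, "div.article")])
```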

3.4.2 Getting `element` content

3.4.2.1 Getting text - the displayed text

elem.text

3.4.2.2 Getting innerHTML

html = elem.get_attribute('innerHTML')

get_attribute('innerHTML')¹

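3.4.2.3 Getting innerText

Cleaning the HTML:
to strip the tags and keep only the text, use BeautifulSoup4:

from bs4 import BeautifulSoup
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
BeautifulSoup(html, "html.parser").get_text()

Note:

content = BeautifulSoup(elem.get_attribute('innerHTML'), "html.parser").get_text()

The result is almost equivalent to content = elem.text.

However, it can retrieve all the text inside a higher-level tag.
This is roughly equivalent to running the following in the browser console:
$x('//<xpath>//<to>//<tag>')[0]["innerText"]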
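When BeautifulSoup is not available, the same tag stripping can be approximated with the standard library alone. This is a minimal sketch under that assumption, with hypothetical names; unlike a browser's `innerText`, it keeps the contents of `<script>` and `<style>` elements and does no whitespace normalization.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the concatenated text content of an HTML fragment."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)
```

For example, `strip_tags(elem.get_attribute('innerHTML'))` plays the same role as the BeautifulSoup call above.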

3.4.2.4 Getting an attribute value (e.g. the `url` in `<a href="url">`)

link = elem.get_attribute("href")
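The same call extends naturally to a list of elements, for instance to gather every link on a page. The helper below is a hypothetical sketch; it skips anchors without an `href`, since `get_attribute` returns `None` when the attribute is absent.

```python
def collect_links(anchor_elements):
    """Return the non-empty href values of the given anchor elements."""
    links = []
    for anchor in anchor_elements:
        href = anchor.get_attribute("href")
        if href:  # skip None (attribute absent) and empty strings
            links.append(href)
    return links


# Usage (assumes a started `browser`):
# from selenium.webdriver.common.by import By
# links = collect_links(browser.find_elements(By.TAG_NAME, "a"))
```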

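3.5 Saving screenshots

a. The whole page:

Chrome on an Ubuntu virtual machine could not save the screenshot -> TimeoutException
Firefox: calling .save_screenshot('/<path>/<to>/<name>.png') saves a screenshot of what the current browser window shows

b. A specific (element) block: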

from selenium.webdriver.common.by import By

elem = browser.find_element(
    By.XPATH, '/<xpath>/<to>/<tag>[@class="<class-content>"]'
)
elem_png = elem.screenshot_as_png

## save to file
full_path = "/<path>/<to>/<save>/selenium-elem.png"
with open(full_path, "wb") as fp:
    fp.write(elem_png)
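Because a failed capture can leave an empty or corrupt file behind, it can help to check the bytes before writing them. The sketch below is an optional, hypothetical helper that verifies the standard 8-byte PNG signature first; it accepts the bytes from either `elem.screenshot_as_png` or `browser.get_screenshot_as_png()`.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_png(png_bytes, full_path):
    """Write PNG bytes to full_path after checking the PNG file signature."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("data does not start with the PNG signature")
    with open(full_path, "wb") as fp:
        fp.write(png_bytes)
    return full_path
```

For example, `save_png(elem.screenshot_as_png, "/<path>/<to>/<save>/selenium-elem.png")` replaces the manual `open`/`write` pair.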