Selenium Notes

Author: 云满笔记

Last modified: 2022-12-10 13:53:46

First published: 2022-12-05 14:00:43

Article link: https://blog.csdn.net/wan212000/article/details/128185556

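What is Selenium

Selenium is a complete testing system for web applications. It covers test recording (Selenium IDE), writing and running tests (Selenium Remote Control), and parallel test execution (Selenium Grid).

Its original core, Selenium Core, is based on JsUnit and written entirely in JavaScript, so it can run in any browser that supports JavaScript. (This describes the Selenium 1 architecture; WebDriver, introduced with Selenium 2, drives the browser through its own automation interface instead.)

Selenium is an automated testing tool that drives a real browser and supports many browsers. In crawlers it is mainly used to handle pages rendered by JavaScript: when used for crawling, Selenium effectively simulates a person operating the browser.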

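It has the following features:

Free and open source: there are no restrictions for commercial users either;
Multiple languages: you can write automated tests with Selenium in Java, Ruby, Python, C#, and other languages;
Multiple platforms: Windows, Linux, macOS;
Multiple browsers: Internet Explorer, Firefox, Safari, Opera, Chrome;
Distributed: test cases can be dispatched to different test machines for execution, so Selenium also acts as a dispatcher.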

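Selenium components

Selenium: a set of web test automation tools, including IDE, Grid, RC (Selenium 1.0), and WebDriver (Selenium 2.0).
Selenium IDE: a Firefox browser plugin that provides simple script recording, editing, and playback.
Selenium Grid: distributes test scripts across machines. It is now integrated into Selenium Server.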

Examples

Basic usage

Create the browser object

from selenium import webdriver

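# Create the browser instance
# firefox_login = webdriver.Ie()   # or webdriver.Firefox()
firefox_login = webdriver.Chrome()

At this step you can enable headless mode, which hides the browser window while it is being driven.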

options = webdriver.ChromeOptions()
options.add_argument('--headless')      # headless mode (optional)

# Selenium 4 takes options=...; the older chrome_options=... keyword has been removed
firefox_login = webdriver.Chrome(options=options)
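
To make the headless setup easier to reuse, the switches can be collected in one place before the driver is created. The following is only a minimal sketch, assuming Selenium 4 (where the options object is passed as `options=`). `chrome_arguments` and `make_chrome` are hypothetical helper names, not part of Selenium, and the window size is just an illustrative default.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the list of Chrome command-line switches to hand to ChromeOptions."""
    args = []
    if headless:
        args.append("--headless")  # run without a visible window
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def make_chrome(headless=True):
    """Create a Chrome driver; needs selenium and a matching chromedriver installed."""
    # Imported lazily so that chrome_arguments() can be used without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `make_chrome(headless=False)` would open a normal visible window with the same size setting.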

Visit a page

firefox_login.get('http://www.renren.com/')
# firefox_login.maximize_window()  # maximize the window; optional, depending on the case
firefox_login.minimize_window()

Find elements and interact

firefox_login.find_element_by_id('email').clear()
firefox_login.find_element_by_id('email').send_keys('xxx@sina.com')

Summary of element lookup methods

find_element_by_name
find_element_by_id
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

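The methods above find a single element; to find multiple elements, change element to elements. Note that these find_element_by_* helpers were deprecated in Selenium 4 and removed in Selenium 4.3, so on current versions use the more general method below.

There is also a more general method

from selenium.webdriver.common.by import By    # note: this import is required

browser = webdriver.Chrome()
browser.get("http://www.taobao.com")

input_first = browser.find_element(By.ID, "q")    # By.ID can be replaced with another strategy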
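
When porting code that uses the per-strategy methods listed earlier, it can help to see how each one corresponds to a locator strategy. The table below is a small sketch: the strings are the values of Selenium's `By` constants (for example `By.ID` is `"id"`), while `LEGACY_TO_BY` and `find` are hypothetical names made up for this illustration.

```python
# Locator strategy strings; these are the values of selenium's By constants
# (By.ID == "id", By.CSS_SELECTOR == "css selector", and so on).
LEGACY_TO_BY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def find(driver, legacy_name, value):
    """Call driver.find_element with the strategy matching a legacy method name."""
    return driver.find_element(LEGACY_TO_BY[legacy_name], value)
```

With a real driver, `find(browser, "find_element_by_id", "q")` makes the same call as `browser.find_element(By.ID, "q")`.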

Operate the browser

firefox_login.find_element_by_id('login').click()

Operations can be placed in an action chain and executed in sequence

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = "http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable"
browser.get(url)
# The demo is rendered inside an iframe, so switch into it first
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

The code above drags one element onto another.

Execute JavaScript

Drive the browser directly with JavaScript commands

from selenium import webdriver
browser = webdriver.Chrome()
browser.get("http://www.zhihu.com/explore")
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')

Print output and quit

print(firefox_login.current_url)
print(firefox_login.page_source)

# Quit the browser
# firefox_login.close()
firefox_login.quit()

Get an element attribute: get_attribute('class')

logo = browser.find_element_by_id('zh-top-link-logo')
print(logo.get_attribute('class'))

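Get the text: logo.text
Get the id (Selenium's internal element reference, not the HTML id attribute): logo.id
Get the location: logo.location
Get the tag name: logo.tag_name
Get the size: logo.size

Advanced usage

Beyond the basic operations, there are many special scenarios that need handling.

frame tags
Many pages contain frame tags. To work with content inside a frame you must first switch into it, and switch back out when you are done.

Switch in with switch_to.frame and switch out with switch_to.parent_frame.

Example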

# encoding:utf-8

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')     # enter the frame; iframeResult is the iframe's id
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
browser.switch_to.parent_frame()        # leave the frame
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)
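
Because every switch into a frame has to be paired with a switch back out, the pair can be wrapped in a context manager so the exit cannot be forgotten. This is a minimal sketch that relies only on `switch_to.frame` and `switch_to.parent_frame`; the name `inside_frame` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block, then switch back out."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Runs even if the body raises, so the driver is not left inside the frame.
        driver.switch_to.parent_frame()
```

Used as `with inside_frame(browser, 'iframeResult'):`, any lookups in the indented body run inside the frame, and lookups after the block run in the parent document again.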

Waiting
When driving a browser you often have to wait. Selenium provides two kinds of waits: explicit and implicit.

Implicit wait

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(100)    # implicit wait: keep retrying element lookups for up to 100 seconds
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)

Explicit wait

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 100)    # explicit wait: up to 100 seconds per condition
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)

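Neither explicit nor implicit waits sleep for the full timeout: both continue as soon as the thing they are waiting for is available, and the timeout is only an upper bound. The difference is that an explicit wait requires you to state the condition to wait for, such as a particular element being located.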
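
The "continue as soon as the condition holds" behaviour can be illustrated with a plain polling loop. This is a simplified sketch of the idea, not Selenium's actual implementation: `wait_until` is a hypothetical function, and the real `WebDriverWait` additionally passes the driver to the condition and raises its own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

If the condition is already true on the first check, the function returns immediately without sleeping at all.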

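Common conditions

title_is: checks whether the current page title equals the expected string
title_contains: checks whether the current page title contains the expected string
presence_of_element_located: checks whether an element has been added to the DOM tree; this does not mean the element is visible
visibility_of_element_located: checks whether an element is visible, meaning it is not hidden and its width and height are both non-zero
visibility_of: does the same as the previous condition, except that it takes an already located element instead of a locator
presence_of_all_elements_located: checks whether at least one matching element exists in the DOM tree. For example, if n elements on the page have the class 'column-md-3', the condition is satisfied as soon as one of them exists, and it returns the list of matching elements
text_to_be_present_in_element: checks whether an element's text contains the expected string
text_to_be_present_in_element_value: checks whether an element's value attribute contains the expected string
frame_to_be_available_and_switch_to_it: checks whether the frame can be switched into; if it can, switches into it and returns True, otherwise returns False
invisibility_of_element_located: checks whether an element is either absent from the DOM tree or not visible
element_to_be_clickable: checks whether an element is visible (displayed) and enabled, which is what clickable means here
staleness_of: waits for an element to be removed from the DOM tree; note that this condition also returns True or False
element_to_be_selected: checks whether an element is selected; typically used with dropdown lists
element_located_to_be_selected: same as the previous condition, except that it takes a locator instead of an already located element
element_selection_state_to_be: checks whether an element's selection state matches the expected state
element_located_selection_state_to_be: same as the previous condition, except that the previous one takes an already located element and this one takes a locator
alert_is_present: checks whether an alert is present on the page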
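
When none of the built-in conditions fits, `WebDriverWait.until` also accepts any callable that takes the driver and returns a truthy value once the wait should end. The following sketch shows such a custom condition; `title_contains_any` is a hypothetical name, not part of Selenium.

```python
def title_contains_any(*words):
    """Return a condition that is truthy once the page title contains any of words."""
    def condition(driver):
        title = driver.title
        for word in words:
            if word in title:
                return word  # a truthy value ends the wait and is returned by until()
        return False
    return condition
```

It would be used like the built-in conditions, for example `WebDriverWait(browser, 10).until(title_contains_any("Login", "Sign in"))`.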

For more, see the WebDriver API reference

wait.until(EC.text_to_be_present_in_element_value(('id', 'inputSearchCity'), u'西安'))

Browser back and forward

forward/back

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.baidu.com/')
browser.get('https://www.taobao.com/')
browser.back()
time.sleep(1)
browser.forward()
browser.close()

Cookie operations

get_cookies()

delete_all_cookies()

add_cookie()

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'zhaofan'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())

Tab management

Omitted for now.

Exception handling

Omitted for now.

References

Official English tutorial: https://selenium-python.readthedocs.io/
WebDriver API: https://selenium-python.readthedocs.io/api.html
PDF e-book: 《Python 爬虫开发与项目实战》
A very good tutorial: http://www.cnblogs.com/zhaof/p/6953241.html
Waits: https://www.jianshu.com/p/47853fdb613b
Wait examples: https://blog.csdn.net/qq_38316655/article/details/81989232

Problems

Error when loading a cookie: selenium.common.exceptions.InvalidCookieDomainException: Message: invalid cookie domain

While automating a login with selenium, adding a cookie that had already been obtained still failed with: selenium.common.exceptions.InvalidCookieDomainException: Message: invalid cookie domain

The original code for getting and adding the cookie was:

# Get the cookie

dr = webdriver.Chrome("D:softwareProBrowserDriverchromedriver.exe")
dr.maximize_window()
dr.get(url)
c = dr.get_cookie('JSESSIONID')
print(c)

# Add the cookie

dr = webdriver.Chrome("D:softwareProBrowserDriverchromedriver.exe")
dr.maximize_window()
dr.add_cookie({'domain': '192.168.2.211', 'httpOnly': True, 'name': 'JSESSIONID', 'path': '/smartcommty', 'sameSite': 'Lax', 'secure': False, 'value': '5574c24a-dbc4-4a7d-9607-cc24f5653ebf'})
dr.get(url)
dr.refresh()

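The resulting page was always a blank page whose address was data.

After searching online and some analysis, the cause turned out to be this: a freshly started selenium browser sits on a default data page, while the cookie carries its own domain. When the cookie is set, the current page's domain does not match the cookie's domain, so setting it fails and the browser stays on the data page.

The fix is to visit the address that requires login before setting the cookies, then set the cookies and load the page again.

As follows:

# Add the cookie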

dr = webdriver.Chrome("D:softwareProBrowserDriverchromedriver.exe")
dr.maximize_window()
dr.get(url)
dr.add_cookie({'domain': '192.168.2.211', 'httpOnly': True, 'name': 'JSESSIONID', 'path': '/smartcommty', 'sameSite': 'Lax', 'secure': False, 'value': '5574c24a-dbc4-4a7d-9607-cc24f5653ebf'})
dr.get(url)
dr.refresh()
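
The same order of operations (visit first, then add cookies, then reload) can be wrapped in a small helper so it is applied consistently. This is a sketch only: `load_cookies` is a hypothetical name, and it uses nothing beyond the `get` and `add_cookie` driver methods already shown.

```python
def load_cookies(driver, url, cookies):
    """Visit url first, add each cookie, then load url again with the cookies set."""
    driver.get(url)  # puts the browser on the cookie's domain before add_cookie
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.get(url)  # request the page again, this time with the cookies attached
```

Here `cookies` is a list of dicts in the same shape that `get_cookies()` returns.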

Exception when starting selenium: selenium.common.exceptions.WebDriverException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line

When creating the webdriver, pass the path of the Firefox executable (firefox.exe on Windows) to it.
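
A minimal sketch of this fix, assuming Selenium 4, where the Firefox path is set through the `binary_location` attribute of the Firefox options object. The candidate paths and the helper names `first_existing` and `make_firefox` are assumptions made for illustration; adjust the paths to wherever Firefox is installed on your machine.

```python
import os

# Typical install locations; these are assumptions, not guaranteed to exist.
CANDIDATE_FIREFOX_PATHS = [
    r"C:\Program Files\Mozilla Firefox\firefox.exe",
    "/usr/bin/firefox",
    "/Applications/Firefox.app/Contents/MacOS/firefox",
]


def first_existing(paths):
    """Return the first path that exists as a file, or None if none does."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def make_firefox(binary_path):
    """Start Firefox from an explicit binary; needs selenium and geckodriver."""
    # Imported lazily so that first_existing() can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.binary_location = binary_path  # where the firefox executable lives
    return webdriver.Firefox(options=options)
```

A typical call would be `make_firefox(first_existing(CANDIDATE_FIREFOX_PATHS))`, after checking that a path was actually found.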

Exception when starting selenium: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Add the directory that contains geckodriver to the PATH environment variable.
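
Whether the driver is actually visible on PATH can be checked from Python before starting the browser. The sketch below uses only the standard library; `add_to_path` and `driver_on_path` are hypothetical helper names, and `add_to_path` changes PATH for the current process only, not system-wide.

```python
import os
import shutil


def add_to_path(directory):
    """Prepend directory to PATH for the current process only."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")


def driver_on_path(name="geckodriver"):
    """Return the full path of the executable if PATH contains it, else None."""
    return shutil.which(name)
```

If `driver_on_path()` returns None, call `add_to_path` with the folder where geckodriver was unpacked, or add that folder to PATH in the system settings.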

Selecting a dropdown option that must be scrolled into view

Error message: selenium.common.exceptions.ElementNotInteractableException: Message: Element <option> could not be scrolled into view.

Solution 1

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

mySelectElement = browser.find_element(By.ID, 'providerTypeDropDown')
dropDownMenu = Select(mySelectElement)
# Wait until the wanted <option> is clickable before selecting it
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='providerTypeDropDown']//option[contains(.,'Professional')]")))
dropDownMenu.select_by_visible_text('Professional')

# Alternative: wait for the <select> itself to become clickable, then click it
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

mySelectElement = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.ID, "providerTypeDropDown")))
mySelectElement.click()

Solution 2

If your use case is to validate the presence of any element, you need to induce WebDriverWait setting the expected_conditions as presence_of_element_located() which is the expectation for checking that an element is present on the DOM of a page. This does not necessarily mean that the element is visible. So the effective line of code will be:

WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".reply-button"))).click()

If your use case is to extract any attribute of any element you need to induce WebDriverWait setting the expected_conditions as visibility_of_element_located(locator) which is an expectation for checking that an element is present on the DOM of a page and visible. Visibility means that the element is not only displayed, but it also has a height and width that is greater than 0. So in your use case, effectively the line of code will be:

email = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "element_css"))).get_attribute("value")

If your use case is to invoke click() on any element you need to induce WebDriverWait setting the expected_conditions as element_to_be_clickable() which is an expectation for checking an element is visible and enabled such that you can click it. So in your use case, effectively the line of code will be:

WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".reply-button"))).click()
