Getting Started with Selenium

Silence~123

Category: Selenium. Tags: Selenium, CAPTCHA recognition

Article link: https://blog.csdn.net/Venry_/article/details/88725483

What

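Selenium is a complete suite for testing web applications. It covers recording tests (Selenium IDE), writing and running them (Selenium Remote Control), and running them in parallel (Selenium Grid).

In effect it works like a macro tool such as 按键精灵: it drives a browser for you and simulates what a user would do in it.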

Why

Because some pages only show their real content after their Ajax requests have completed.
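
Since Ajax content arrives after the initial page load, a script usually has to wait for it before reading the page. The sketch below shows the polling idea in plain Python. The name `wait_until` and its parameters are hypothetical and only for illustration; Selenium ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper; with Selenium you would normally use WebDriverWait.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

With a real browser the condition might be something like `lambda: "some expected text" in browser.page_source`, where the expected text depends on the page being loaded.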

HOW

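Preparation:

Install the selenium module (for example with pip install selenium); after that you only need to import it in Python.
Download the geckodriver driver (download link).
Unzip the driver and note its path: …/geckodriver.exe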
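
Before starting the browser it can help to check that the driver executable from the steps above is actually reachable. A minimal sketch using only the standard library follows; `find_driver` and its parameters are hypothetical names, not part of Selenium.

```python
import os
import shutil


def find_driver(name="geckodriver", explicit_path=None):
    """Return a usable path to the driver executable, or None if not found.

    `explicit_path` is checked first; otherwise the PATH is searched.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # On Windows, shutil.which also tries the PATHEXT extensions such as .exe
    return shutil.which(name)
```

If this returns None, the path passed to Selenium is probably wrong, or the driver has not been added to PATH.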

Code:

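from selenium import webdriver

# Start Firefox through geckodriver (the path is the author's local install)
browser = webdriver.Firefox(executable_path="D:/Python36/geckodriver.exe")
browser.get(url="http://www.baidu.com")
print(browser.page_source)  # source of the page as loaded in the browser
browser.quit()  # close the browser and stop the driver process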

Running this opens a browser window, loads the Baidu home page, and then closes the browser.

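This is also a characteristic of Selenium: it has to be paired with a third-party browser, and a browser window appears while it runs. For us programmers that window is unnecessary, which is where PhantomJS comes in. PhantomJS is a headless browser: it loads the site into memory and executes the JavaScript on the page, and because it renders no graphical interface it runs more efficiently than a full browser. (PhantomJS download link.)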

browser = webdriver.PhantomJS( executable_path="D:\\Python36\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe")

Running it produces a warning:

UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

In other words, Selenium has deprecated its PhantomJS support and recommends a headless version of Firefox or Chrome instead.

The code:

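from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')  # run Firefox without a visible window
# options= replaces the deprecated firefox_options= keyword
browser = webdriver.Firefox(executable_path="D:/Python36/geckodriver.exe", options=options)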

Compared with the earlier code, the only addition is the headless option. This configures headless Firefox; the driver is still geckodriver.

If running it fails with an error like this:

selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

the cause is a version mismatch; updating Firefox to the latest version should fix it.
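
Newer Selenium 4 releases pass the driver path through a Service object rather than executable_path. A rough sketch of the headless Firefox setup in that style follows. The function names and the `driver_path` parameter are illustrative only, and the selenium imports sit inside the function so the file can be imported on a machine without Selenium or a browser.

```python
def firefox_arguments(headless=True):
    """Command-line arguments to hand to Firefox (illustrative helper)."""
    return ["-headless"] if headless else []


def build_firefox(driver_path, headless=True):
    """Start Firefox via geckodriver using the Selenium 4 Service API."""
    # Imported lazily: only needed when a browser is actually started.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    for argument in firefox_arguments(headless):
        options.add_argument(argument)
    service = Service(executable_path=driver_path)
    return webdriver.Firefox(service=service, options=options)
```

Calling `build_firefox("D:/Python36/geckodriver.exe")` would then play the role of the `webdriver.Firefox(...)` line above, assuming Selenium 4 is installed.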

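Chrome, by contrast, needs the chromedriver driver. Pick a version to download (it should match your installed Chrome); I downloaded the latest one available at the time.

The code is as follows: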

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
# options= replaces the deprecated chrome_options= keyword
browser = webdriver.Chrome(executable_path="D:\\Python36\\chromedriver_win32\\chromedriver.exe", options=options)

Page operations

Locating elements

The following methods fetch elements from the HTML:

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
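
Newer Selenium 4 releases removed the find_element_by_* helpers in favour of a single `find_element(by, value)` call. As a rough guide to how the old names translate, here is a small sketch; `LEGACY_TO_STRATEGY` and `to_locator` are hypothetical names, while the strategy strings are the values of Selenium's `By` constants (`By.ID`, `By.NAME`, and so on).

```python
# Old helper name -> locator strategy string (the value of the By constant)
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def to_locator(legacy_method, value):
    """Translate e.g. ("find_element_by_id", "kw") into ("id", "kw")."""
    try:
        return (LEGACY_TO_STRATEGY[legacy_method], value)
    except KeyError:
        raise ValueError("unknown lookup method: %r" % legacy_method) from None
```

With a running browser, `browser.find_element(*to_locator("find_element_by_id", "kw"))` would be the same lookup as `browser.find_element(By.ID, "kw")`.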

For example, Baidu's search box:

<input type="text" class="s_ipt" name="wd" id="kw" maxlength="100" autocomplete="off">

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(executable_path="D:\\Python36\\chromedriver_win32\\chromedriver.exe", options=options)
browser.get(url="http://www.baidu.com")
input = browser.find_element_by_id('kw')

Elements can be fetched this way; the other lookup methods do what their names suggest. To type into the input box, use:

input.send_keys('selenium')

To clear it:

input.clear()

Mouse actions

To simulate mouse actions on the page, such as clicking, double-clicking, and dragging, import the ActionChains class.

<input type="submit" value="百度一下" id="su" class="btn self-btn bg s_btn">

This is the search button next to Baidu's input box.

button = browser.find_element_by_id('su')

Move the mouse over an element, such as button:

from selenium.webdriver import ActionChains

mouse = ActionChains(browser)
mouse.move_to_element(button).perform()

Click (the mouse is already over the button element at this point):

mouse.click().perform()

Or skip ActionChains and call the element's own method:

button.click()

Double-click:

mouse.double_click().perform()

Right-click:

mouse.context_click().perform()

Click and hold:

mouse.click_and_hold().perform()

Drag and drop, from one element to another:

mouse.drag_and_drop(input,button).perform()
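
The calls above can also be chained, because each ActionChains method queues an action and returns the chain, and `perform()` then runs what was queued. A small sketch of that pattern follows. The function name and the `chains_cls` parameter are hypothetical; `chains_cls` exists only so the sketch can be tried without a browser.

```python
def hover_then_click(browser, element, chains_cls=None):
    """Move the pointer onto `element`, click it, and run the queued actions.

    By default the real selenium ActionChains class is used; pass a
    stand-in class as `chains_cls` to exercise the function without a browser.
    """
    if chains_cls is None:
        from selenium.webdriver import ActionChains
        chains_cls = ActionChains
    # Build a fresh chain for this gesture, queue two actions, then run them.
    chains_cls(browser).move_to_element(element).click().perform()
```

For the Baidu example, `hover_then_click(browser, button)` would hover over the search button and click it in one call.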

See the details for other operations.

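Let's try simulating a user login, including recognizing the CAPTCHA and verifying the login. I chose my school's library system.

The username and password are easy to handle, but the CAPTCHA is not. I turned to OCR (optical character recognition): I first tried Tesseract, but it could not recognize the CAPTCHA; Alibaba's service did not work either; only Baidu's API finally succeeded.

First get the CAPTCHA image: take a screenshot of the whole page, then crop the CAPTCHA out of it.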

from PIL import Image
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

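image_name = "temp.png"
browser.get_screenshot_as_file(image_name)  # screenshot of the whole page
image = Image.open(image_name)
box = (221, 265, 266, 284)  # (left, upper, right, lower) in pixels
region = image.crop(box)
verify_name = "v.png"
region.save(verify_name)  # the cropped CAPTCHA image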

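That is all it takes. box holds the left, upper, right, and lower coordinates, which you can read off with the system's Paint tool. Remember to take the screenshot with a headless browser; otherwise the crop position will be off. My guess is that the visible browser window opens but is not full-screen, which shifts the coordinates.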
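
Instead of measuring the coordinates by hand, the box could in principle be computed from the CAPTCHA element itself, since a Selenium WebElement exposes `location` and `size` dictionaries. The sketch below only does the arithmetic. `crop_box` is a hypothetical helper, and `scale` is an assumed factor for screenshots whose pixels do not match page coordinates one-to-one.

```python
def crop_box(location, size, scale=1.0):
    """Build a (left, upper, right, lower) box for PIL's Image.crop.

    `location` and `size` are dicts shaped like WebElement.location
    ({"x", "y"}) and WebElement.size ({"width", "height"}).
    """
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    # Image.crop expects integer pixel coordinates
    return tuple(int(round(v)) for v in (left, upper, right, lower))
```

An element located at (221, 265) with a size of 45 by 19 pixels would give the same box as the hand-measured one above. Those location and size values are back-calculated from that box for illustration, not read from the real page.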

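Then I called the Baidu API (see the Baidu API documentation for how to make the call), logged in to my library account, and retrieved the information there. I won't disclose the details, hehe.

Try it on other sites and challenge yourself.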
