Using Selenium

License: CC 4.0 BY-SA


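This article introduces Selenium as a browser automation tool: what it is, how to install it, and how it relates to web crawling, in particular simulated login and capturing dynamically loaded data. Selenium's strength is that what you see is what you get, but it is slow. The article covers preparing the browser driver and chaining actions with ActionChains, with example code for each. It also discusses JS injection, how to avoid being detected as an automated browser, and headless browsers such as PhantomJS and headless Chrome.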

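Concept: a Python module for browser automation.
Installing the environment:
- pip install selenium
How Selenium relates to web crawling:
- Simulated login
- Conveniently capturing dynamically loaded data (the key point)
  - Strength: what you see is what you get, so anything the browser renders can be captured
  - Drawback: slow, because a real browser has to load and render every page
Using Selenium in practice
- Prepare the browser driver, matching the version of your installed Chrome: http://chromedriver.storage.googleapis.com/index.html
- Code: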

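from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 API: the old find_element_by_* helpers are gone, so locators go through find_element(By.…)
# Path to your browser driver; the r'' prefix makes it a raw string so backslashes in a Windows path are not treated as escapes
driver = webdriver.Chrome(service=Service(executable_path=r'chromedriver'))
# Open the Baidu home page with get()
driver.get("http://www.baidu.com")
# Find the "设置" (Settings) link on the page and click it
driver.find_elements(By.LINK_TEXT, '设置')[0].click()
sleep(2)
# With the settings menu open, click "搜索设置" (Search settings)
driver.find_elements(By.LINK_TEXT, '搜索设置')[0].click()
sleep(2)

# Select the option that shows 50 results per page
m = driver.find_element(By.ID, 'nr')
sleep(2)
# Relative XPath from the element with id="nr"; the absolute form '//*[@id="nr"]/option[3]' matches the same option
m.find_element(By.XPATH, './/option[3]').click()
sleep(2)

# Click the save-settings button
driver.find_elements(By.CLASS_NAME, "prefpanelgo")[0].click()
sleep(2)

# Handle the alert that pops up: accept() confirms, dismiss() cancels
driver.switch_to.alert.accept()
sleep(2)
# Find Baidu's search box and type the keyword 美女
driver.find_element(By.ID, 'kw').send_keys('美女')
sleep(2)
# Click the search button
driver.find_element(By.ID, 'su').click()
sleep(2)
# On the results page, find the "美女_百度图片" link and open it
driver.find_elements(By.LINK_TEXT, '美女_百度图片')[0].click()
sleep(3)

# Close the browser
driver.quit()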
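
The script above pauses with a fixed sleep() after every step, which is either too long or too short depending on how fast the page responds. Selenium ships WebDriverWait and expected_conditions for waiting on a condition instead. The sketch below shows the same polling idea using only the standard library so it can be run without a browser; wait_until is a hypothetical helper name, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, ignored=(Exception,)):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Hypothetical helper, not part of Selenium. Exceptions listed in
    `ignored` are treated as "not ready yet" and the call is retried.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # the element may simply not exist yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

With a real driver this could be used as wait_until(lambda: driver.find_elements(By.LINK_TEXT, '搜索设置'))[0].click(), since find_elements returns an empty (falsy) list until the link appears.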

JS injection

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Instantiate a browser object backed by the browser driver
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

# Send the request
url = 'https://www.jd.com/'
bro.get(url)
sleep(1)
# Locate the search box
# bro.find_element(By.XPATH, '//input[@id="key"]')
search = bro.find_element(By.ID, 'key')
search.send_keys('mac pro')  # type text into the located tag
sleep(2)
btn = bro.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)

# JS injection: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

# Capture the source of the current page
page_text = bro.page_source
print(page_text)
sleep(3)

bro.quit()
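
A single scrollTo call only reaches the bottom of whatever has loaded so far. On pages that keep loading content as you scroll, one common approach is to repeat the injection until the page height stops growing. The following is a minimal sketch of that loop; scroll_to_bottom, pause and max_rounds are names made up for this example, and the function assumes only that the driver object has execute_script().

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scrolls performed. `driver` only needs an
    execute_script() method, as Selenium WebDriver objects have.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is the real bottom
        last_height = new_height
    return rounds
```

In the script above, scroll_to_bottom(bro) could stand in for the single execute_script call before reading bro.page_source.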

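Action chains: ActionChains, a sequence of user actions

Workflow:
- Instantiate an action chain object, binding it to the browser it should drive
- Queue the consecutive actions you want to perform
- Call perform() to execute the queued actions immediately
Code (run against the Runoob tutorial site):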

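from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains  # action chains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'))

url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'

bro.get(url)
# NoSuchElementException is raised when the target tag lives inside an iframe
# Fix: use switch_to.frame to switch into that sub-page first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element(By.XPATH, '//*[@id="draggable"]')

# Instantiate an action chain bound to the browser
action = ActionChains(bro)
# Click and hold the element
action.click_and_hold(div_tag).perform()

# perform() executes the queued actions immediately
for i in range(5):
    action.move_by_offset(xoffset=15, yoffset=15).perform()
    sleep(2)
# release() is only queued until perform() is called
action.release().perform()
sleep(5)
bro.quit()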
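
The hold, move, release pattern above can be wrapped in a small helper so the total drag distance is given once and split into steps. This is only a sketch: step_offsets and drag_in_steps are hypothetical names, and the actions argument is assumed to be an ActionChains instance (anything with the same four methods works). For a single straight drag, ActionChains also offers drag_and_drop_by_offset.

```python
def step_offsets(dx, dy, steps):
    """Split a total (dx, dy) movement into `steps` integer offsets.

    Each step is derived from cumulative positions, so the offsets
    always sum exactly to (dx, dy) despite rounding.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    offsets = []
    prev_x = prev_y = 0
    for i in range(1, steps + 1):
        x = round(dx * i / steps)
        y = round(dy * i / steps)
        offsets.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return offsets


def drag_in_steps(actions, element, dx, dy, steps=5):
    """Press `element`, move it by (dx, dy) in small steps, then release."""
    actions.click_and_hold(element).perform()
    for x, y in step_offsets(dx, dy, steps):
        actions.move_by_offset(x, y).perform()
    # release() must be performed too, or the mouse button stays down
    actions.release().perform()
```

With the objects from the script above, drag_in_steps(ActionChains(bro), div_tag, 75, 75) would reproduce the five 15-pixel moves.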

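Evading detection with Selenium

On a site opened normally, evaluating window.navigator.webdriver through JS injection returns undefined (some newer browser versions report false instead).
On a page opened by Selenium, the same expression returns true, which lets a site tell that it is being automated.
Code: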

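# Evade detection
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service

option = ChromeOptions()
# Exclude the enable-automation switch that is otherwise passed to Chrome
option.add_experimental_option('excludeSwitches', ['enable-automation'])

bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'), options=option)

url = 'https://www.taobao.com/'

bro.get(url)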
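
To see whether the option actually had an effect, the same navigator.webdriver check can be run from Python through JS injection. A minimal sketch follows; is_flagged_as_automated is a hypothetical helper name, and it relies on execute_script() returning the value of the script's return statement, with JavaScript undefined arriving as None.

```python
def is_flagged_as_automated(driver):
    """Return True if the page sees window.navigator.webdriver as true.

    `driver` only needs an execute_script() method. undefined (None)
    and false both count as "not flagged".
    """
    return driver.execute_script("return window.navigator.webdriver") is True
```

Calling print(is_flagged_as_automated(bro)) after bro.get(url) shows the result for your setup. If it still prints True on your Chrome version, an additional argument that is commonly used is --disable-blink-features=AutomationControlled; treat that as something to verify locally rather than a guarantee.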

Headless browsers

PhantomJS (no longer maintained)
Headless Chrome
Code:

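# Headless browser
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Run Chrome without opening a visible window
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(service=Service(executable_path='./chromedriver.exe'), options=chrome_options)
url = 'https://www.taobao.com/'
bro.get(url)
sleep(2)
# Save a screenshot to confirm the page really rendered
bro.save_screenshot('123.png')

print(bro.page_source)
# Shut the headless browser down so no background process is left running
bro.quit()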
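
Printing page_source dumps the whole rendered HTML, and in a crawler the next step is usually to pull specific data out of it. As an illustration, the sketch below collects every link from an HTML string using only the standard library; LinkCollector and extract_links are hypothetical names for this example, and in practice a dedicated parser such as lxml or BeautifulSoup is often used instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(page_source):
    """Return a list of (href, text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(page_source)
    parser.close()
    return parser.links
```

With the headless browser above, extract_links(bro.page_source) would return the links of the rendered page, including ones added by JavaScript.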