Python Web Scraping with selenium

Article tags: python selenium

BuGu

Article link: https://blog.csdn.net/weixin_51800059/article/details/122538415

I. Introduction

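'''
    About selenium
        -- A tool for testing web applications
        -- Runs directly in a browser, just as a real user would operate it
        -- Supports a variety of browser drivers
        -- Supports headless (no UI) browser operation
        -- Drives a real browser, so the JavaScript on a page runs automatically

    Most important features:
        -- 1. Can execute JavaScript code directly in the browser
        -- 2. Can imitate the series of actions a person performs in a browser

    Why use it?
        -- A request that only imitates a browser sometimes fails to get the data
        -- In that case a driver is needed: the Python program uses the driver to control a real browser
'''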

II. Installation

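'''
    Installing selenium (Chrome only):
        1. Check the current Chrome version: menu at the top right --> Help --> About Google Chrome
        2. Download the driver that matches the browser version from http://chromedriver.storage.googleapis.com/index.html
           (this index stops at version 114; drivers for newer Chrome versions are published on the Chrome for Testing page)
        3. The download is a zip archive; extract it to get an *.exe file
        4. Put the *.exe file in the same directory as the Python file
        5. Install the selenium library: pip install selenium
        6. Import it: from selenium import webdriver
'''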
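
Steps 4 and 5 above can be checked with a small script before going further. This is only a sketch: the helper names `driver_filename` and `check_setup` are made up for illustration, and it assumes the driver sits in the directory you pass in (the current directory by default).

```python
import importlib.util
import platform
from pathlib import Path


def driver_filename(system=None):
    """Return the ChromeDriver file name for an OS name such as 'Windows' or 'Linux'."""
    system = system or platform.system()
    return 'chromedriver.exe' if system == 'Windows' else 'chromedriver'


def check_setup(directory='.'):
    """Report whether the selenium package and the driver file can be found."""
    return {
        # find_spec returns None when the package is not installed
        'selenium_installed': importlib.util.find_spec('selenium') is not None,
        # hypothetical layout: the driver lives directly inside `directory`
        'driver_present': (Path(directory) / driver_filename()).is_file(),
    }


if __name__ == '__main__':
    print(check_setup())
```

If either value comes back as False, repeat the matching installation step before running the examples.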

III. Basic usage

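# 1. Import webdriver and the Service wrapper for the driver executable
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# 2. Create the browser object
# This is the path to the driver downloaded earlier. It sits in the same directory as this Python file, so the file name alone is enough.
path = 'chromedriver.exe'
# Selenium 4 takes the driver path through a Service object; the positional form webdriver.Chrome(path) was removed in later 4.x releases
browser = webdriver.Chrome(service=Service(executable_path=path))

# 3. The site to visit
url = 'https://www.baidu.com'

# 4. Visit the site: this opens a browser window automatically
browser.get(url)

# 5. Get the page source: it includes content rendered by JavaScript, which a request that only imitates a browser cannot get
content = browser.page_source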
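
Once `page_source` is available it is an ordinary HTML string, so it can be parsed like any other. Here is a minimal sketch that pulls out the page title with the standard library. The names `TitleParser`, `extract_title` and `fetch_title` are made up for this example, and `fetch_title` assumes Selenium 4 plus a driver file next to the script.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == 'title' and self.title is None:
            self._in_title = True
            self.title = ''

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False


def extract_title(page_source):
    """Return the title found in an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(page_source)
    return parser.title.strip() if parser.title is not None else None


def fetch_title(url, driver_path='chromedriver.exe'):
    """Load url in Chrome and return its title (needs selenium and a driver)."""
    # Imported here so extract_title stays usable without selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    browser = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        browser.get(url)
        return extract_title(browser.page_source)
    finally:
        browser.quit()  # always close the browser, even if get() fails
```

The `try`/`finally` is there because a script that crashes halfway otherwise leaves a browser process running.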

IV. Locating elements

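# 1. Import webdriver, the Service wrapper, and the By locator strategies
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# 2. Create the browser object
path = 'chromedriver.exe'
browser = webdriver.Chrome(service=Service(executable_path=path))

# 3. The site to visit
url = 'https://www.baidu.com'

# 4. Visit the site: this opens a browser window automatically
browser.get(url)


# -------------------------------------------------------
# Locate elements and get the corresponding WebElement objects
# (the find_element_by_* methods were removed in Selenium 4.3; use find_element(By.X, ...) instead)


# By id
a = browser.find_element(By.ID, 'su')
print(type(a))

# By XPath, single result: returns the first match
a = browser.find_element(By.XPATH, '//div')
print(a)

# By XPath, multiple results: returns a list
a = browser.find_elements(By.XPATH, '//div')
print(type(a))

# By CSS selector, single result: returns the first match
a = browser.find_element(By.CSS_SELECTOR, '#su')
print(a)

# By CSS selector, multiple results: returns a list
a = browser.find_elements(By.CSS_SELECTOR, '#su')
print(a)

# By the text of an <a> tag
a = browser.find_element(By.LINK_TEXT, 'hao123')
print(a)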
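
One difference between the singular and plural lookups is worth knowing: in Selenium 4, `find_element` raises `NoSuchElementException` when nothing matches, while `find_elements` just returns an empty list. A small helper can turn that into "element or None". This is a sketch, and the name `find_or_none` is made up; it works with any object that has a Selenium 4 style `find_elements(by, value)` method.

```python
def find_or_none(browser, by, value):
    """Return the first element matching the locator, or None if nothing matches."""
    matches = browser.find_elements(by, value)  # empty list when there is no match
    return matches[0] if matches else None


# Usage with a real browser object (By.ID is simply the string 'id'):
#     button = find_or_none(browser, 'id', 'su')
#     if button is None:
#         print('search button not found')
```

This keeps scraping code free of try/except blocks for elements that are optional on a page.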

V. Reading attributes of a WebElement object

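# 1. Import webdriver, the Service wrapper, and the By locator strategies
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# 2. Create the browser object
path = 'chromedriver.exe'
browser = webdriver.Chrome(service=Service(executable_path=path))

# 3. The site to visit
url = 'https://www.baidu.com'

# 4. Visit the site: this opens a browser window automatically
browser.get(url)

# Get the element (named search_box so it does not shadow the built-in input())
search_box = browser.find_element(By.ID, 'kw')
print(search_box)

# Get attributes of the element
print(search_box.get_attribute('class'))
print(search_box.get_attribute('type'))
print(search_box.get_attribute('name'))

# Get the tag name of the element
print(search_box.tag_name)

# Get the text of the element (empty for an <input>; its typed content is in get_attribute('value'))
print(search_box.text)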
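
When inspecting many elements it can be handy to collect these values in one dictionary. The helper below is a sketch with a made-up name, `describe_element`; it only relies on `tag_name`, `text` and `get_attribute`, which are the members used above.

```python
def describe_element(element, attribute_names=('class', 'type', 'name', 'value')):
    """Return the tag name, text, and selected attributes of an element as a dict."""
    info = {'tag_name': element.tag_name, 'text': element.text}
    for name in attribute_names:
        # get_attribute returns None when the attribute is not set
        info[name] = element.get_attribute(name)
    return info


# Usage with a real element:
#     print(describe_element(browser.find_element('id', 'kw')))
```

For an `<input>` the `text` entry will normally be empty, which is why `value` is included in the default attribute list.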

VI. Automating interaction between the program and the browser

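from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

path = 'chromedriver.exe'
browser = webdriver.Chrome(service=Service(executable_path=path))
url = 'https://www.baidu.com'

# ---------------------------------------------

# Pause length in seconds between steps
t = 3

# Open the browser and visit the site
browser.get(url)

time.sleep(t)

# Get the search box
search_box = browser.find_element(By.ID, 'kw')
# Type a keyword into it
search_box.send_keys('周杰伦')

time.sleep(t)

# Get the "百度一下" (search) button
button = browser.find_element(By.ID, 'su')
# Click it
button.click()

time.sleep(t)

# Take a screenshot (Selenium writes PNG data, so use a .png file name)
browser.save_screenshot('baidu.png')

time.sleep(t)

# Execute JavaScript: scroll to the bottom of the page
js = 'window.scrollTo(0, 100000)'
browser.execute_script(js)

time.sleep(t)

# Get the "next page" link: do not forget the '@' in the XPath attribute syntax
next_page = browser.find_element(By.XPATH, '//a[@class="n"]')
# Click it
next_page.click()

time.sleep(t)

# Go back to the previous page
browser.back()

time.sleep(t)

# Go forward to the next page again
browser.forward()

time.sleep(t)

# Quit
browser.quit()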
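
The fixed `time.sleep(t)` pauses are easy to follow but either waste time or are too short on a slow connection. An alternative is to poll until the element actually exists. Below is a rough sketch of that idea with a made-up helper name, `wait_for_element`; Selenium also ships its own `WebDriverWait` class for this purpose, which is the better choice in real projects.

```python
import time


def wait_for_element(browser, by, value, timeout=10.0, interval=0.5):
    """Poll until an element matching the locator exists, then return it.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        matches = browser.find_elements(by, value)
        if matches:
            return matches[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f'no element for {by}={value!r} after {timeout}s')
        time.sleep(interval)


# Usage with a real browser object, in place of a fixed sleep:
#     search_box = wait_for_element(browser, 'id', 'kw')
#     search_box.send_keys('周杰伦')
```

The helper returns as soon as the element appears, so a fast page costs almost no waiting time.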

VII. Headless operation

1. About headless mode

Headless: operating the browser without opening a browser window

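'''
    selenium is relatively slow, because opening a browser and loading pages takes time

    PhantomJS: [not recommended]
        A standalone headless browser: it loads pages without any UI
        Supports JavaScript execution
        Does no CSS or GUI rendering to a screen, so it runs efficiently
        No longer maintained, and Selenium 4 no longer supports it

    Chrome headless [official, recommended]
        A headless mode of the browser: Chrome can be used without opening its UI
        Supports JavaScript execution
        A mode Google added in Chrome 59
        Same results as normal Chrome, but more efficient
'''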

2. Basic use of Chrome headless

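from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Basic configuration
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# path is the location of your own Chrome browser; the raw string keeps the backslashes literal
path = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
chrome_options.binary_location = path


# Create the driver object that controls the browser
# (Selenium 4 takes options=; recent releases no longer accept chrome_options=)
browser = webdriver.Chrome(options=chrome_options)

# Open the site
url = 'https://www.baidu.com'
browser.get(url)

# Take a screenshot and save it: this shows the site was visited without a browser window opening
# (Selenium writes PNG data, so use a .png file name)
browser.save_screenshot('baidu.png')

# Close the headless browser process
browser.quit()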
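
A headless browser has no real window, so screenshots come out at whatever default viewport Chrome picks. Chrome's `--window-size` flag sets an explicit size. The sketch below keeps the argument list in a plain function so it is easy to inspect; `headless_arguments` and `build_headless_options` are made-up names, and the 1920x1080 default is just an assumption.

```python
def headless_arguments(width=1920, height=1080):
    """Return Chrome command-line arguments for a headless session."""
    return ['--headless', '--disable-gpu', f'--window-size={width},{height}']


def build_headless_options(binary_location=None, width=1920, height=1080):
    """Build a selenium Options object from the arguments above (needs selenium)."""
    # Imported here so headless_arguments stays usable without selenium installed
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_arguments(width, height):
        options.add_argument(argument)
    if binary_location:
        # Only needed when Chrome is not in its default install location
        options.binary_location = binary_location
    return options


# Usage:
#     browser = webdriver.Chrome(options=build_headless_options())
```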

3. A simple wrapper

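from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Wrap the Chrome headless setup and return a ready-to-use browser object
def getBrowser():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # path is the location of your own Chrome browser; the raw string keeps the backslashes literal
    path = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
    chrome_options.binary_location = path
    # Selenium 4 takes options=; recent releases no longer accept chrome_options=
    return webdriver.Chrome(options=chrome_options)


# Get the driver object that controls the browser
browser = getBrowser()

# Open the site
url = 'https://www.baidu.com'
browser.get(url)

# Take a screenshot and save it: this shows the site was visited without a browser window opening
# (Selenium writes PNG data, so use a .png file name)
browser.save_screenshot('baidu.png')

# Close the headless browser process
browser.quit()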
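
A headless browser that is never closed keeps running in the background, and with no window it is easy to miss. One way to guard against that is to wrap the factory function in a context manager. This is a sketch with a made-up name, `managed_browser`; it accepts any function that returns an object with a `quit()` method, such as `getBrowser` above.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and make sure quit() is called afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body of the with-block raises


# Usage:
#     with managed_browser(getBrowser) as browser:
#         browser.get('https://www.baidu.com')
#         browser.save_screenshot('baidu.png')
```

Everything inside the `with` block can fail without leaving a stray Chrome process behind.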