Python Web Scraping: Using Selenium Browser Automation to Scrape Web Pages

wdxylb

Last modified 2024-09-29 10:46:23

Tags: web scraping, selenium, testing tools

First published 2024-09-28 00:03:45

Article link: https://blog.csdn.net/2301_80892630/article/details/142471403

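The third-party library Selenium is mainly used to scrape page data that is generated dynamically. Some sites only load content as you scroll down, especially content rendered with JavaScript. Selenium can also automate the browser in general: writing a Python script that grabs train tickets automatically, for example, is not hard. Below I scrape a page by using Selenium to simulate a user's actions.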

pip install selenium  # terminal command that installs Selenium

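I use the Edge browser. Chrome works the same way, except that the webdriver is initialized with 'driver = webdriver.Chrome()' (note the capital C). To keep things simple, I chose to scrape some meme images starting from the Baidu home page. I wrapped the steps in a function, but because scraping depends on the structure of a specific page, there is little to reuse: the function only works for Baidu, since I cannot be sure that the search input on other sites also has the id "kw".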

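First, I simulate a user typing "表情包" (memes) into the Baidu search box. search_box1.send_keys(search_name) sends the value of search_name to search_box1, which is the page's input element. Then search_box1.send_keys(Keys.RETURN) simulates the user pressing Enter.
The function takes two parameters, both with default values. The first, search_name, is the search term; it can be changed, but because the function is meant to scrape image URLs, the useful range is limited. The second, url, defaults to 'https://www.baidu.com'.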

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def get_image_urls(search_name="表情包", url="https://www.baidu.com"):

    # Initialize the WebDriver (Edge in this example) and open the page
    driver = webdriver.Edge()
    driver.get(url)
    wait = WebDriverWait(driver, 10)  # explicit waits time out after 10 seconds

    # Wait until the search box (id "kw" on Baidu) is clickable
    search_box1 = wait.until(EC.element_to_be_clickable((By.ID, "kw")))
    search_box1.send_keys(search_name)  # type the search term into the box
    time.sleep(2)
    search_box1.send_keys(Keys.RETURN)  # simulate the user pressing Enter

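On a Baidu results page, the title of each search result is rendered in an <h3> tag by default. Here I locate the first result and use first_a.click() to simulate the user clicking its link. Note that clicking the link opens a new window, so we have to call driver.switch_to.window() to move to it. Since this article is about Selenium automation and only touches on scraping, the step that follows simply grabs the image elements already present on the page and does not go into images that have not been loaded yet.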

    # Wait for the search results to load and take the first result title
    first_h3 = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3")))
    print(first_h3.text)

    # Find the first <a> tag inside the <h3>
    first_a = first_h3.find_element(By.TAG_NAME, "a")
    time.sleep(2)

    # Click the <a> tag to follow the link
    first_a.click()

    # Wait until the new window exists, then switch to it
    wait.until(EC.number_of_windows_to_be(2))
    driver.switch_to.window(driver.window_handles[-1])  # the newly opened window
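Indexing into driver.window_handles assumes the new window sits at a known position. A slightly more defensive variant, sketched below, remembers the original handle and switches to whichever handle is different. The function name switch_to_new_window is hypothetical; it relies only on driver.window_handles and driver.switch_to.window(), which Selenium provides.

```python
def switch_to_new_window(driver, original_handle):
    """Switch to the first window handle that is not original_handle.

    Returns the handle switched to, or None if no other window is open
    (for example, when the link opened in the same tab).
    """
    for handle in driver.window_handles:
        if handle != original_handle:
            driver.switch_to.window(handle)
            return handle
    return None
```

To use it, one would save original = driver.current_window_handle before first_a.click() and call switch_to_new_window(driver, original) afterwards.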

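Because the site we land on loads content through infinite scrolling, we use a while loop to limit how far we scroll. The variables max_scrolls = 10 and scroll_count = 0 together cap the number of scrolls at 10, and the loop exits early if the page height stops growing.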

    # Simulate the user scrolling down, at most ten times
    max_scrolls = 10
    scroll_count = 0

    while scroll_count < max_scrolls:
        last_height = driver.execute_script("return document.body.scrollHeight")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break  # nothing new was loaded, stop scrolling
        scroll_count += 1

    # Elements with id 'imgid' hold the image items on this page
    img_tags = driver.find_elements(by=By.ID, value='imgid')
    print("Number of image tags found:", len(img_tags))

    for img in img_tags:
        imgitems = img.find_elements(by=By.CLASS_NAME, value='imgitem')
        for item in imgitems:
            print(item.text)
    driver.quit()
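The loop above prints the text of each item rather than a URL. As a minimal sketch of how image URLs could be gathered instead, the helper below reads the src attribute from a list of elements, assuming the images are plain <img> tags with an http(s) src. That assumption has not been checked against the actual page, and the name collect_image_urls is hypothetical.

```python
def collect_image_urls(img_elements):
    """Return unique http(s) src values in first-seen order."""
    urls = []
    seen = set()
    for img in img_elements:
        src = img.get_attribute("src")
        # Skip images that have no src yet and inline data: URIs (placeholders).
        if not src or not src.startswith(("http://", "https://")):
            continue
        if src not in seen:
            seen.add(src)
            urls.append(src)
    return urls
```

It would be called before driver.quit(), for example as collect_image_urls(driver.find_elements(By.TAG_NAME, "img")).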

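All of the get_image_urls() code in the walkthrough, from the imports through driver.quit(), belongs to one function. The scroll-and-compare logic in its while loop can also be written as a standalone helper, scroll_to_bottom(), shown below. It simulates the user scrolling down to load more content, and unlike the capped loop above it keeps scrolling until the page height stops growing.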

def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load; adjust the delay as needed
        time.sleep(2)

        # Calculate new scroll height and compare with last height
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break  # no new content was loaded, so exit the loop
        last_height = new_height
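The two scrolling approaches in this article can be combined: stop when the height stops growing, but never scroll more than a fixed number of times. The following is a minimal sketch of that idea; the function name scroll_until_stable and the pause parameter are my own additions, and the only driver method it uses is execute_script().

```python
import time


def scroll_until_stable(driver, max_scrolls=10, pause=2.0):
    """Scroll to the bottom until the height stops growing or the cap is hit.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so there is nothing more to load
        last_height = new_height
        scrolls += 1
    return scrolls
```

The cap guards against pages that never stop loading, while the height check avoids pointless waiting on short pages.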

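This program covers starting a browser instance and opening a URL, using send_keys() to simulate the user's typing and key presses (such as Enter), and using click() to simulate clicking a link. None of these is hard to use. Once Selenium is installed, the program above can be run directly. If you run into problems, feel free to leave a comment!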