Scraping Weibo User Images with Selenium

小酋仍在学习

Last modified 2024-06-21 17:45:04

Tags: python

First published 2024-06-21 17:35:49

Article link: https://blog.csdn.net/qq_47753695/article/details/139866699

Introduction

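This is a Python script that uses Selenium to automatically scrape and download images from a specified Weibo user's page. Selenium was chosen mainly because it can handle dynamically loaded content and automate complex page interactions such as scrolling and clicking tabs. The script then downloads the collected images concurrently with a thread pool to improve throughput.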

Code

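Before you run it:

1. Replace the URL

The URL must be a blogger's profile page that shows the tabs listed here, because the script works by clicking each tab in turn: "精选" (Featured), "微博" (Posts), "视频" (Videos), and "相册" (Albums). It collects only the .jpg images found under those four tabs.

2. Change the local folder where the images are saved.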
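
The two manual changes above could also be supplied on the command line, so the script does not need to be edited for each run. The following is only a minimal sketch of that idea: `parse_args`, the positional `url` argument, and the `--save-dir` flag are hypothetical names that do not appear in the original script, and the URL in the usage line is a placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the profile URL and save folder (hypothetical CLI, not part of the original script)."""
    parser = argparse.ArgumentParser(description="Scrape images from a Weibo profile page")
    parser.add_argument("url", help="profile page that shows the four tabs")
    parser.add_argument("--save-dir", default="D:/WeiboSrc/",
                        help="root folder for downloaded images")
    return parser.parse_args(argv)


# Placeholder URL for illustration only
args = parse_args(["https://example.com/profile", "--save-dir", "./WeiboSrc"])
print(args.url, args.save_dir)
```

With this in place, `args.url` would be passed to `driver.get(...)` and `args.save_dir` would replace the hard-coded `save_dir`.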

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib.request
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Create the root directory for saved images; change it to your own folder
save_dir = "D:/WeiboSrc/"
os.makedirs(save_dir, exist_ok=True)

# Configure browser options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')

# Initialize the WebDriver
driver = webdriver.Chrome(options=options)
driver.get("Replace this with the URL of the profile page you want to scrape")

# Wait for the page to load
time.sleep(5)

# Scroll the current tab to the bottom, then collect its image URLs
def crawl_images():
    # Scroll the page to trigger loading of more images
    scroll_pause_time = 1  # Seconds to wait after each scroll; increase on slow connections
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)

        # Compare the new page height with the previous one; stop when it no longer grows
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Collect the URL of every .jpg image on the page
    images = driver.find_elements(By.TAG_NAME, 'img')
    img_urls = []
    for img in images:
        img_url = img.get_attribute('src')
        if img_url and img_url.endswith('.jpg'):
            # Swap common size markers in the URL for '/large/' to get the high-resolution image
            high_res_url = (img_url.replace('/orj360/', '/large/')
                            .replace('/orj480/', '/large/')
                            .replace('/mw690/', '/large/')
                            .replace('/bmiddle/', '/large/'))
            img_urls.append(high_res_url)
    return img_urls

# Tab labels exactly as they appear on the profile page (matched by text in the XPath)
tabs = ["精选", "微博", "视频", "相册"]

# Download one image into tab_dir; runs inside a worker thread
def download_image(tab_dir, idx, img_url):
    try:
        img_path = os.path.join(tab_dir, f"image_{idx+1}.jpg")
        urllib.request.urlretrieve(img_url, img_path)
        print(f"Saved {img_path}")
    except Exception as e:
        print(f"Failed to save {img_url}: {e}")

try:
    # Click each tab in turn and scrape its images
    for tab in tabs:
        try:
            # Create a subfolder for this tab
            tab_dir = os.path.join(save_dir, tab)
            os.makedirs(tab_dir, exist_ok=True)

            # Wait until the tab can be clicked, then click it
            tab_element = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, f"//span[text()='{tab}']"))
            )
            ActionChains(driver).move_to_element(tab_element).click().perform()
            time.sleep(3)  # Wait for the tab content to load

            img_urls = crawl_images()

            # Download in parallel; the with-block waits until every download has finished
            with ThreadPoolExecutor(max_workers=20) as executor:
                for idx, img_url in enumerate(img_urls):
                    executor.submit(download_image, tab_dir, idx, img_url)
        except Exception as e:
            print(f"Failed to process tab {tab}: {e}")
finally:
    # Always close the WebDriver, even if the run is interrupted
    driver.quit()
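
The size-marker replacement inside `crawl_images` can be pulled out into a pure function, which makes it possible to check the URL handling without opening a browser. The sketch below assumes that the four markers used in the script are the only ones that matter. `to_large_url` and `unique_large_urls` are hypothetical helper names, and the sample URLs are made up for illustration. Unlike the script, it also drops duplicate URLs and tolerates a query string after `.jpg`.

```python
from urllib.parse import urlsplit

# Size markers taken from the script above; other markers may exist
SIZE_MARKERS = ("/orj360/", "/orj480/", "/mw690/", "/bmiddle/")


def to_large_url(url):
    """Replace each known size marker with '/large/'; other URLs pass through unchanged."""
    for marker in SIZE_MARKERS:
        url = url.replace(marker, "/large/")
    return url


def unique_large_urls(urls):
    """Keep .jpg URLs only, upgrade them to '/large/', and drop duplicates in order."""
    seen = {}
    for url in urls:
        # Check the path only, so a query string after '.jpg' does not hide the extension
        if url and urlsplit(url).path.endswith(".jpg"):
            seen.setdefault(to_large_url(url), None)
    return list(seen)


# Made-up sample values, including a duplicate, a non-jpg image, and a missing src
sample = [
    "https://img.example.com/orj360/a.jpg",
    "https://img.example.com/mw690/a.jpg",
    "https://img.example.com/bmiddle/b.jpg?x=1",
    "https://img.example.com/orj360/c.png",
    None,
]
print(unique_large_urls(sample))
```

In the script, `unique_large_urls` could be applied to the list of `src` values gathered from the `img` elements, so the same picture shown at two sizes is downloaded only once.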
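
The scroll loop in `crawl_images` stops only when the page height stops growing, so a feed that keeps loading new content could keep it running for a long time. One possible safeguard is a cap on the number of scroll rounds. The sketch below assumes only that the driver object exposes `execute_script`, as Selenium's WebDriver does. `scroll_to_bottom`, `max_rounds`, and the `FakeDriver` stand-in are hypothetical and exist just so the loop can be exercised without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # Give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a WebDriver: the page grows on the first few scrolls, then stops."""

    def __init__(self, grow_times):
        self.height = 1000
        self.grow_left = grow_times

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Simulate new content arriving after a scroll
            if self.grow_left > 0:
                self.height += 1000
                self.grow_left -= 1
            return None
        return self.height


# The fake page grows three times, so the fourth scroll detects a stable height
print(scroll_to_bottom(FakeDriver(grow_times=3), pause=0))
```

With a real browser, the call would be `scroll_to_bottom(driver)` in place of the `while True` loop, with `pause` playing the role of `scroll_pause_time`.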