Python Web Scraping in Practice: Efficiently Downloading Model Photo Sets from 优美图库 (Full Code Included)

Article link: https://blog.csdn.net/qq_34666239/article/details/136157753

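The web is full of visual content, and 优美图库 (umei.net), a site that hosts high-quality model photo sets, has attracted plenty of fans. So how can those images be downloaded efficiently? This article walks you through using Python with multithreading to download them quickly. 🚀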

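Tools 🔧

Before starting, make sure your Python environment has the following available:

requests: sends HTTP requests (third-party, install it with pip).
beautifulsoup4: parses HTML pages and extracts information (third-party, install it with pip).
threading: runs downloads in multiple threads to speed them up (part of the standard library, so there is nothing to install).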

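Crawler Strategy 🕵️‍♂️

The crawler works in a few key steps:

Request the main page: fetch the 优美图库 main (list) page to get the list of photo sets.
Parse detail-page URLs: use BeautifulSoup to extract the detail-page URL of each photo set.
Download images in multiple threads: create one thread per image so that downloads run in parallel.

The sections below go through these steps in more detail, so that the whole download runs efficiently and in order.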

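Implementation Walkthrough 🛠️📖

1. Preparation

Before writing the crawler, make sure the required libraries are in place: requests (for sending HTTP requests) and BeautifulSoup (bs4, for parsing HTML) must be installed, while threading, which provides the multithreaded downloads, ships with Python.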

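2. Sending Requests and Parsing Pages

Parsing the main page: first, send a GET request with requests to the 优美图库 main page or an image list page. Then parse the returned HTML with BeautifulSoup to find the detail-page link and title of each photo set.
Visiting detail pages: for each detail-page link, send another GET request with requests, parse the page with BeautifulSoup, and extract the direct download URLs of the images in that photo set.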
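The links extracted from a list page are usually site-relative paths, so they have to be combined with the site root before they can be requested. As a minimal sketch, the standard-library `urllib.parse.urljoin` resolves relative and absolute links alike. The helper name `to_absolute` and the sample `href` values are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.umei.net"  # site root used in this article


def to_absolute(href, base=BASE_URL):
    """Resolve a link found on a page against the site root."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
print(to_absolute("/sample/1.htm"))
print(to_absolute("https://img.example.com/a.jpg"))
```

Compared with plain string concatenation, this leaves links that are already absolute untouched.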

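3. Downloading Images in Multiple Threads

Creating directories: create a directory named after each photo set's title to hold its images. This keeps the downloaded files organized.
Download threads: for every image in a photo set, create a separate download thread with threading.Thread. Each thread calls the download_image function, which downloads one image and saves it in the directory created for that set.
Synchronizing threads: once the downloads for all images in a photo set have been started, call join on each thread to wait for it, so that every image in the set has been handled before moving on.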
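The three steps above can be sketched without touching the network. In this sketch, `save_bytes` is a hypothetical stand-in for download_image that writes fixed bytes instead of fetching them, and `safe_name` is an assumed helper (not part of the original code) that strips characters which are illegal in file names on common systems.

```python
import os
import re
import tempfile
import threading


def safe_name(title):
    """Replace characters that are not allowed in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"


def save_bytes(data, directory, title, num):
    """Stand-in for download_image: writes bytes to disk instead of fetching them."""
    os.makedirs(directory, exist_ok=True)  # safe even if several threads get here at once
    path = os.path.join(directory, f"{safe_name(title)}_{num}.jpg")
    with open(path, "wb") as f:
        f.write(data)


title = "demo: album?"  # hypothetical title containing illegal characters
album_dir = os.path.join(tempfile.mkdtemp(), safe_name(title))

# Start one thread per image, then wait for all of them.
threads = []
for num in range(1, 4):
    t = threading.Thread(target=save_bytes,
                         args=(b"fake image bytes", album_dir, title, num))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

saved = sorted(os.listdir(album_dir))
print(saved)
```

Passing `exist_ok=True` matters here because several threads may try to create the same directory at the same moment.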

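4. Error Handling and Logging

Retry mechanism: wrap requests in a fetch function with retry logic to cope with failed network requests, using a simple loop with growing delays (exponential backoff). Note that the complete code at the end of this article does not define such a function; its requests are sent once, without retries.
Logging: print progress, success or failure status, and any error messages while downloading. This is very useful for debugging and for monitoring the crawler.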
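As a rough sketch of what such a fetch function could look like (this is an assumption, not the author's implementation), the version below takes the request function as a parameter so that it can be tried without network access. The names `fetch`, `flaky_get`, `retries` and `base_delay` are all hypothetical. It uses the standard `logging` module in place of print.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("crawler")


def fetch(get, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url), retrying on OSError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return get(url)
        except OSError as exc:  # requests.RequestException is a subclass of OSError
            if attempt == retries:
                log.error("giving up on %s: %s", url, exc)
                raise
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            log.warning("attempt %d for %s failed (%s), retrying in %.1fs",
                        attempt + 1, url, exc, delay)
            sleep(delay)


# Simulated request that fails twice and then succeeds.
calls = []


def flaky_get(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = fetch(flaky_get, "https://example.com/page", sleep=delays.append)
print(result, delays)
```

With requests, one might pass something like `lambda u: session.get(u, timeout=10)` as `get`. Keep in mind that an HTTP error status does not raise an exception on its own unless `raise_for_status()` is called on the response.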

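5. Respecting Copyright and the Law

When using a crawler to download online resources, copyright and legality must be taken into account. Download only content that you are allowed to download, and follow the site's terms of use. The crawler should also avoid putting excessive load on the site's servers.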
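One common way to check which paths a site asks crawlers to avoid is its robots.txt file. The sketch below uses the standard-library `urllib.robotparser` on a made-up robots.txt, so the rules and URLs are purely illustrative and say nothing about what any real site permits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. A real crawler
# would load the site's actual file with set_url(...) and read().
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://www.example.com/gallery/index.html"))
print(allowed("https://www.example.com/private/data.html"))
```

A robots.txt check does not replace reading the site's terms of use, and it does not settle copyright questions.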

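Code Walkthrough 🧑‍💻

Here is the crawler's spellbook (the code):

Setting up the session and request headers

session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 ..."
}
# update() merges into the session's default headers instead of replacing them
session.headers.update(headers)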

Parsing the main page

Parse the main page with BeautifulSoup to get the detail-page URL and title of each photo set.

page = BeautifulSoup(index_html, "html.parser")
find_all = page.find_all("li", attrs={"class": "i_list list_n2"})
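One detail worth knowing about the lookup above: when the class value passed to `find_all` contains a space, BeautifulSoup compares it against the exact string of the `class` attribute, so an element whose classes appear in a different order is not matched. The sketch below shows this on made-up markup (the real page structure may differ) and uses a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a list page; the real site may differ.
SAMPLE_HTML = """
<ul>
  <li class="i_list list_n2"><a href="/sample/1.htm" title="Album A">A</a></li>
  <li class="list_n2 i_list"><a href="/sample/2.htm" title="Album B">B</a></li>
  <li class="ad"><a href="/ad.htm" title="Ad">Ad</a></li>
</ul>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: only the first <li> qualifies.
exact = page.find_all("li", attrs={"class": "i_list list_n2"})

# CSS selector: matches any <li> carrying both classes, in any order.
by_selector = page.select("li.i_list.list_n2")

albums = [(li.a.get("href"), li.a.get("title")) for li in by_selector if li.a]
print(len(exact), albums)
```

If the site always writes the classes in the same order, both approaches return the same elements.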

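Downloading a photo set in multiple threads

For each photo set, once an image URL has been parsed, a download thread is created for that image.

for item in find_all:
    # Parse the detail page and the image URL...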
    thread = threading.Thread(target=download_image, args=(...))
    thread.start()
    threads.append(thread)

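Waiting for all threads to finish

After the downloads for all images in a photo set have been started, wait for every thread to finish so that no download is still in progress when the next set begins.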

for thread in threads:
    thread.join()
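Starting one thread per image places no upper limit on the number of simultaneous requests. As a possible alternative (not what the complete code in this article does), the standard-library `concurrent.futures.ThreadPoolExecutor` caps the number of worker threads and waits for them when the `with` block ends. In this sketch, `fake_download` and the URLs are placeholders, so nothing is actually fetched.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url):
    """Stand-in for download_image: returns the URL and the worker thread's name."""
    return url, threading.current_thread().name


# Hypothetical image URLs, for illustration only.
urls = [f"https://img.example.com/{n}.jpg" for n in range(1, 9)]

results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fake_download, u) for u in urls]
    for future in as_completed(futures):
        results.append(future.result())  # re-raises any exception from the worker

workers = {name for _, name in results}
print(len(results), "downloads finished using", len(workers), "worker thread(s)")
```

A small `max_workers` value also helps keep the load on the server modest.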

Complete Code

from bs4 import BeautifulSoup
import requests
import os
import threading
import time


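def create_directory(directory_path):
    # exist_ok=True avoids a FileExistsError when several download threads
    # try to create the same directory at the same time.
    already_exists = os.path.isdir(directory_path)
    os.makedirs(directory_path, exist_ok=True)
    if already_exists:
        print(f"Directory '{directory_path}' already exists.")
    else:
        print(f"Directory '{directory_path}' created.")


def download_image(session, download_url, directory_path, title, num):
    create_directory(directory_path)  # make sure the target directory exists
    image_path = os.path.join(directory_path, f"{title}_{num}.jpg")
    if not os.path.exists(image_path):
        try:
            # `headers` is the module-level dict defined further down
            img = session.get(url=download_url, headers=headers, timeout=10)
            if img.status_code == 200:
                with open(image_path, "wb") as f:
                    f.write(img.content)  # save the image
                print(f"Image {num} of {title} downloaded!")
            else:
                print(f"Download failed: {download_url}, status code: {img.status_code}")
        except requests.RequestException as e:
            print(f"Error while downloading: {e}")
    else:
        print(f"Image {num} of {title} already exists.")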


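# Use a Session for all network requests
session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
# update() merges into the session's default headers instead of replacing them
session.headers.update(headers)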

# 1. Choose the URLs
url = "https://www.umei.net/i/index_2.html"
csUrl = "https://www.umei.net"

# 2. Send the request
index = session.get(url=url, timeout=10)
index.encoding = "utf-8"
index_html = index.text

# Parse the page
page = BeautifulSoup(index_html, "html.parser")
find_all = page.find_all("li", attrs={"class": "i_list list_n2"})

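for item in find_all:
    link = item.find("a")
    href = link.get("href")  # URL of this photo set's detail page
    title = link.get("title")  # title, used as the folder name
    detail_url = csUrl + href
    print(f"Requesting: {detail_url}, title: {title}")

    num = 1  # image counter
    threads = []  # threads started for this photo set
    while True:
        time.sleep(1)  # pause before each detail-page request
        detail_html = session.get(url=detail_url, timeout=10)
        detail_html.encoding = "utf-8"
        ret = BeautifulSoup(detail_html.text, "html.parser")
        div = ret.find("div", attrs={"class": "image_div"})
        if div is None or div.find("a") is None or div.find("img") is None:
            print(f"Unexpected page structure, stopping: {detail_url}")
            break
        next_page = div.find("a").get("href")

        if next_page == "/tupian/":
            # A link to "/tupian/" is treated as "no next page".
            # Note that the image on this last page is not downloaded.
            print(f"No more pages, this set is done! Title: {title}")
            break

        download_url = div.find("img").get("src")

        # Create and start a download thread for this image
        thread = threading.Thread(target=download_image,
                                  args=(session, download_url, f"./file/image/{title}", title, num))
        threads.append(thread)
        thread.start()

        detail_url = csUrl + next_page  # move on to the next page
        num += 1

    # Wait for all threads of this photo set to finish
    for thread in threads:
        thread.join()

    time.sleep(1)  # pause for 1 second after each photo set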