[Python] Multithreaded Crawler for Downloading Photos of Pretty Girls from a Website (1.62 GB in Total)

Latest recommended article published 2021-06-05 21:17:23

Xavier Jiezou

Category: python. Tags: python, crawler, multithreading

Article link: https://blog.csdn.net/qq_42951560/article/details/116209658

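This article shows how to use a Python crawler to fetch 17,601 high-resolution images from the 1,363 articles on the Vmgirls site, downloading them with multiple threads for a total of about 1.62 GB.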

Preface

This article uses a Python crawler script to download, with multiple threads, every photo on the Vmgirls site.

Target site

Vmgirls (唯美女生): https://www.vmgirls.com/

Dependencies

pip install requests
pip install beautifulsoup4
pip install lxml
pip install fake_useragent
pip install tqdm

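requests: sends HTTP requests to web pages and returns the responses.
BeautifulSoup4: locates and parses elements in a web page.
lxml: the parser backend that the script passes to BeautifulSoup ('lxml').
fake_useragent: generates random, spoofed user-agent strings.
tqdm: prints a progress bar for the downloads.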

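Crawling approach

The goal is to download every photo on the site. The photos live inside individual articles, so the article links have to be found first, and the images are then fetched from each article.

Well-maintained sites usually publish a sitemap that lists the title and link of every article ever posted. Fortunately, this site has one.

The script reads every article title and link from the sitemap, uses each title as the name of the folder that holds that article's images, and extracts the image URLs from each article page to save them locally.

As of April 28, 2021, the Vmgirls site had published 1,363 articles. To speed things up, the script uses a thread pool so that each article is crawled in its own task.

Vmgirls sitemap: https://www.vmgirls.com/sitemap.html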
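
The folder-naming step can be looked at in isolation. The following is a minimal sketch, not the script's exact code: `unique_folder_names` is a hypothetical helper whose counter-based suffix follows the same idea as the full script, while the replacement of path-unsafe characters is an added assumption that the original does not make.

```python
import re


def unique_folder_names(titles):
    """Return one folder name per title, suffixing repeats with 2, 3, ...

    Hypothetical helper: the counter-based suffix follows the article's script;
    replacing path-unsafe characters is an extra assumption, not in the original.
    """
    seen = {}
    names = []
    for title in titles:
        # Assumption: swap characters that Windows forbids in folder names.
        safe = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
        seen[safe] = seen.get(safe, 0) + 1
        # The first occurrence keeps its name; later ones get the count appended.
        names.append(safe if seen[safe] == 1 else f'{safe}{seen[safe]}')
    return names


print(unique_folder_names(['Summer', 'Garden', 'Summer', 'A/B', 'Summer']))
# ['Summer', 'Garden', 'Summer2', 'A_B', 'Summer3']
```

One limitation of this scheme is that a generated name such as `Summer2` could still collide with an article that is really titled `Summer2`.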
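
One thing to keep in mind with a thread pool is that an exception raised inside a submitted task is stored on its `Future` and is only re-raised when `result()` is called, so a failed article can go unnoticed. The sketch below shows one way to collect failures; `crawl_all`, `fake_worker`, the example.com links, and `max_workers=8` are all illustrative assumptions rather than parts of the original script.

```python
import concurrent.futures as cf


def crawl_all(articles, worker, max_workers=8):
    """Run worker(link, title) for every article and collect failures.

    `worker` stands in for a per-article download function; `max_workers=8`
    is an arbitrary assumption, not a value from the article.
    """
    failed = []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its article so errors can be reported.
        futures = {pool.submit(worker, link, title): (link, title)
                   for link, title in articles}
        for future in cf.as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker thread
            except Exception as exc:
                failed.append((futures[future], repr(exc)))
    return failed


def fake_worker(link, title):
    # Stand-in for a real download; fails on one link to show error capture.
    if link.endswith('bad.html'):
        raise ValueError('simulated failure')


demo = [('https://example.com/1.html', 'one'),
        ('https://example.com/bad.html', 'two')]
print(crawl_all(demo, fake_worker))
```

The returned list can then be retried or logged, which is useful when crawling more than a thousand articles in one run.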

Full code

GitHub: https://github.com/XavierJiezou/python-vmgirls-crawl

import os
import time
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
import concurrent.futures as cf
from fake_useragent import UserAgent


class VmgirlsDownloader():
    def __init__(self):
        self.root = 'vmgs'
        os.makedirs(self.root, exist_ok=True)
        self.site = 'https://www.vmgirls.com/'
        self.sitemap = 'https://www.vmgirls.com/sitemap.html'  # the article list is crawled from the sitemap
        self.headers = {'referer': self.site, 'user-agent': UserAgent().random}
        self.page()
        self.main()

    def page(self):
        resp = requests.get(self.sitemap, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        temp = soup.select('h3 + ul li a')  # locate the article list
        articles = []
        temp_dict = {}
        for item in temp:
            href = self.site+item.get('href')
            title = item.get('title')
            if temp_dict.get(title) is None:
                temp_dict[title] = 1
            else:
                temp_dict[title] += 1
                title += str(temp_dict[title])  # append a counter to duplicate folder names
            os.makedirs(os.path.join(self.root, title), exist_ok=True)
            articles.append([href, title])
        self.articles = articles

    def save(self, img_link, img_path):
        resp = requests.get(img_link, headers=self.headers)
        time.sleep(3)
        with open(img_path, 'wb') as f:
            f.write(resp.content)

    def down(self, article_link, article_title):
        resp = requests.get(article_link, headers=self.headers)
        time.sleep(5)
        soup = BeautifulSoup(resp.content, 'lxml')
        imgs = soup.select('div.nc-light-gallery img')  # locate every image in the article
        # Number files by position so that a rerun skips existing files
        # instead of getting stuck on the first name that already exists.
        for name, item in enumerate(tqdm(imgs, desc=article_title), start=1):
            if 'https:' not in item.get('src'):
                img_link = 'https:'+item.get('src')
            else:
                img_link = 'https:'+item.get('srcset').split(' ')[0]
            img_path = f'{self.root}/{article_title}/{name}.{img_link.split(".")[-1]}'
            if not os.path.exists(img_path):
                self.save(img_link, img_path)

    def main(self):
        with cf.ThreadPoolExecutor() as tp:
            for article_link, article_title in self.articles:
                tp.submit(self.down, article_link, article_title)


if __name__ == '__main__':
    VmgirlsDownloader()
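
The script builds image links by prepending `https:` and takes the file extension from the text after the last dot. As a hedged alternative, the standard library can resolve protocol-relative links and read the extension from the URL path only. The helpers `absolute_url` and `file_extension`, the example.com hosts, and the `.jpg` fallback are assumptions made for illustration; they are not part of the original script.

```python
import os
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, raw):
    """Resolve relative and protocol-relative ('//host/path') links."""
    return urljoin(page_url, raw.strip())


def file_extension(url, default='.jpg'):
    """Take the extension from the URL path only, ignoring any query string.

    The '.jpg' fallback is an assumption for URLs without an extension.
    """
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext.lower() or default


page = 'https://www.vmgirls.com/123.html'  # hypothetical article URL
print(absolute_url(page, '//static.example.com/a/b.jpeg'))
print(absolute_url(page, 'image/1.png'))
print(file_extension('https://static.example.com/c.PNG?size=large'))
```

Unlike the expression used in the script, `file_extension` returns the extension with its leading dot.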

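Results

Download the 1.62 GB image set: OneDrive | Baidu Netdisk (extraction code: 2233) | Tianyi Cloud

Item	Description
Target site	https://www.vmgirls.com/ (Vmgirls)
Crawl date	April 28, 2021
Image count	17,601
Total size	1,742,902,332 bytes (about 1.62 GB)
Image types	png, jpg, and jpeg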
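
The size figure in the table can be checked with a few lines of arithmetic. This short sketch uses only the two numbers from the table; it shows that "1.62 GB" is a binary figure (GiB, 2**30 bytes), while the same byte count is about 1.74 GB in decimal units.

```python
TOTAL_BYTES = 1_742_902_332  # from the results table
IMAGE_COUNT = 17_601         # from the results table

gib = TOTAL_BYTES / 2**30    # binary gigabytes (GiB)
gb = TOTAL_BYTES / 10**9     # decimal gigabytes (GB)
avg_kib = TOTAL_BYTES / IMAGE_COUNT / 1024  # average size per image in KiB

print(f'{gib:.2f} GiB, {gb:.2f} GB, about {avg_kib:.0f} KiB per image')
# 1.62 GiB, 1.74 GB, about 97 KiB per image
```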

Single-image preview

[Image: single-image preview]

Multi-image preview

[Image: multi-image preview]

References

https://github.com/psf/requests
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
https://github.com/hellysmile/fake-useragent
https://github.com/tqdm/tqdm