Collecting cosplay images with Python, for anyone who enjoys looking at beautiful people ~ - CSDN Blog

Article link: https://blog.csdn.net/python56123/article/details/123662569

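Goal:

Collect cosplay images with Python

Key points:

1. Systematically analyzing the target web page
2. Ways to parse data out of HTML tags
3. Saving a large number of images in one go

Third-party modules: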

requests >>> pip install requests
parsel >>> pip install parsel

Environment:

Python 3.8
PyCharm 2021 Professional

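Analyzing the website (the approach)

I want every image on the site

Get the links to all album detail pages
Visit each one and collect all of its images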

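Implementation

Send a request, get the data, parse the data, save the data

1. Send a request (visit the page)
2. Get the data (the page source)
3. Parse the data
4. Send a request
5. Get the data
6. Parse the data
7. Save the data
a. How the page number is added to the URL
b. How far the page number goes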
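
The two questions about the page number can be sketched in isolation. This is a minimal sketch, assuming (as the code later in this post does) that page 1 of an album is the plain link and page n is the same link with `_n` placed before `.html`. The helper name `detail_page_urls` and the sample link are hypothetical.

```python
def detail_page_urls(first_page_url, page_count):
    """Return the URLs of every page of one album.

    Assumes page 1 is first_page_url itself and page n (n >= 2) is the
    same URL with '_n' inserted before the '.html' suffix.
    """
    suffix = '.html'
    if first_page_url.endswith(suffix):
        stem = first_page_url[:-len(suffix)]
    else:
        stem = first_page_url
    urls = [first_page_url]
    for number in range(2, page_count + 1):
        urls.append(f'{stem}_{number}{suffix}')
    return urls


# Hypothetical album link, used only for illustration
for page_url in detail_page_urls('http://example.com/album/123.html', 3):
    print(page_url)
```

The upper bound `page_count` is the number the post later reads from the page itself, so the loop never has to guess where an album ends.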

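Let's start coding

First, import the modules

import requests     # send requests
import parsel       # data-parsing tool
import re           # regular expressions

Disguise the request as a browser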

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'
}
for page in range(1, 7):
    url = f'http://www.cosplay8.com/pic/xiezhen/list_224_{page}.html'

1. Send the request (visit the page)

    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    # <Response [200]>: the request succeeded
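
The status is only visible if the response is printed, so a script left running may prefer to stop on a failed request. A minimal sketch of that idea, not the author's code: the helper `fetch_html` and its `get` parameter are hypothetical, and `get` is assumed to behave like `requests.get` (accepting `headers` and `timeout`, returning an object with `raise_for_status()`, `encoding` and `text`).

```python
def fetch_html(url, get, headers=None, timeout=10):
    """Fetch a page and return its text, raising on HTTP errors.

    `get` is any callable with the same interface as requests.get;
    passing it in keeps this sketch free of a hard dependency.
    """
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()      # raises for 4xx / 5xx status codes
    response.encoding = 'utf-8'      # same encoding the post sets by hand
    return response.text
```

With requests installed, the call would look like `fetch_html(url, requests.get, headers=headers)`.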

2. Get the data (the page source)

    #   a. structured data
    #       json
    #   b. unstructured data
    #       html
    html_data = response.text
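
The difference between the two kinds of data shows up in how much work it takes to read a value. A small sketch with made-up sample strings; it uses the standard library's `html.parser` in place of parsel so that it runs without extra installs, and the class name `LinkCollector` is hypothetical.

```python
import json
from html.parser import HTMLParser

# Structured data: JSON maps straight onto Python dicts and lists
json_text = '{"title": "album", "pages": 3}'
record = json.loads(json_text)
print(record['pages'])


# Unstructured data: HTML has to be parsed before values can be pulled out
class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<div><a href="/a.html" title="album">album</a></div>')
print(collector.links)
```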

3. Parse the data

    # <div></div>  <a></a> <p></p> <img />
    selector = parsel.Selector(html_data)
    # ::attr(href): get the value of the tag's href attribute
    url_list = selector.css('.txtover::attr(href)').getall()
    title_list = selector.css('.txtover::attr(title)').getall()
    zip_data = zip(title_list, url_list)
    for title, sub_url in zip_data:
        # '<site URL>' is the author's placeholder for the site's address; fill it in before running
        link = '<site URL>' + sub_url
        print(title, link)
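
The `href` values are relative, so each one has to become a full address before it can be requested. One way to do that without typing the site address by hand is `urllib.parse.urljoin` from the standard library. This is a sketch of an alternative to the string concatenation above, not the author's method; the sample titles and paths are made up.

```python
from urllib.parse import urljoin

list_url = 'http://www.cosplay8.com/pic/xiezhen/list_224_1.html'

# Hypothetical values of the kind the two selectors return
title_list = ['album one', 'album two']
url_list = ['/pic/xiezhen/one.html', 'two.html']

albums = []
for title, sub_url in zip(title_list, url_list):
    # urljoin resolves sub_url against the page it was found on
    albums.append((title, urljoin(list_url, sub_url)))

for title, link in albums:
    print(title, link)
```

Because `urljoin` handles both root-relative and page-relative paths, the same call works whichever form the site uses.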

4. Send the request

        response_1 = requests.get(link, headers=headers)
        response_1.encoding = 'utf-8'

5. Get the data

        html_data_1 = response_1.text
        # The page text contains "共N页" ("N pages in total"); capture N
        page_num = re.findall('共(.*?)页', html_data_1)[0]
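
Indexing the `re.findall` result with `[0]` raises `IndexError` when a page has no pager text. A minimal sketch of a more forgiving version, assuming an album without the text has a single page. The helper name `page_count` and the sample HTML are hypothetical, and the pattern is tightened from `(.*?)` to `(\d+)` so that only digits are captured.

```python
import re


def page_count(html_text, default=1):
    """Return N from the text '共N页' ('N pages in total'), or default if absent."""
    match = re.search(r'共\s*(\d+)\s*页', html_text)
    return int(match.group(1)) if match else default


print(page_count('<span>共12页: </span>'))
print(page_count('<span>no pager here</span>'))
```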

6. Parse the data

        selector_1 = parsel.Selector(html_data_1)
        # id selector
        sub_img = selector_1.css('#bigimg::attr(src)').get()
        img_list = []
        # '<site URL>' is the author's placeholder for the site's address
        img_url_ = '<site URL>' + sub_img
        img_list.append(img_url_)
        # Pages 2..page_num are the album link with _<n> placed before .html
        detail_sub = link.replace('.html', '')
        for sub_page in range(2, int(page_num) + 1):
            detail_url = detail_sub + '_' + str(sub_page) + '.html'
            detail_html = requests.get(detail_url, headers=headers).text
            selector_2 = parsel.Selector(detail_html)
            img_url_ = '<site URL>' + selector_2.css('#bigimg::attr(src)').get()
            img_list.append(img_url_)
        print(img_list)

7. Save the data

        # The img/ folder must already exist
        for img_url in img_list:
            img_data = requests.get(img_url).content
            img_name = img_url.split('/')[-1]   # split the URL on '/' and keep the last part as the file name
            with open(f'img/{img_name}', mode='wb') as f:
                f.write(img_data)
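
The `open` call fails if the `img` folder is missing, and a query string in the image address would end up in the file name. A minimal sketch that handles both, using only the standard library; the helper name `save_image` is hypothetical.

```python
import os
import posixpath
from urllib.parse import urlsplit


def save_image(img_url, img_data, folder='img'):
    """Write img_data into folder, named after the last path segment of img_url."""
    os.makedirs(folder, exist_ok=True)      # create the folder if it is missing
    # urlsplit(...).path leaves out any ?query part of the address
    img_name = posixpath.basename(urlsplit(img_url).path)
    path = os.path.join(folder, img_name)
    with open(path, mode='wb') as f:
        f.write(img_data)
    return path
```

Inside the loop above, `save_image(img_url, img_data)` would stand in for the `with open(...)` block.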