Scrapy Practice (1): Downloading Wallpaper Images with ImagesPipeline

Latest recommended article published on 2024-08-24 16:34:34

耿子666

This is an original article by the author; do not repost it without the author's permission. QQ: 1164014750. WeChat official account: 耿子blog

Article link: https://blog.csdn.net/qq_28817739/article/details/79904391

(1) Preparation

The site we are going to crawl: https://alpha.wallhaven.cc/random

First, inspect how the site marks up an image. This is the markup for a single image:

<html>
 <head></head>
 <body>
  <li class="">
   <figure class="thumb thumb-316105 thumb-sfw thumb-general" data-wallpaper-id="316105" style="width:300px;height:200px">
    <img alt="loading" class="lazyload loaded" data-src="https://alpha.wallhaven.cc/wallpapers/thumb/small/th-316105.jpg" src="https://alpha.wallhaven.cc/wallpapers/thumb/small/th-316105.jpg" />
    <a class="preview" href="https://alpha.wallhaven.cc/wallpaper/316105" target="_blank"></a>
    <div class="thumb-info">
     <span class="wall-res">1920 x 1280</span>
     <a class="overlay-anchor wall-favs" href="https://alpha.wallhaven.cc/wallpaper/316105/favorites">3<i class="fa fa-fw fa-star"></i></a>
     <a class="jsAnchor thumb-tags-toggle tagged" title="Tags" href="https://alpha.wallhaven.cc/wallpaper/316105/thumbTags"><i class="fa fa-fw fa-tags"></i></a>
    </div>
   </figure></li>
 </body>
</html>

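Working it out with XPath:

//figure/@data-wallpaper-id returns the IDs of all images on the page.

With an image ID in hand, we can select that image's whole tag:

//figure[@data-wallpaper-id="316105"] (here 316105 is the image ID)

This returns the entire figure tag, from which the required fields can be extracted. The field-by-field analysis is in the spider section below.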
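The two expressions can be tried offline before any spider exists. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath (it cannot select attributes with /@attr), so attributes are read with .get() instead. SAMPLE is a trimmed copy of the markup above, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the figure markup shown above.
SAMPLE = """
<li>
  <figure class="thumb" data-wallpaper-id="316105">
    <img alt="loading" data-src="https://alpha.wallhaven.cc/wallpapers/thumb/small/th-316105.jpg" />
    <div class="thumb-info">
      <span class="wall-res">1920 x 1280</span>
    </div>
  </figure>
</li>
"""

root = ET.fromstring(SAMPLE.strip())

# Counterpart of //figure/@data-wallpaper-id: collect every image ID.
image_ids = [fig.get("data-wallpaper-id") for fig in root.findall(".//figure")]

# Counterpart of //figure[@data-wallpaper-id="316105"]: select one figure by ID.
figure = root.find(".//figure[@data-wallpaper-id='316105']")

# Fields read relative to the selected figure.
thumbnail_url = figure.find("img").get("data-src")
resolution = figure.find("div/span").text

print(image_ids, thumbnail_url, resolution)
```

Inside a real spider the same lookups would go through response.xpath(...), which accepts the full XPath expressions given in the text.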

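(2) Create the Scrapy project

Command: scrapy startproject wallhavenSpider

This generates the standard Scrapy project skeleton.

The files you usually need to edit are settings.py, items.py, and pipelines.py.

1. Configure settings.py

Set USER_AGENT: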

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Enable the ITEM_PIPELINES setting. For this exercise, the pipeline class that Scrapy generated by default is enough.

ITEM_PIPELINES = {
    'wallhavenSpider.pipelines.WallhavenspiderPipeline': 300,
}

We also configure a directory for storing the downloaded images.

# Storage directory used by ImagesPipeline
IMAGES_STORE = "H:\\python_workspace\\scrapy\\imgaedownload"

Configure anything else as needed.

2. Write items.py

This file defines the fields we want to store.

import scrapy


class WallhavenspiderItem(scrapy.Item):
    # Image ID
    imageId = scrapy.Field()
    # Thumbnail URL
    imageThumbnailUrl = scrapy.Field()
    # Image resolution
    imageSize = scrapy.Field()
    # Full-size download URL
    imageDownloadUrl = scrapy.Field()
    # URL of the image's tag list
    imageTagUrl = scrapy.Field()
    # Local path of the saved image (not used for now)
    # imagePath = scrapy.Field()

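We will write pipelines.py after the spider is done.

(3) Create the spider

Go into the spiders directory and run: scrapy genspider imageInfoDownload "alpha.wallhaven.cc"

This creates a spider template in the spiders directory.

1. Working out how to crawl the site

First, confirm the URL to crawl: https://alpha.wallhaven.cc/random

Next, go to the next page and watch how the URL changes to find the pagination parameter:

https://alpha.wallhaven.cc/random?page=x

Pagination is driven by the page parameter, so we can crawl each page by changing its value.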
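As a small illustration of the pagination scheme, the sketch below generates the listing URLs for a range of pages. The helper name page_urls and the constant BASE_URL are hypothetical and are not part of the project code.

```python
# Listing URL without the page number (taken from the URL pattern above).
BASE_URL = "https://alpha.wallhaven.cc/random?page="


def page_urls(first_page, last_page):
    """Yield the listing URL for every page from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield BASE_URL + str(page)


urls = list(page_urls(1, 3))
for url in urls:
    print(url)
```

An empty range (for example page_urls(5, 4)) yields nothing, which is a convenient way to express "no more pages".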

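The image markup was already analyzed at the start of the article.

Now we use XPath to pull out the individual pieces of information.

(1) Select the whole tag for one image:

//figure[@data-wallpaper-id="+id+"]

Here id is the image's ID, and +id+ stands for Python string concatenation. Later we collect all image IDs on a page and crawl page by page.

Let's take a single ID to test with first.

(2) Thumbnail URL

//figure[@data-wallpaper-id="378330"]/img/@data-src

(3) Resolution

//figure[@data-wallpaper-id="378330"]/div/span/text()

(4) Full-size image URL

This URL is only visible after clicking through to the image's detail page.

However, it turns out that these URLs all share the same prefix; only the image ID changes.

So once we have the image ID, we can build the full-size URL by concatenation.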
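The concatenation can be sketched as a small helper. The prefix is the one used in this article's spider code; the function name build_download_url is hypothetical, and the sketch assumes, as the article does, that the full-size image has the same file extension as its thumbnail.

```python
# Common prefix of the full-size image URLs (as used in the spider code).
FULL_IMAGE_PREFIX = "https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-"


def build_download_url(image_id, thumbnail_url):
    """Build the full-size URL from the image ID, reusing the thumbnail's extension."""
    # Assumption: the full-size file has the same extension as the thumbnail.
    suffix = thumbnail_url.rsplit(".", 1)[-1]
    return FULL_IMAGE_PREFIX + image_id + "." + suffix


download_url = build_download_url(
    "316105",
    "https://alpha.wallhaven.cc/wallpapers/thumb/small/th-316105.jpg",
)
print(download_url)
```

If the extensions ever differ, the request for the guessed URL would fail, so this shortcut trades accuracy for skipping one request to the detail page.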

2. Write the spider file

# -*- coding: utf-8 -*-
import scrapy
from wallhavenSpider.items import WallhavenspiderItem


class ImageinfodownloadSpider(scrapy.Spider):
    """Crawl image information from the random wallpaper listing."""
    name = 'imageInfoDownload'
    allowed_domains = ['alpha.wallhaven.cc']
    # Base URL for the paginated requests
    url = 'https://alpha.wallhaven.cc/random?page='
    offset = 1
    # Last page to request
    max_page = 13320

    reqUrl = url + str(offset)

    start_urls = [reqUrl]

    def parse(self, response):
        """Yield one item per figure on the page, then request the next page."""
        for imginfo in response.xpath("//figure[@data-wallpaper-id]"):
            # Create a new item for each image
            item = WallhavenspiderItem()

            # Image ID
            image_id = imginfo.xpath("./@data-wallpaper-id").extract_first()
            item['imageId'] = image_id
            # Thumbnail URL
            item['imageThumbnailUrl'] = imginfo.xpath("./img/@data-src").extract_first()
            # Image resolution
            item['imageSize'] = imginfo.xpath("./div/span/text()").extract_first()
            # URL of the image's tag list
            item['imageTagUrl'] = imginfo.xpath('./div/a[@title="Tags"]/@href').extract_first()
            # Full-size URL, for example:
            # https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-634130.jpg
            # Reuse the file extension of the thumbnail
            imgSuffix = item['imageThumbnailUrl'].split('.')[-1]
            item['imageDownloadUrl'] = ('https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-'
                                        + image_id + '.' + imgSuffix)

            # Hand the item over to the pipeline
            yield item

        # Pagination: request the next page only while pages remain
        if self.offset < self.max_page:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

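Run the spider to test it: scrapy crawl imageInfoDownload

Check that it crawls normally.

(4) Write pipelines.py

pipelines.py is where items (the scraped data) are processed.

Our requirement here is to download images.

Scrapy ships with a pipeline for downloading images: ImagesPipeline

Import it like this: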

from scrapy.pipelines.images import ImagesPipeline

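ImagesPipeline exposes two methods that we override to drive the download: get_media_requests yields a request for each image URL in the item, and item_completed is called once all of the item's requests have finished.

get_media_requests and item_completed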
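The value passed to item_completed is a list of two-element tuples, one per request: a success flag, followed by a dict describing the downloaded file (including its path relative to IMAGES_STORE) or by the failure. The sketch below filters such a list the same way the pipeline does. The sample values are made up for illustration, and a plain Exception stands in for the failure object Scrapy would pass.

```python
# Hypothetical results for one item: one successful download and one failure.
results = [
    (True, {
        "url": "https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-316105.jpg",
        "path": "full/0a1b2c.jpg",   # made-up relative path inside IMAGES_STORE
        "checksum": "d41d8cd9",      # made-up checksum
    }),
    (False, Exception("download failed")),  # stand-in for Scrapy's failure object
]

# Keep only the relative paths of the downloads that succeeded.
image_paths = [file_info["path"] for ok, file_info in results if ok]
print(image_paths)
```

When every download fails the list is empty, so indexing it with [0] raises an IndexError unless that case is handled first.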

Here is the implementation:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class WallhavenspiderPipeline(ImagesPipeline):
    # Read the storage directory configured in settings.py
    IMAGE_SOURCE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        # Request the full-size image URL built by the spider
        image_url = item['imageDownloadUrl']
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples, one per request
        image_path = [x["path"] for ok, x in results if ok]
        if not image_path:
            # Nothing was downloaded, so there is no file to rename
            raise DropItem("Image download failed: %s" % item["imageId"])
        # Rename the downloaded file to <imageId>.jpg inside IMAGES_STORE
        os.rename(os.path.join(self.IMAGE_SOURCE, image_path[0]),
                  os.path.join(self.IMAGE_SOURCE, item["imageId"] + ".jpg"))
        item['imageDownloadUrl'] = image_path
        return item

The os.rename call in item_completed renames the downloaded file from the name ImagesPipeline generated to the image ID, inside the IMAGES_STORE directory.

These two methods can largely be reused as template code; only a few parameters need to change.
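The rename step can also be tried on its own, without Scrapy. The sketch below is a standalone, hedged variant that builds paths with os.path.join instead of a hard-coded backslash and keeps whatever extension the downloaded file has. The helper name rename_download and the sample file name are hypothetical, and a temporary directory stands in for IMAGES_STORE.

```python
import os
import tempfile


def rename_download(store_dir, relative_path, image_id):
    """Rename store_dir/relative_path to store_dir/<image_id><ext>; return the new path."""
    # The relative path uses forward slashes, so split it before joining.
    source = os.path.join(store_dir, *relative_path.split("/"))
    extension = os.path.splitext(source)[1]
    target = os.path.join(store_dir, image_id + extension)
    os.rename(source, target)
    return target


# Demo: a temporary directory plays the role of IMAGES_STORE.
with tempfile.TemporaryDirectory() as store:
    os.makedirs(os.path.join(store, "full"))
    with open(os.path.join(store, "full", "0a1b2c.jpg"), "wb") as handle:
        handle.write(b"fake image bytes")

    new_path = rename_download(store, "full/0a1b2c.jpg", "316105")
    renamed_name = os.path.basename(new_path)
    renamed_exists = os.path.exists(new_path)

print(renamed_name, renamed_exists)
```

Building paths this way keeps the same logic usable on systems where the path separator is not a backslash.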