Python 3 Web Crawling: Getting Started with Scrapy, Downloading Images with ImagesPipeline

Xiao布_unknown

Article link: https://blog.csdn.net/qq_24076135/article/details/78319055

Python version: Python 3.x
Environment: macOS
IDE: PyCharm

I. Introduction
II. A First Look at ImagesPipeline
III. Changing the Default Download File Name with ImagesPipeline
IV. Summary

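I. Introduction

The previous post walked through a simple hands-on project to get familiar with the Scrapy framework. The images there were downloaded with the requests library, even though Scrapy ships with its own image-downloading facility, ImagesPipeline.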

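II. A First Look at ImagesPipeline

1. Features of ImagesPipeline:

Avoids re-downloading data that was downloaded recently
Lets you specify the storage path
Converts all downloaded images to a common format (JPG) and mode (RGB)
Generates thumbnails
Checks image width/height to make sure they meet a minimum size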

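2. The ImagesPipeline workflow

In a spider, you scrape an item and put the URLs of its images into the image_urls field (a list).
The item is returned from the spider and enters the item pipelines.
When the item reaches ImagesPipeline, the URLs in image_urls are scheduled for download by Scrapy's scheduler and downloader (which means the scheduler and downloader middlewares are reused), with a higher priority, so they are processed before other pages are scraped. The item stays "locked" at this pipeline stage until the downloads have finished (or failed for some reason).
When the downloads are done, another field (images) is populated with the results. This field holds a list of dicts with information about each downloaded image, such as the download path, the original URL (taken from the image_urls field), and the image checksum. The entries in the images list keep the same order as the original image_urls field. If an image fails to download, the error is logged and the image does not appear in the images field.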
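
The effect of the last step on an item can be pictured with plain dicts. This is a sketch only: nothing is downloaded, the results list and its values are made up, and fill_images_field is a hypothetical helper that imitates what the pipeline's default item_completed() does.

```python
def fill_images_field(item, results):
    """Copy the info dict of every successful download into item['images']."""
    item["images"] = [info for ok, info in results if ok]
    return item


item = {"image_urls": ["http://www.example.com/a.jpg",
                       "http://www.example.com/b.jpg"]}

# Made-up download results, in the same order as image_urls; the second one failed.
results = [
    (True, {"url": "http://www.example.com/a.jpg",
            "path": "full/aaaa.jpg",
            "checksum": "0123"}),
    (False, Exception("download failed")),  # stand-in for a Twisted Failure
]

item = fill_images_field(item, results)
print([image["url"] for image in item["images"]])
```

Only the image that downloaded successfully ends up in the images field; the failed one is simply absent.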

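3. ImagesPipeline usage example

I. Define the item

To use a media pipeline, you only need to enable it.
Then, if a spider returns a dict with a 'file_urls' or 'image_urls' key (depending on whether the Files or Images Pipeline is used), the pipeline puts the results under the corresponding key ('files' or 'images').

If you prefer to define the item with Item, you need to declare the required fields, as in this Images Pipeline example: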


import scrapy

class MyItem(scrapy.Item):

    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()

Here I defined an item of my own:

import scrapy

class MscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    image_ids = scrapy.Field()
    image_paths = scrapy.Field()
    pass

II. Configure settings

First, add ITEM_PIPELINES to the project settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

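Next, set IMAGES_STORE to a valid directory where the downloaded images will be stored. Otherwise the pipeline stays disabled, even if you have added it to ITEM_PIPELINES.

For the Images Pipeline, set IMAGES_STORE: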

IMAGES_STORE = '/path/to/valid/dir'

For thumbnails and the other settings, see the official documentation.
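
As a rough illustration of those other settings, a settings.py fragment might look like the sketch below. The setting names are the ones ImagesPipeline reads; every value (sizes, expiry, thumbnail names) is an assumption chosen for the example.

```python
# settings.py (sketch; all values are illustrative assumptions)
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "/path/to/valid/dir"

# Do not re-download images that were fetched within the last 90 days.
IMAGES_EXPIRES = 90

# Thumbnails to generate: name -> (max width, max height).
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Discard images smaller than these dimensions.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```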

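III. Changing the Default Download File Name with ImagesPipeline

1. Reading the documentation

Among the many aspects of ImagesPipeline, file system storage deserves special attention because it defines the default name a file is saved under. To change the default image name, this is where to start.

File system storage

Files are stored using the SHA1 hash of their URL as the file name.

For example, take the following image URL:

http://www.example.com/image.jpg

Its SHA1 hash is:

3afec3b4765f8f0a07b78f98c07b83f013567a0a

The image will be downloaded and stored as:

<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg

Where:

<IMAGES_STORE> is the directory defined in the IMAGES_STORE setting.
full is a sub-directory that separates full-size images from thumbnails (if used).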
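
The naming rule can be reproduced with the standard library alone. A minimal sketch, assuming the URL is hashed as UTF-8 bytes; default_image_path is a hypothetical helper name, not part of Scrapy.

```python
import hashlib


def default_image_path(url: str) -> str:
    """Return the relative path a full-size image of `url` would be stored under."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


path = default_image_path("http://www.example.com/image.jpg")
print(path)  # "full/" + 40 hex characters + ".jpg"
```

The resulting path is relative to IMAGES_STORE, which is why the same URL always maps to the same file.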

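Naturally, we don't want our downloaded images to be named with this unreadable hash string, so we need to change the file name.

The official documentation lists two methods that can be overridden:

get_media_requests(item, info)
item_completed(results, item, info)

get_media_requests(item, info)

As seen in the workflow, the pipeline gets the URLs of the images to download from the item. To do this, you override the get_media_requests() method and return a Request for each image URL: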
def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url) 
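These requests are processed by the pipeline and, when they have finished downloading, the results are sent to the item_completed() method as a list of 2-element tuples. Each tuple contains (success, file_info_or_error), where:

success is a boolean that is True if the image was downloaded successfully and False if it failed for some reason.
file_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem:
url - the URL the file was downloaded from. This is the URL of the request returned from get_media_requests().
path - the path where the image was stored (relative to IMAGES_STORE).
checksum - an MD5 hash of the image contents.

The list of tuples received by item_completed() is guaranteed to keep the same order as the requests returned from get_media_requests(). Here is a typical value of the results argument:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

By default, the get_media_requests() method of the base media pipeline returns None, which means there are no files to download for the item. ImagesPipeline provides its own implementation that builds one Request per URL in the image_urls field.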
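
Because the results keep the order of the requests, they can be paired back with the requested URLs. A small sketch under assumptions: the URLs and paths are invented, and a plain Exception stands in for the Twisted Failure object.

```python
# Hypothetical URLs, in the order the requests were yielded.
requested_urls = [
    "http://www.example.com/1.jpg",
    "http://www.example.com/2.jpg",
]

# Made-up results in the same order; the second download failed.
results = [
    (True, {"checksum": "2b00042f7481c7b056c4b410d28f33cf",
            "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
            "url": requested_urls[0]}),
    (False, Exception("download failed")),
]

# Map every requested URL to its stored path, or None if it failed.
outcome = {}
for url, (ok, value) in zip(requested_urls, results):
    outcome[url] = value["path"] if ok else None

print(outcome)
```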

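item_completed(results, item, info)

The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading or failed for some reason).

The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you need to return (or drop) the item, as you would in any pipeline. Here is an example of item_completed() that stores the downloaded image paths (passed in results) in the image_paths item field and drops the item if it contains no images: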

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    # Keep the storage path of every image that downloaded successfully.
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
By default, the item_completed() method returns the item.

Below is a complete example of an image pipeline using the methods shown above:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

2. Hands-on code

Continuing the demo from the previous post, I changed the code in pipelines.py:

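import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class UnsplashPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        if item['image_ids']:
            # self.store.basedir is the IMAGES_STORE directory (filesystem storage only).
            new_path = "full/" + item['image_ids'][0] + ".jpg"
            os.rename(os.path.join(self.store.basedir, image_paths[0]),
                      os.path.join(self.store.basedir, new_path))
        else:
            # No id available: keep the default hash-based name.
            new_path = image_paths[0]
        item['image_paths'] = new_path
        return item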

In essence, this approach lets ImagesPipeline save the file under its default name first and then renames it.
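
The rename step can be tried in isolation, without Scrapy. A minimal sketch, assuming a temporary directory in place of IMAGES_STORE; the hash-based file name and the image id "sunset-001" are made up for the example.

```python
import os
import tempfile

# Stand-in for IMAGES_STORE (a temporary directory, for illustration only).
images_store = tempfile.mkdtemp()
os.makedirs(os.path.join(images_store, "full"))

# Pretend the pipeline has just stored an image under its hash-based name.
default_path = "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg"
with open(os.path.join(images_store, default_path), "wb") as f:
    f.write(b"fake image bytes")

# Rename it to a readable name built from a hypothetical image id.
image_id = "sunset-001"
new_path = "full/" + image_id + ".jpg"
os.rename(os.path.join(images_store, default_path),
          os.path.join(images_store, new_path))

print(os.listdir(os.path.join(images_store, "full")))  # ['sunset-001.jpg']
```

One likely side effect of renaming after the fact: the pipeline checks for the hash-named file to decide whether an image was downloaded recently, so renamed images will probably be fetched again on the next run.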

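3. A brief look at the ImagesPipeline source

Reading the source shows that file_path() is the method that assigns the file name to an image, so overriding it directly looks like the ideal fix. First, here is the source of file_path():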

def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)

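Modifying file_path itself just to change the file path would be too invasive to the original code, which is why the official documentation I read did not suggest overriding file_path either. Note that later versions of the Scrapy documentation do describe overriding file_path() in a pipeline subclass as a way to customize the download path.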
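
For comparison, the path logic that a file_path() override would need is small. A minimal sketch, assuming the image id is available alongside the URL (for instance carried in request.meta); readable_image_path and its parameters are hypothetical names, not Scrapy API.

```python
import hashlib
from typing import Optional


def readable_image_path(url: str, image_id: Optional[str] = None) -> str:
    """Build a relative storage path: the image id if usable, else the SHA1 of the URL."""
    if image_id:
        # Keep only characters that are safe in a file name.
        safe_id = "".join(c for c in image_id if c.isalnum() or c in "-_")
        if safe_id:
            return "full/%s.jpg" % safe_id
    # Fall back to the default hash-based name.
    return "full/%s.jpg" % hashlib.sha1(url.encode("utf-8")).hexdigest()


print(readable_image_path("http://www.example.com/image.jpg", "photo_42"))  # full/photo_42.jpg
```

In a subclass of ImagesPipeline, a file_path() override could return a value built this way instead of the hash-based name, which avoids the separate rename step.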

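The ImagesPipeline source is reproduced below for reference. It comes from an older Scrapy release written for Python 2, as the use of StringIO and iteritems() shows.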

class ImagesPipeline(FilesPipeline):
    """Abstract pipeline that implement the image thumbnail generation logic

    """

    MEDIA_NAME = 'image'
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    THUMBS = {}
    DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    @classmethod
    def from_settings(cls, settings):
        cls.MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
        cls.MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
        cls.EXPIRES = settings.getint('IMAGES_EXPIRES', 90)
        cls.THUMBS = settings.get('IMAGES_THUMBS', {})
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

        cls.IMAGES_URLS_FIELD = settings.get('IMAGES_URLS_FIELD', cls.DEFAULT_IMAGES_URLS_FIELD)
        cls.IMAGES_RESULT_FIELD = settings.get('IMAGES_RESULT_FIELD', cls.DEFAULT_IMAGES_RESULT_FIELD)
        store_uri = settings['IMAGES_STORE']
        return cls(store_uri)

    def file_downloaded(self, response, request, info):
        return self.image_downloaded(response, request, info)

    def image_downloaded(self, response, request, info):
        checksum = None
        for path, image, buf in self.get_images(response, request, info):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            width, height = image.size
            self.store.persist_file(
                path, buf, info,
                meta={'width': width, 'height': height},
                headers={'Content-Type': 'image/jpeg'})
        return checksum

    def get_images(self, response, request, info):
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(StringIO(response.body))

        width, height = orig_image.size
        if width < self.MIN_WIDTH or height < self.MIN_HEIGHT:
            raise ImageException("Image too small (%dx%d < %dx%d)" %
                                 (width, height, self.MIN_WIDTH, self.MIN_HEIGHT))

        image, buf = self.convert_image(orig_image)
        yield path, image, buf

        for thumb_id, size in self.THUMBS.iteritems():
            thumb_path = self.thumb_path(request, thumb_id, response=response, info=info)
            thumb_image, thumb_buf = self.convert_image(image, size)
            yield thumb_path, thumb_image, thumb_buf

    def convert_image(self, image, size=None):
        if image.format == 'PNG' and image.mode == 'RGBA':
            background = Image.new('RGBA', image.size, (255, 255, 255))
            background.paste(image, image)
            image = background.convert('RGB')
        elif image.mode != 'RGB':
            image = image.convert('RGB')

        if size:
            image = image.copy()
            image.thumbnail(size, Image.ANTIALIAS)

        buf = StringIO()
        image.save(buf, 'JPEG')
        return image, buf

    def get_media_requests(self, item, info):
        return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]

    def item_completed(self, results, item, info):
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)

    def thumb_path(self, request, thumb_id, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.thumb_key(url) method is deprecated, please use '
                          'thumb_path(request, thumb_id, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from thumb_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if thumb_key() method has been overridden
        if not hasattr(self.thumb_key, '_base'):
            _warn()
            return self.thumb_key(url, thumb_id)
        ## end of deprecation warning block

        thumb_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'thumbs/%s/%s.jpg' % (thumb_id, thumb_guid)

    # deprecated
    def file_key(self, url):
        return self.image_key(url)
    file_key._base = True

    # deprecated
    def image_key(self, url):
        return self.file_path(url)
    image_key._base = True

    # deprecated
    def thumb_key(self, url, thumb_id):
        return self.thumb_path(url, thumb_id)
    thumb_key._base = True

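IV. Summary

The tools Scrapy provides out of the box are rich and practical. My understanding of Scrapy is still limited and at a beginner level, and this post is only a summary of what I learned about ImagesPipeline on my own. If you spot any mistakes, corrections are welcome.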
