Chapter 9: Downloading Files and Images

Latest recommended article published on 2021-05-20 11:52:28

三丁目の夕阳下的白菜

Tags: scrapy, python, web crawler

Article link: https://blog.csdn.net/keenshinsword/article/details/79097301

The Scrapy framework provides two Item Pipelines dedicated to downloading files and images:

FilesPipeline
ImagesPipeline

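To use them, you pass the URLs of the files or images to download through a special item field. The pipeline then downloads them to local storage automatically and stores the download results in another special item field, so they can be reviewed in the exported output.

Using FilesPipeline

1. In the configuration file settings.py, enable FilesPipeline. It is usually placed ahead of the other Item Pipelines

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}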

2. In settings.py, use FILES_STORE to specify the directory files are downloaded to

FILES_STORE = '/home/keenshin/Downloads'

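3. When the Spider parses a page containing download links, collect the URLs of all files to download into a list and assign it to the item's file_urls field (item['file_urls']). When FilesPipeline processes each item, it reads item['file_urls'] and downloads every URL in it

import scrapy


class DownloadBookSpider(scrapy.Spider):
    name = 'download'

    def parse(self, response):
        item = {}
        # List of URLs to download
        item['file_urls'] = []
        for url in response.xpath('//a/@href').extract():
            # Convert a possibly relative link into an absolute URL
            download_url = response.urljoin(url)
            # Add the URL to the download list
            item['file_urls'].append(download_url)

        yield item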

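Once FilesPipeline has downloaded all the files in item['file_urls'], it collects the download result of each file into another list and assigns it to the item's files field (item['files']). Each download result contains the following:

path: the local path the file was downloaded to (relative to FILES_STORE)
checksum: the checksum of the file
url: the URL of the file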
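
As a rough illustration of how these result fields might be consumed, the sketch below resolves each recorded path against the download directory. The helper name local_paths and the item values are hypothetical placeholders, not output from a real crawl.

```python
import os


def local_paths(item, files_store):
    """Join FILES_STORE with each 'path' recorded in item['files']."""
    return [os.path.join(files_store, result['path'])
            for result in item.get('files', [])]


# Hypothetical item as it might look after the pipeline has run;
# every value below is a made-up placeholder.
item = {
    'file_urls': ['https://example.com/demo.py'],
    'files': [
        {
            'url': 'https://example.com/demo.py',
            'path': 'full/placeholder.py',
            'checksum': 'placeholder-checksum',
        },
    ],
}

print(local_paths(item, 'downloads'))
```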

Using ImagesPipeline

ImagesPipeline is a subclass of FilesPipeline and is used in much the same way. It differs only slightly in the item fields and configuration options it uses

	FilesPipeline	ImagesPipeline
Import path	scrapy.pipelines.files.FilesPipeline	scrapy.pipelines.images.ImagesPipeline
Item fields	file_urls, files	image_urls, images
Download directory setting	FILES_STORE	IMAGES_STORE

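On top of FilesPipeline, ImagesPipeline adds some image-specific features:

Generating thumbnails for images

To enable this feature, set IMAGES_THUMBS in settings.py. It is a dictionary in which each key is a thumbnail name and each value is that thumbnail's size

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

With this feature enabled, downloading one image produces 3 local images (1 original and 2 thumbnails)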
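
To make the "3 images" claim concrete, here is a minimal sketch of where those files would be expected to land. It assumes Scrapy's default layout (full/<sha1>.jpg for the original and thumbs/<name>/<sha1>.jpg for each thumbnail); the function expected_image_paths and the example URL are hypothetical, and this does not call Scrapy itself.

```python
import hashlib

# Two thumbnail sizes, keyed by thumbnail name
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}


def expected_image_paths(url, thumbs):
    """List the relative paths assumed for one image and its thumbnails."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = [f'full/{guid}.jpg']  # the original image
    paths += [f'thumbs/{name}/{guid}.jpg' for name in thumbs]  # one per size
    return paths


paths = expected_image_paths('https://example.com/cat.png', IMAGES_THUMBS)
for p in paths:
    print(p)
```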

Filtering out images that are too small

To enable this feature, set IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT in settings.py
```
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```
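
The effect of these two settings can be sketched as a simple predicate. This is an illustration of the rule, assuming an image is kept only when both dimensions reach the configured minimum; is_large_enough is a hypothetical helper, not part of Scrapy.

```python
def is_large_enough(width, height, min_width=110, min_height=110):
    """Return True if an image meets both minimum dimensions."""
    return width >= min_width and height >= min_height


print(is_large_enough(110, 110))  # exactly at the threshold
print(is_large_enough(109, 200))  # too narrow
```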

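Hands-on project

Requirement: download the source files of all the examples on https://matplotlib.org/examples/index.html to local storage

Page analysis

All the links to the example pages are inside <li class="toctree-l2"> elements

Next, open one of the example pages and take a look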

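The link to each example's source code is inside an <a class="reference external"> element

Implementation

Steps:

Create a Scrapy project and use the scrapy genspider command to create a Spider
Enable FilesPipeline in the configuration file and specify the download directory
Implement ExampleItem (optional)
Implement ExamplesSpider

1. First create a Scrapy project named matplotlib_examples, then use the scrapy genspider command to create the Spider

scrapy startproject matplotlib_examples
cd matplotlib_examples
scrapy genspider examples matplotlib.org

2. Enable FilesPipeline in the configuration file and specify the download directory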

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = 'examples_src'

3. Implement ExampleItem in items.py. It needs to define two fields, file_urls and files

import scrapy


class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

4. Implement ExamplesSpider

import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import ExampleItem


class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Follow every example page linked from the index
        le = LinkExtractor(restrict_css='li.toctree-l2')
        links = le.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # The source file link is the first <a class="reference external">
        link = response.xpath('//a[@class="reference external"]/@href').extract_first()
        url = response.urljoin(link)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example

Run the crawler

scrapy crawl examples -o examples.json

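Inspect the result with tree examples_src: all the source files were downloaded into the examples_src/full directory, and every file name is a string of hexadecimal digits of equal length, for example 132cd377a2ad813d2f9481a509f80f7abd75ea9f.py

In fact, these digits are the SHA-1 hash of the file's download URL

This naming scheme prevents files with the same name from overwriting each other, but it is not intuitive at all. What we want is for the files to be downloaded into different directories by category, and to be renamed based on the information in examples.json

In the FilesPipeline source code, the file_path method determines how files are named, so we can implement a subclass of FilesPipeline and override file_path to apply our own naming rule. The last two parts of each source URL are the category and the file name.
Implement MyFilesPipeline in pipelines.py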
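
The hash-based naming can be reproduced with the standard library. The sketch below is a simplified approximation of the default behaviour (SHA-1 of the URL plus the URL's extension, under full/), not Scrapy's actual code; default_style_path and the example URL are hypothetical.

```python
import hashlib
import os


def default_style_path(url):
    """Approximate the default naming: full/<sha1 of url><extension>."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()  # always 40 hex chars
    ext = os.path.splitext(url)[1]                        # e.g. '.py'
    return f'full/{guid}{ext}'


path = default_style_path('https://example.com/demo/plot.py')
print(path)
```

Because a SHA-1 hex digest is always 40 characters long, every generated file name has the same length, which matches what the directory listing shows.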

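from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    # Newer Scrapy versions also pass a keyword-only item argument;
    # accepting it keeps this override compatible with them.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last two URL path parts: <category>/<file name>
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))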

Update the configuration file

ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}
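
The naming rule in MyFilesPipeline can be tried out without running a crawl. The sketch below applies the same dirname/basename logic to a plain URL string; it uses posixpath so the result is the same on every platform, and the function name category_and_name and the example URL are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def category_and_name(url):
    """Return '<category>/<file name>' from the last two URL path parts."""
    path = urlparse(url).path
    category = posixpath.basename(posixpath.dirname(path))
    return posixpath.join(category, posixpath.basename(path))


result = category_and_name(
    'https://example.com/mpl_examples/animation/animate_decay.py')
print(result)  # animation/animate_decay.py
```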

Notes:

urlparse

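The urllib.parse module (named urlparse in Python 2) splits a URL into 6 components and returns them as a tuple. It can also reassemble the components into a URL. Its main functions include urljoin, urlsplit, urlunsplit and urlparse.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

>>> from urllib.parse import urlparse
>>> url = urlparse('https://v.qq.com/x/cover/z1263mbpqj2aarn/x0015o12f3z.html')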
>>> url
ParseResult(scheme='https', netloc='v.qq.com', path='/x/cover/z1263mbpqj2aarn/x0015o12f3z.html', params='', query='', fragment='')
>>> url.path
'/x/cover/z1263mbpqj2aarn/x0015o12f3z.html'
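
The other functions mentioned above can be sketched the same way. The relative link passed to urljoin is an illustrative value rather than one taken from the site.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

url = 'https://matplotlib.org/examples/index.html'

# urlsplit is like urlparse but returns 5 components (no separate params)
parts = urlsplit(url)
print(parts.netloc)

# urlunsplit reassembles the components into a URL
rebuilt = urlunsplit(parts)
print(rebuilt)

# urljoin resolves a relative link against a base URL,
# which is what response.urljoin does with the response's own URL
joined = urljoin(url, 'animation/animate_decay.html')
print(joined)
```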