Chapter 9: Downloading Files and Images

lee's work

Category: Learning Scrapy. Tags: python, data mining, programming languages

Article link: https://blog.csdn.net/qq_27608761/article/details/121357027

Chapter 9: Downloading Files and Images

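Downloading files is a common requirement in practice, for example using a crawler to fetch images, videos, Word documents, PDF files, or archives from a website. This chapter covers how to download files and images in Scrapy.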

9.1 FilesPipeline and ImagesPipeline

Scrapy ships with two Item Pipelines dedicated to downloading files and images:

FilesPipeline
ImagesPipeline

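These two Item Pipelines can be regarded as special downloaders. The user passes the URLs of the files or images to download through one special item field; the pipeline then downloads them to local disk automatically and stores the download results in another special item field, so that they can be reviewed in the exported file.

The sections below describe how to use them.

9.1.1 Using FilesPipeline

A simple example illustrates FilesPipeline. The following page offers several novels for download in PDF format: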

<html>
	<body>
		...
		<a href='/book/sg.pdf'>Download "Romance of the Three Kingdoms"</a>
		<a href='/book/shz.pdf'>Download "Water Margin"</a>
		<a href='/book/hlm.pdf'>Download "Dream of the Red Chamber"</a>
		<a href='/book/xyj.pdf'>Download "Journey to the West"</a>
		...
	</body>
</html>

To download all PDF files on the page with FilesPipeline, proceed as follows:

Step 01: Enable FilesPipeline in the configuration file settings.py, usually placing it before the other Item Pipelines:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

Step 02: In settings.py, use FILES_STORE to specify the download directory, for example:

FILES_STORE = '/home/liushuo/Download/scrapy'

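Step 03: When the Spider parses a page that contains download links, collect the URLs of all files to download into a list and assign it to the item's file_urls field (item['file_urls']). When FilesPipeline processes an item, it reads item['file_urls'] and downloads every URL in it. Example Spider code: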

import scrapy

class DownloadBookSpider(scrapy.Spider):
    ...
    def parse(self, response):
        item = {}
        # List of URLs to download
        item['file_urls'] = []
        for url in response.xpath('//a/@href').extract():
            download_url = response.urljoin(url)
            # Add the absolute URL to the download list
            item['file_urls'].append(download_url)
        yield item

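After FilesPipeline has downloaded every file in item['file_urls'], it collects the download result for each file into another list and assigns it to the item's files field (item['files']). Each download result contains the following:

path: the local path of the downloaded file (relative to FILES_STORE).
checksum: the checksum of the file.
url: the URL of the file.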
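
As a rough illustration of how these results might be consumed, the sketch below builds the absolute local path of each downloaded file from an item. The item contents (URL, hash-like file name, checksum) and the helper name local_paths are placeholders made up for this example, not real output.

```python
import posixpath

FILES_STORE = "/home/liushuo/Download/scrapy"  # same value as in the settings example

# Hypothetical item after FilesPipeline has processed it.
# The file name and checksum below are placeholders, not real values.
item = {
    "file_urls": ["http://example.com/book/sg.pdf"],
    "files": [
        {
            "url": "http://example.com/book/sg.pdf",
            "path": "full/0123456789abcdef0123456789abcdef01234567.pdf",
            "checksum": "00000000000000000000000000000000",
        },
    ],
}


def local_paths(item, store):
    """Join FILES_STORE with each relative 'path' in item['files']."""
    return [posixpath.join(store, info["path"]) for info in item.get("files", [])]


paths = local_paths(item, FILES_STORE)
print(paths)
```

Because path is relative to FILES_STORE, moving the download directory only requires changing the one setting.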

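9.1.2 Using ImagesPipeline

Images are files too, so downloading an image is essentially downloading a file. ImagesPipeline is a subclass of FilesPipeline and is used in much the same way; it differs only slightly in the item fields and configuration options it uses, as shown in Table 9-1.

Table 9-1 ImagesPipeline and FilesPipeline

	FilesPipeline	ImagesPipeline
Import path	scrapy.pipelines.files.FilesPipeline	scrapy.pipelines.images.ImagesPipeline
Item fields	file_urls, files	image_urls, images
Download directory	FILES_STORE	IMAGES_STORE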
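
Putting the table together, a settings.py fragment for ImagesPipeline might look like the sketch below. The directory is a hypothetical path chosen for illustration; note also that ImagesPipeline needs the Pillow library installed.

```python
# settings.py (sketch): enable ImagesPipeline instead of FilesPipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# Hypothetical download directory; any writable path works
IMAGES_STORE = '/home/liushuo/Download/scrapy/images'
```

The Spider then fills item['image_urls'] rather than item['file_urls'], and the results appear in item['images'].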

On top of FilesPipeline, ImagesPipeline adds a few image-specific features:

Generating thumbnails
To enable this feature, set IMAGES_THUMBS in settings.py. It is a dictionary in which each value is a thumbnail size:

IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}

With this enabled, downloading one image produces three local files (the original image and two thumbnails) at the following paths:

[IMAGES_STORE]/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
[IMAGES_STORE]/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
[IMAGES_STORE]/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
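
The layout above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that assumes the default naming scheme (the SHA-1 hash of the image URL, with every image stored as JPEG); the function name image_paths and the example URL are made up for illustration.

```python
import hashlib

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}


def image_paths(url, thumbs):
    """Return the relative paths expected for one image URL (sketch)."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = [f'full/{guid}.jpg']  # the original image
    # One thumbnail per entry in IMAGES_THUMBS, keyed by its name
    paths += [f'thumbs/{thumb_id}/{guid}.jpg' for thumb_id in thumbs]
    return paths


paths = image_paths('http://example.com/images/cover.png', IMAGES_THUMBS)
for p in paths:
    print(p)
```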

Filtering out images that are too small
To enable this feature, set IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT in settings.py; they specify the minimum width and height of an image:

IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

With this enabled, a downloaded 105×200 image is discarded, because its width is below the minimum.
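
The rule amounts to a simple comparison: an image is rejected if either dimension falls below its minimum. The sketch below illustrates that check; is_too_small is a hypothetical helper written for this example, not part of Scrapy.

```python
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110


def is_too_small(width, height,
                 min_width=IMAGES_MIN_WIDTH, min_height=IMAGES_MIN_HEIGHT):
    """Return True if the image fails either minimum-size constraint."""
    return width < min_width or height < min_height


print(is_too_small(105, 200))  # width 105 is below 110
print(is_too_small(110, 110))  # exactly at the minimum is accepted
```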

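9.2 Hands-on Project: Crawling the matplotlib Example Source Files

Let's now build a project that downloads files with FilesPipeline.
matplotlib is a well-known Python plotting library, widely used in scientific computing and data analysis. Its website hosts many example programs; opening https://matplotlib.org/2.0.2/examples/index.html in a browser shows the example list page in Figure 9-1.
There are several hundred examples, grouped into categories. Clicking the first example opens its page, shown in Figure 9-2, at https://matplotlib.org/2.0.2/examples/animation/animate_decay.html.
On each example page the user can read the source code or click the source code button to download the source file. To download the source files of all examples, we can write a crawler.
Figure 9-1
Figure 9-2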

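9.2.1 Project Requirements

Download the source files of all examples on http://matplotlib.org to local disk.

9.2.2 Page Analysis

First, let's see how to obtain the links to all example pages from the example list page https://matplotlib.org/2.0.2/examples/index.html. Use the scrapy shell command to download the page, then call view to inspect it in the browser, as shown in Figure 9-3.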

$ scrapy shell https://matplotlib.org/2.0.2/examples/index.html
2021-11-16 16:06:17 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: toscrapy_book)
2021-11-16 16:06:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0,
Twisted 21.2.0, Python 3.7.11 (default, Jul 27 2021, 09:46:33) [MSC v.1916 32 bit (Intel)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 A
ug 2021), cryptography 3.4.7, Platform Windows-7-6.1.7600
2021-11-16 16:06:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-16 16:06:17 [scrapy.crawler] INFO: Overridden ...
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x057023B0>
[s]   item       {}
[s]   request    <GET https://matplotlib.org/2.0.2/examples/index.html>
[s]   response   <200 https://matplotlib.org/2.0.2/examples/index.html>
[s]   settings   <scrapy.settings.Settings object at 0x056FD330>
[s]   spider     <DefaultSpider 'default' at 0x5a9d8f0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

>>> view(response)
True

Figure 9-3

Inspection shows that all example page links sit in the <li class="toctree-l2"> elements under <div class="toctree-wrapper compound">, for example:

<a class="reference internal" href="animation/basic_example.html">basic_example</a>

Use LinkExtractor to extract the links to all example pages:


>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div.toctree-wrapper.compound')
>>> links = le.extract_links(response)
>>> [link for link in links]
[Link(url='https://matplotlib.org/2.0.2/examples/animation/index.html', text='animation Examples', fragment='',nofollow=False),...,Link(url='https://matplotlib.org/2.0.2/examples/widgets/span_selector.html', text='span_selector', fragment='', nofollow=False)]

>>> [link for link in links][0]
Link(url='https://matplotlib.org/2.0.2/examples/animation/index.html', text='animation Examples', fragment='', nofollow=False)
>>> [link for link in links][0].url
'https://matplotlib.org/2.0.2/examples/animation/index.html'
>>> [link for link in links].url  # a list has no url attribute, so this raises an error
...
>>> [link.url for link in links]
['https://matplotlib.org/2.0.2/examples/animation/index.html', ... , 'https://matplotlib.org/2.0.2/examples/widgets/span_selector.html']
>>> len(links)  # 25 more than in the book
532

Next, analyze an example page. Call fetch to download the first example page, then call view to inspect it in the browser, as shown in Figure 9-4.

>>> fetch(https://matplotlib.org/2.0.2/examples/animation/index.html)
  File "<console>", line 1
    fetch(https://matplotlib.org/2.0.2/examples/animation/index.html)
               ^
SyntaxError: invalid syntax  # the URL is a string and must be quoted
>>> fetch('https://matplotlib.org/2.0.2/examples/animation/index.html')
2021-11-16 16:53:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/2.0.2/examples/animation/index.html> (referer: None)  # 200: the request succeeded
>>> view(response)
True  # the page opens directly in the browser

>>> href =response.css('a.reference.internal::attr(href)').extract_first()
>>> href
'animate_decay.html'
>>> response.urljoin(href)
'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html'
>>> response  # response still points at the index.html page
<200 https://matplotlib.org/2.0.2/examples/animation/index.html>
>>> fetch(response.urljoin(href))  # fetch the joined URL; response now becomes the animate_decay.html page
2021-11-16 17:07:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/2.0.2/examples/animation/animate_decay.html> (referer: None)
>>> response
<200 https://matplotlib.org/2.0.2/examples/animation/animate_decay.html>

Figure 9-4

On an example page, the download address of the example's source file can be found in <a class="reference external">:

<a class="reference internal" href="animate_decay.html">animate_decay</a>

<a class="reference external" href="animate_decay.py">source code</a>

# Extract the source code link 'animate_decay.py'

>>> href = response.css('a.reference.external::attr(href)')
>>> href
[<Selector xpath="descendant-or-self::a[@class and contains(concat(' ', normalize-space(@class), ' '
), ' reference ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' external '))
]/@href" data='animate_decay.py'>]  # a SelectorList; the value we need is in data
>>> href = response.css('a.reference.external::attr(href)').extract()  # extract() returns a list
>>> href
['animate_decay.py']
>>> response.urljoin(href)
Traceback (most recent call last):
 ...
TypeError: Cannot mix str and non-str arguments  # urljoin only accepts a string

>>> href = response.css('a.reference.external::attr(href)').extract()[0]  # href is now a str
>>> href
'animate_decay.py'
>>> response
<200 https://matplotlib.org/2.0.2/examples/animation/animate_decay.html>
>>> response.urljoin(href)  # the join succeeds
'https://matplotlib.org/2.0.2/examples/animation/animate_decay.py'
>>>

This completes the page analysis.
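
The URL joining seen in the shell session can be reproduced outside Scrapy with the standard library's urllib.parse.urljoin, which resolves a relative link against the page URL in the same way. The sketch below is only an illustration; it also shows why passing the list returned by extract() fails.

```python
from urllib.parse import urljoin

page_url = 'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html'
hrefs = ['animate_decay.py']  # what extract() returns: a list of strings

# Passing the whole list mixes str and non-str arguments
try:
    urljoin(page_url, hrefs)
    list_error = None
except TypeError as exc:
    list_error = str(exc)

# Passing a single string resolves it relative to the page URL
download_url = urljoin(page_url, hrefs[0])
print(list_error)
print(download_url)
```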

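9.2.3 Implementation

Complete the project in the following four steps:
(1) Create a Scrapy project and use the scrapy genspider command to create the Spider.
(2) Enable FilesPipeline in the configuration file and specify the download directory.
(3) Implement ExampleItem (optional).
(4) Implement ExamplesSpider.
Step 01: Create a Scrapy project named matplotlib_examples, then use the scrapy genspider command to create the Spider: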

$ ScrapyTest>scrapy startproject matplotlib_examples
New Scrapy project 'matplotlib_examples', using template directory 'E:\ProgramData\Anaconda3\envs\pythonProject4\lib\site-p
ackages\scrapy\templates\project', created in:
    F:\python_work\ScrapyTest\matplotlib_examples

You can start your first spider with:
    cd matplotlib_examples
    scrapy genspider example example.com

$ ScrapyTest>cd matplotlib_examples

$ ScrapyTest\matplotlib_examples>scrapy genspider exmaples matplotlib.org
Created spider 'exmaples' using template 'basic' in module:
  matplotlib_examples.spiders.exmaples

The generated code:
Figure 9.2.3-1
Step 02: Enable FilesPipeline in settings.py and specify the file download directory:

ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'

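Step 03: Implement ExampleItem (named MatplotlibExamplesItem in the generated project). It needs two fields, file_urls and files. Add the following to items.py: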

import scrapy
class MatplotlibExamplesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()
    #pass

Step 04: Implement ExamplesSpider. Set the start URL, then implement the two parsing methods:

import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import MatplotlibExamplesItem


class ExmaplesSpider(scrapy.Spider):
    name = 'exmaples'  # the name produced by the genspider command above
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        # Extract links to example pages, skipping each category's index.html
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                           deny='/index.html$')
        links = le.extract_links(response)
        print(len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # The "source code" link holds the relative URL of the .py file
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = MatplotlibExamplesItem()
        example['file_urls'] = [url]
        return example

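In the code above, parse is the callback for the example list page; it extracts the link to each example page, builds a Request from it, and submits it.
parse_example is the callback for the example pages. Each example page contains the download link for the example's source file; parse_example obtains the URL of the source file, puts it into a list, and assigns the list to the item's file_urls field.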
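
The effect of the deny='/index.html$' argument can be illustrated without Scrapy. The sketch below applies the same regular expression to a few of the URLs seen earlier, using plain re to stand in for LinkExtractor's filtering; it illustrates the pattern and is not LinkExtractor's actual implementation.

```python
import re

deny = re.compile(r'/index.html$')  # same pattern as in the spider

urls = [
    'https://matplotlib.org/2.0.2/examples/animation/index.html',
    'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html',
    'https://matplotlib.org/2.0.2/examples/widgets/span_selector.html',
]

# Keep only URLs that the deny pattern does not match
example_pages = [u for u in urls if not deny.search(u)]
print(example_pages)
```

Category index pages end in /index.html and carry no source file, so excluding them avoids requests that would yield no download.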
Once the code is written, run the spider and observe the result:

$ scrapy crawl exmaples -o examples.json
   ...

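When the run finishes, the download results can be found in examples.json:
Figure 9.2.3-2
Now look at the download directory examples_src:
Figure 9.2.3-3
As shown above, 507 source files were downloaded into examples_src/full, and each file name is an odd-looking string of equal length. These strings are the SHA-1 hashes of the download URLs. For example, for a file with the URL: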

http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py

the SHA-1 hash of the URL is:

d9b551310a6668ccf43871e896f2fe6e0228567d

so the file is stored at:

# [FILES_STORE]/full/[SHA1_HASH_VALUE].py
examples_src/full/d9b551310a6668ccf43871e896f2fe6e0228567d.py

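This naming scheme prevents files with the same name from overwriting each other, but the names are not readable and say nothing about the file contents. We would like the example files to be downloaded into separate directories by category. To do this, we could write a separate script that renames the files according to the information in examples.json, or we could change the rule FilesPipeline uses to name files. Here we take the second approach.
Reading the FilesPipeline source shows that its file_path method determines the file name. The relevant code is: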

class FilesPipeline(MediaPipeline):
...
    def file_path(self, request, response=None, info=None, *, item=None):
        media_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        media_ext = os.path.splitext(request.url)[1]
        # Handles empty and wild extensions by trying to guess the
        # mime type then extension or default to empty string otherwise
        if media_ext not in mimetypes.types_map:
            media_ext = ''
            media_type = mimetypes.guess_type(request.url)[0]
            if media_type:
                media_ext = mimetypes.guess_extension(media_type)
        return f'full/{media_guid}{media_ext}'
...

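Now we implement a subclass of FilesPipeline and override file_path to get the naming rule we want. The last two components of each source file URL are the category and the file name, for example: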

http://matplotlib.org/mpl_examples/(axes_grid/demo_floating_axes.py)

The part in parentheses can serve as the file path. Implement MyFilesPipeline in pipelines.py as follows:

from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import join, dirname, basename
...

class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last two URL path components, <category>/<filename>, as the storage path
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
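
The path computation inside file_path can be checked on its own. The sketch below applies the same logic to the example URL, using posixpath so that the result uses forward slashes on every platform; category_path is a hypothetical helper name used only for this illustration.

```python
import posixpath
from urllib.parse import urlparse


def category_path(url):
    """Return '<category>/<filename>' from the last two URL path components."""
    path = urlparse(url).path
    return posixpath.join(posixpath.basename(posixpath.dirname(path)),
                          posixpath.basename(path))


result = category_path(
    'http://matplotlib.org/mpl_examples/axes_grid/demo_floating_axes.py')
print(result)  # axes_grid/demo_floating_axes.py
```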

Update the configuration file to use MyFilesPipeline in place of FilesPipeline:

ITEM_PIPELINES = {
#'scrapy.pipelines.files.FilesPipeline': 1,
'matplotlib_examples.pipelines.MyFilesPipeline': 1,
}

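Rename the folder from the earlier run to examples_src1, rerun the spider, and then look at the examples_src directory:

Figure 9.2.3-4
The result shows that the 507 files were downloaded into 26 directories by category, which is exactly what we wanted.
This completes the file download project.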
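
For comparison, the other approach mentioned earlier, a standalone script that reorganizes the hash-named files using examples.json, could look roughly like the sketch below. It assumes the feed is a JSON array of items whose files entries carry url and path keys; the function name reorganize and the target directory are hypothetical.

```python
import json
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def reorganize(feed_file, store, target):
    """Copy each downloaded file to target/<category>/<filename> (sketch)."""
    store, target = Path(store), Path(target)
    items = json.loads(Path(feed_file).read_text(encoding='utf-8'))
    copied = []
    for item in items:
        for info in item.get('files', []):
            # The last two URL path components are the category and file name
            parts = PurePosixPath(urlparse(info['url']).path).parts
            dest = target / parts[-2] / parts[-1]
            dest.parent.mkdir(parents=True, exist_ok=True)
            # info['path'] is relative to FILES_STORE
            shutil.copy2(store / info['path'], dest)
            copied.append(dest)
    return copied


# Example call (paths are assumptions):
# reorganize('examples.json', 'examples_src', 'examples_by_category')
```

Copying rather than moving leaves the original download directory untouched, so the script can be rerun safely.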

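This post follows the PDF of "Mastering Scrapy Web Crawlers" (《精通Scrapy网络爬虫》, by Liu Shuo). I ran the code myself and modified it slightly; the post is intended only as a reference and as personal study notes.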
