Downloading Files with Scrapy (Chapter 9 of 《精通Scrapy网络爬虫》, "Mastering Scrapy Web Crawling")


mt 2333


Article link: https://blog.csdn.net/qq_43710705/article/details/100082823


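Crawling the matplotlib example source files

1. Requirements analysis

Download the source files of all examples on http://matplotlib.org to the local machine.

2. Page analysis

First, see how to obtain the links to all examples from the examples index page, https://matplotlib.org/examples/index.html. Download the page with the scrapy shell command, then call the view function to open it in the browser.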

scrapy shell https://matplotlib.org/examples/index.html
...
view(response)

Inspecting the page shows that every example link sits inside an <li class="toctree-l2"> element under <div class="toctree-wrapper compound">.
Use LinkExtractor to extract the links to all example pages:

from scrapy.linkextractors import LinkExtractor
le=LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
links=le.extract_links(response)
[link.url for link in links]

Next, analyze an example page. In the scrapy shell, call the fetch function to download the first example page, then call view to open it in the browser.

fetch('https://matplotlib.org/examples/animation/animate_decay.html')
...
view(response)

The download address of the example's source file can be found in <a class="reference external">:

In [2]: href=response.css('a.reference.external::attr(href)').extract_first()
   ...: href
Out[2]: 'animate_decay.py'

In [3]: response.urljoin(href)
Out[3]: 'https://matplotlib.org/examples/animation/animate_decay.py'
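
The response.urljoin call above resolves a relative href against the URL of the page. As a minimal sketch, the same resolution can be reproduced with the standard library's urllib.parse.urljoin, assuming the page declares no <base> tag (which Scrapy would otherwise take into account):

```python
from urllib.parse import urljoin

# URL of the example page fetched in the shell session above
page_url = 'https://matplotlib.org/examples/animation/animate_decay.html'
# Relative href extracted from <a class="reference external">
href = 'animate_decay.py'

# The last path segment of page_url is replaced by the relative href
file_url = urljoin(page_url, href)
print(file_url)  # https://matplotlib.org/examples/animation/animate_decay.py
```

This is why a bare file name in the href is enough to build the full download URL.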

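This completes the analysis.

3. Implementation

The work breaks into four steps:
(1) Create a Scrapy project and use the scrapy genspider command to create the Spider.
(2) Enable FilesPipeline in the settings file and specify the download directory.
(3) Implement ExampleItem (optional).
(4) Implement ExamplesSpider.

(1) First create a Scrapy project named matplotlib_examples, then create the Spider with the scrapy genspider command: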

$ scrapy startproject matplotlib_examples
$ cd matplotlib_examples
$ scrapy genspider examples matplotlib.org

(2) In the settings file settings.py, enable FilesPipeline and specify the download directory:

ITEM_PIPELINES={
    'scrapy.pipelines.files.FilesPipeline':1,
}
FILES_STORE='examples_src'

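(3) Implement ExampleItem. It needs two fields, file_urls and files. Add the following to items.py:

import scrapy

class ExampleItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs of the files to download
    files = scrapy.Field()      # filled in by FilesPipeline with the download results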

(4) Implement ExamplesSpider. First set the starting URL:

import scrapy

class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        pass

The parse method is the callback for the examples index page. It extracts the link to each example page and submits a Request built from it. The extraction details were covered in the page analysis; the parse method is implemented as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Extract links to example pages, skipping the per-category index pages
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny=r'/index\.html$')
        links = le.extract_links(response)
        self.logger.info('Found %d example links', len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        pass
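
The deny argument of LinkExtractor is treated as a regular expression that is searched against each absolute URL; matching URLs are dropped. The sketch below imitates that filtering with the re module on a handful of hypothetical URLs (they are illustrative, not taken from the crawl). The dot is escaped so that only a literal "/index.html" at the end of the URL is excluded; left unescaped, the dot would match any character.

```python
import re

# Hypothetical URLs of the kind linked from the examples index page
urls = [
    'https://matplotlib.org/examples/animation/index.html',
    'https://matplotlib.org/examples/animation/animate_decay.html',
    'https://matplotlib.org/examples/api/index.html',
    'https://matplotlib.org/examples/api/agg_oo.html',
]

# Matches only URLs that end with a literal "/index.html"
deny = re.compile(r'/index\.html$')

# Keep every URL that the deny pattern does not match
kept = [u for u in urls if not deny.search(u)]
print(kept)
```

Only the two example pages remain; the per-category index pages are filtered out.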

In the code above, parse_example is set as the callback for example pages; it is implemented next. Each example page contains the download link of the example's source file. parse_example extracts that URL, puts it in a list, and assigns the list to the file_urls field of an ExampleItem:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import ExampleItem

class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['matplotlib.org']
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Extract links to example pages, skipping the per-category index pages
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound', deny=r'/index\.html$')
        links = le.extract_links(response)
        self.logger.info('Found %d example links', len(links))
        for link in links:
            yield scrapy.Request(link.url, callback=self.parse_example)

    def parse_example(self, response):
        # The first external reference link on an example page points to its source file
        href = response.css('a.reference.external::attr(href)').extract_first()
        if href is None:
            # No source link on this page, so there is nothing to download
            return None
        url = response.urljoin(href)
        example = ExampleItem()
        example['file_urls'] = [url]
        return example

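With the code complete, run the crawler and observe the results.
Run as-is, the crawl fails and nothing is downloaded.
Change ROBOTSTXT_OBEY = True to False: the site's robots.txt disallows crawlers, so the spider has to be configured not to follow that protocol, otherwise the crawl fails.
The scraped items end up in examples.json.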
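
To make the effect of ROBOTSTXT_OBEY concrete, the sketch below evaluates a robots.txt rule set with the standard library's urllib.robotparser. Two assumptions are worth stating loudly: the rules are hypothetical and are not the real contents of the matplotlib.org robots.txt, and Scrapy uses its own robots.txt handling rather than this module, which stands in here only to illustrate the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, NOT the real robots.txt of matplotlib.org
rules = [
    'User-agent: *',
    'Disallow: /examples/',
]

parser = RobotFileParser()
parser.parse(rules)

# A crawler that obeys these rules must skip everything under /examples/
index_ok = parser.can_fetch('*', 'https://example.org/examples/index.html')
other_ok = parser.can_fetch('*', 'https://example.org/gallery.html')
print(index_ok, other_ok)  # False True
```

When a start URL is disallowed like this and ROBOTSTXT_OBEY is True, the request is dropped before any page is parsed, which matches the "nothing gets crawled" symptom described above.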
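The source files are downloaded into the examples_src/full directory. Each file name is a long hexadecimal string, the SHA-1 hash of the file's download URL. Such names are not very readable; we would rather have the example files sorted into separate directories by category. One option is to write a standalone script that renames the files based on the information in examples.json; another is to change the rule FilesPipeline uses to name files. The second option is used here.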
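
For comparison, the first option (a standalone script) might look like the sketch below. It assumes that examples.json is a JSON array of items, for instance exported with scrapy crawl examples -o examples.json, and that each item's files field lists entries with url and path keys as FilesPipeline records them. The function names and the examples_sorted output directory are hypothetical.

```python
import json
import os
import shutil
from urllib.parse import urlparse


def target_path(url):
    """Map a file URL to '<category>/<file name>' using its last two path parts."""
    parts = urlparse(url).path.split('/')
    return os.path.join(parts[-2], parts[-1])


def reorganize(json_path, store_dir, out_dir):
    """Copy every downloaded file listed in the feed to out_dir/<category>/<name>."""
    with open(json_path, encoding='utf-8') as f:
        items = json.load(f)
    for item in items:
        # Each entry in 'files' records the source URL and the path under FILES_STORE
        for info in item.get('files', []):
            src = os.path.join(store_dir, info['path'])
            dst = os.path.join(out_dir, target_path(info['url']))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copyfile(src, dst)
```

A call such as reorganize('examples.json', 'examples_src', 'examples_sorted') would then copy the hashed files into per-category directories while leaving the originals in place.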
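In the FilesPipeline source, the file_path method determines the file name, so we subclass FilesPipeline and override file_path to implement the naming rule we want. The last two parts of each source file URL are the category and the file name, for example:
https://matplotlib.org/examples/animation/animate_decay.py
Implement MyFilesPipeline in pipelines.py as follows: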

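from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    # item is keyword-only with a default, as newer Scrapy releases pass it to file_path
    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. /examples/animation/animate_decay.py -> animation/animate_decay.py
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))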
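
To see the two naming rules side by side without running a crawl, the sketch below applies both to one URL using only the standard library. default_style_path approximates the default behaviour described earlier (the SHA-1 hash of the URL under full/); it is an illustration, not Scrapy's actual code. Both function names are hypothetical, and posixpath is used so that the result always uses forward slashes.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def default_style_path(url):
    """Approximate the default naming: 'full/' + SHA-1 of the URL + extension."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = posixpath.splitext(urlparse(url).path)[1]
    return f'full/{digest}{ext}'


def category_style_path(url):
    """Name the file '<category>/<file name>' from the last two URL path parts."""
    path = urlparse(url).path
    return posixpath.join(posixpath.basename(posixpath.dirname(path)),
                          posixpath.basename(path))


url = 'https://matplotlib.org/examples/animation/animate_decay.py'
print(default_style_path(url))   # full/<40 hex characters>.py
print(category_style_path(url))  # animation/animate_decay.py
```

The second result is the relative path that the overridden file_path returns, which FilesPipeline then resolves under FILES_STORE.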

Update the settings file to use MyFilesPipeline in place of FilesPipeline:

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline':1,
    'matplotlib_examples.pipelines.MyFilesPipeline':1,
}

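Delete all previously downloaded files, rerun the crawler, and look at the examples_src directory again; the files should now be grouped into one subdirectory per category.
Corrections are welcome. Full code: https://github.com/QYQ323/python/tree/master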
