Downloading Files with Scrapy

AI柱子哥

Article link: https://blog.csdn.net/zhoulizhu/article/details/79108268

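The Scrapy framework provides two Item Pipelines dedicated to downloading files and images:
* FilesPipeline
* ImagesPipeline
Both are described in the official documentation.
You can think of them as downloaders. You pass the URLs of the files or images to download through a special item field (file_urls for FilesPipeline, image_urls for ImagesPipeline). The pipeline downloads them into the folder you specify and stores the results in another special item field (files or images), which can be exported for later inspection.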
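
To make the two special fields concrete, the sketch below imitates in plain Python the shape of an item before and after FilesPipeline has processed it. It does not call Scrapy: the helper name `simulate_files_result`, the sample URL, and the sample bytes are hypothetical, and the path format is only an approximation of the default naming. The keys `url`, `path`, `checksum`, and `status` follow what recent Scrapy versions store in the `files` field.

```python
import hashlib
from posixpath import splitext
from urllib.parse import urlparse


def simulate_files_result(url, body):
    """Imitate one entry of the `files` field (not Scrapy's own code)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    ext = splitext(urlparse(url).path)[1]
    return {
        "url": url,                                  # where the file came from
        "path": f"full/{digest}{ext}",               # relative to FILES_STORE
        "checksum": hashlib.md5(body).hexdigest(),   # MD5 of the file content
        "status": "downloaded",
    }


# Before the pipeline runs, the item only carries the URLs to fetch.
item = {"file_urls": ["http://example.com/demo/plot.py"]}

# After the pipeline runs, a second field describes each downloaded file.
item["files"] = [
    simulate_files_result(url, b"print('hi')\n") for url in item["file_urls"]
]
```

The spider only ever fills `file_urls`; the `files` entries are written by the pipeline.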

Crawling matplotlib

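matplotlib is a very useful plotting library, and its website provides many example programs, listed at http://matplotlib.org/examples/index.html. We will download these example files to a local folder so they are easy to look up later.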

Page analysis

The links to the examples sit in li tags with class="toctree-l1", inside a div with class="toctree-wrapper compound". LinkExtractor makes it easy to extract these links from the page:

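from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1',deny='/index.html$')

# A space inside a class attribute becomes '.' in the CSS selector
# restrict_css limits where links are searched for; deny excludes URLs matching the regex

links = le.extract_links(response)
# returns all matching links found in the response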
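
The effect of the `deny` pattern can be checked without Scrapy, since it is an ordinary regular expression searched against each URL. The sketch below is an illustration only: the two URLs are hypothetical samples, and it also shows that the unescaped `.` in the pattern matches any character, so `r'/index\.html$'` is the stricter spelling.

```python
import re

# Hypothetical sample URLs of the kind found on the examples index.
urls = [
    "http://matplotlib.org/examples/animation/index.html",
    "http://matplotlib.org/examples/animation/animate_decay.html",
]

# Escaped version of the pattern: the dot only matches a literal dot.
deny = re.compile(r"/index\.html$")
kept = [u for u in urls if not deny.search(u)]

# The unescaped pattern also matches, for example, "/indexXhtml".
loose = re.compile("/index.html$")
loose_matches_typo = loose.search("/indexXhtml") is not None
strict_matches_typo = deny.search("/indexXhtml") is not None
```

Here `kept` holds only the example page, because the section index page is filtered out.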

On each example page, the download link is in an a tag with class="reference external"; extract it with a CSS selector:

href = response.css('a.reference.external::attr(href)').extract_first()

Implementation

1. Create the project and spider

$ scrapy startproject matp
$ cd matp
$ scrapy genspider matplot matplotlib.org

2. Enable FilesPipeline in settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'

3. Define MatpItem in items.py

import scrapy


class MatpItem(scrapy.Item):
    # file_urls: URLs for FilesPipeline to download
    # files: filled in by FilesPipeline with the download results
    file_urls = scrapy.Field()
    files = scrapy.Field()

4. Write the matplot spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import MatpItem

class MatplotSpider(scrapy.Spider):
    name = "matplot"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Collect the example-page links from the index, skipping the index pages
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1',deny='/index.html$')
        links = le.extract_links(response)
        self.logger.info('found %d example pages', len(links))

        for link in links:
            yield scrapy.Request(link.url,callback=self.parse_link)


    def parse_link(self,response):
        # The href may be relative, so resolve it against the page URL
        href = response.css('a.reference.external::attr(href)').extract_first()
        if href is None:
            return  # no download link on this page
        url = response.urljoin(href)
        matpl = MatpItem()
        matpl['file_urls'] = [url]
        return matpl
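
The URL resolution step in parse_link can be tried in isolation. The sketch below uses the standard library's `urljoin`, which behaves much like `response.urljoin` for a page without a base tag; the helper name `resolve_download_url`, the page URL, and the relative href are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin


def resolve_download_url(page_url, href):
    """Return an absolute URL for href, or None when no link was found."""
    if href is None:
        return None
    return urljoin(page_url, href)


# Hypothetical example page and a relative link to its source file.
page_url = "http://matplotlib.org/examples/animation/animate_decay.html"
href = "../../mpl_examples/animation/animate_decay.py"

absolute = resolve_download_url(page_url, href)
missing = resolve_download_url(page_url, None)
```

FilesPipeline needs absolute URLs in `file_urls`, which is why the spider resolves the href before storing it.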

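5. Customize the saved file names

By default FilesPipeline names each file after the SHA1 hash of its URL and saves it under full/. To keep the category folder and the original file name instead, subclass it in pipelines.py and override file_path: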

from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import basename,dirname,join

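class MyFilePipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # A URL path ending in <category>/<name>.py is stored as
        # <category>/<name>.py under FILES_STORE
        path = urlparse(request.url).path
        return join(basename(dirname(path)),basename(path))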
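
The path computation can be verified on its own. This sketch repeats the same logic outside the pipeline, using `posixpath` so the result is identical on every platform; the function name `custom_path` and both URLs are hypothetical samples.

```python
from posixpath import basename, dirname, join
from urllib.parse import urlparse


def custom_path(url):
    """Keep the last folder name and the file name from the URL path."""
    path = urlparse(url).path
    return join(basename(dirname(path)), basename(path))


# Two hypothetical files with the same name in different categories.
first = custom_path("http://matplotlib.org/mpl_examples/animation/demo.py")
second = custom_path("http://matplotlib.org/mpl_examples/axes_grid/demo.py")
```

Including the parent folder keeps files with the same name in different categories from overwriting each other.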

6. Enable the custom pipeline in settings.py (this modifies step 2)

ITEM_PIPELINES = {
#    'scrapy.pipelines.files.FilesPipeline':1,
    'matp.pipelines.MyFilePipeline':1,
    'matp.pipelines.MatpPipeline': 300,
}
FILES_STORE = 'examples_src'

7. Run the spider
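
The original text ends at this heading. A common way to start the crawl is to run `scrapy crawl matplot` from the project root; the sketch below wraps that command in a small Python launcher. The helper names `crawl_command` and `run_crawl` and the output file name are hypothetical, and the script assumes the `scrapy` command is on the PATH.

```python
import subprocess


def crawl_command(spider, output=None):
    """Build the scrapy command line for one spider."""
    cmd = ["scrapy", "crawl", spider]
    if output is not None:
        cmd += ["-o", output]  # export the scraped items to a file
    return cmd


def run_crawl(spider, output=None):
    """Run the crawl from the current directory (the project root)."""
    return subprocess.run(crawl_command(spider, output), check=True)
```

Calling `run_crawl("matplot", "examples.json")` from the project root would download the files into the FILES_STORE folder and write the items, including their `files` field, to examples.json.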