Scrapy: Using a Custom ImagesPipeline to Customize Image File Names

多余的咸鱼

Last modified 2023-03-24 09:38:35

Tags: crawler, scrapy, python, pillow

First published 2023-03-23 23:59:19

Article link: https://blog.csdn.net/m0_62778558/article/details/129711790

I. Basic usage

II. Custom ImagesPipeline class

Reference: https://www.osgeo.cn/scrapy/topics/media-pipeline.html#using-the-images-pipeline

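I. Basic usage

1. Create the project

2. Extract the image URLs

There are two things to note here:

image_urls is the default key that ImagesPipeline reads image URLs from, so the item must use exactly this name.

The value must be a list. If you extract the URL of a single image, wrap it as [url] so that it becomes a list.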

import scrapy


class ZolSpider(scrapy.Spider):
    name = "zol"
    # Must cover the host used in start_urls (desk.zol.com.cn)
    allowed_domains = ["zol.com.cn"]
    # The target site used for this demo
    start_urls = ["https://desk.zol.com.cn/bizhi/10008_119950_2.html"]

    def parse(self, response):
        # Use XPath to select the image URLs
        url_list = response.xpath("//*[@id='showImg']/li/a/img/@src").getall()

        # image_urls must hold a list of URLs
        yield {"image_urls": url_list}
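The second note above can be made mechanical. The sketch below is a minimal, hypothetical helper (as_url_list is not part of Scrapy) that always returns a list, whether the selector produced one URL string, several, or nothing. It assumes URLs are plain strings.

```python
def as_url_list(value):
    """Return image URLs as a list, as the image_urls field expects.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # A single URL (for example from .get()) must be wrapped; a bare
        # string would otherwise be iterated character by character.
        return [value]
    return list(value)


item = {"image_urls": as_url_list("https://example.com/a.jpg")}
print(item)
```

In the spider, this could wrap the result of either .get() or .getall() before it is placed in the item.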

3. Configure settings.py

Comment out this line so that the crawler does not obey robots.txt:

# ROBOTSTXT_OBEY = True

Enable the pipeline and configure where the images are stored:

ITEM_PIPELINES = {
#    "Scrapy_ImagePipeline.pipelines.ScrapyImagepipelinePipeline": 300,
   "scrapy.pipelines.images.ImagesPipeline":300
}

IMAGES_STORE = r"E:\CS\demo\Scrapy_ImagePipeline\Scrapy_ImagePipeline\spiders\imgs"
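A backslash-separated Windows path written as an ordinary string literal only works as long as none of its backslash sequences happens to be a real escape such as \t or \n; a raw string or pathlib avoids the problem. The sketch below shows one way to build the store path relative to the settings file. The helper name images_store_for and the imgs folder name are illustrative assumptions, not something Scrapy requires.

```python
from pathlib import Path


def images_store_for(settings_file, folder="imgs"):
    """Build an absolute image directory next to the given settings file.

    Hypothetical helper; the folder name is an assumption for illustration.
    """
    return str(Path(settings_file).resolve().parent / folder)


# In settings.py this could be used as:
#     IMAGES_STORE = images_store_for(__file__)
# A raw string also works for a fixed Windows path:
#     IMAGES_STORE = r"E:\some\folder\imgs"
print(images_store_for("settings.py"))
```

Building the path from the settings file keeps the project portable across machines, at the cost of tying the image folder to the project layout.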

4. Create a script file to run the spider

from scrapy.cmdline import execute

execute(['scrapy','crawl','zol'])

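Result after running: the images are saved under IMAGES_STORE, by default in a full/ subfolder with file names derived from the SHA-1 hash of each image URL.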

II. Custom ImagesPipeline class: saving images under custom names

1. Extract the image names; we simply add to the existing spider code

import scrapy


class ZolSpider(scrapy.Spider):
    name = "zol"
    # Must cover the host used in start_urls (desk.zol.com.cn)
    allowed_domains = ["zol.com.cn"]
    # The target site used for this demo
    start_urls = ["https://desk.zol.com.cn/bizhi/10008_119950_2.html"]

    def parse(self, response):
        # Use XPath to select the image URLs and the image names
        url_list = response.xpath("//*[@id='showImg']/li/a/img/@src").getall()
        url_name = response.xpath('//*[@id="showImg"]/li/i/em/text()').getall()
        # Yield one item per image so that each URL travels with its own name
        for url, name in zip(url_list, url_name):
            yield {"image_urls": [url], "image_name": name}
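zip pairs each URL with the name at the same position and silently stops at the shorter list, so a missing caption would drop an image. Below is a minimal sketch of a more defensive pairing, assuming names may be missing or padded with whitespace. The helper build_items and its fallback naming are hypothetical.

```python
from itertools import zip_longest
from os.path import basename, splitext
from urllib.parse import urlparse


def build_items(urls, names):
    """Yield one item per image URL, pairing it with a name.

    Hypothetical helper: if a name is missing or blank, fall back to the
    file name taken from the URL path.
    """
    for url, name in zip_longest(urls, names):
        if url is None:
            # More names than URLs: nothing to download for this name.
            continue
        name = (name or "").strip()
        if not name:
            name = splitext(basename(urlparse(url).path))[0]
        yield {"image_urls": [url], "image_name": name}


items = list(build_items(
    ["https://example.com/img/a1.jpg", "https://example.com/img/b2.jpg"],
    ["  sunset "],
))
print(items)
```

In parse, the loop over zip(url_list, url_name) could then be replaced by yield from build_items(url_list, url_name).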

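2. Customize ImagesPipeline (in pipelines.py)

from scrapy.pipelines.images import ImagesPipeline


class Myimgfile(ImagesPipeline):
    # Override this method to change which item key the image URLs are read from;
    # here the default behavior (reading image_urls) is kept
    def get_media_requests(self, item, info):
        return super().get_media_requests(item, info)

    # Override this method to change the file name each image is saved under
    def file_path(self, request, response=None, info=None, *, item=None):
        name = item.get("image_name")
        return f"myfiles/{name}.jpg"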
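The name scraped from the page goes straight into a file path, so characters such as / or : could produce an unexpected or invalid path, and a missing name would yield None.jpg. The sketch below is a hypothetical helper (safe_image_path is not a Scrapy API) whose result file_path could return instead. It falls back to the SHA-1 of the URL, which mirrors how the stock ImagesPipeline names files, and keeps the .jpg extension because the images pipeline re-encodes downloads as JPEG.

```python
import hashlib
import re


def safe_image_path(name, url, folder="myfiles"):
    """Return a relative path such as 'myfiles/<name>.jpg'.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # Replace characters that are invalid in Windows file names, plus
    # whitespace, with underscores.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name or "").strip("_.")
    if not cleaned:
        # No usable name: fall back to a hash of the URL.
        cleaned = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{folder}/{cleaned}.jpg"


print(safe_image_path("a/b: c?", "https://example.com/x/pic.png"))
```

Inside Myimgfile.file_path this could be called as safe_image_path(item.get("image_name"), request.url), assuming a Scrapy version that passes item to file_path.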

3. Update the pipeline setting

ITEM_PIPELINES = {
#    "Scrapy_ImagePipeline.pipelines.ScrapyImagepipelinePipeline": 300,
   "Scrapy_ImagePipeline.pipelines.Myimgfile": 300,
#    "scrapy.pipelines.images.ImagesPipeline":300,
}

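4. Start crawling and check the final result

As before, the point of interest is the file names and paths shown on the left: the images are now saved under myfiles/ with the names extracted from the page.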