Pitfalls I ran into when scraping images with Scrapy

Latest recommended article published 2023-06-06 10:52:09

weixin_41956627

Category: scrapy. Tags: crawler

Article link: https://blog.csdn.net/weixin_41956627/article/details/115376404


1. Images with the same URL are filtered out and downloaded only once

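For a project I needed to crawl product information. Different products may use the same image URL, and each image has to be saved under the directory of its own product. I wrote a custom pipeline based on ImagesPipeline and found that noticeably fewer images were downloaded than actually existed. After some digging, it turned out that an image with a given URL was downloaded only once and saved under one product's directory; the same image was never downloaded for the other products.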

My understanding was that the duplicates were being filtered out, but overriding get_media_requests and setting dont_filter=True still did not fix it:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for img_url in item['img_url']:
            yield scrapy.Request(img_url, meta=item, dont_filter=True)

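I read the source for a long time and tried all sorts of things without success, and I assumed a bug was keeping dont_filter=True from taking effect.
Then I came across a sentence in a tutorial: "avoid re-downloading media that was downloaded recently". It turns out the source code is implemented to do exactly that.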
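To make that behaviour concrete, here is a toy model of the bookkeeping, written from my reading of Scrapy's media pipeline; it is not Scrapy's actual code. `ToyMediaCache` and `toy_fingerprint` are made-up names, and the assumption is that the real pipeline keys a per-spider cache on a request fingerprint that ignores `dont_filter`.

```python
import hashlib


def toy_fingerprint(url):
    """Stand-in for a request fingerprint: depends on the URL only."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class ToyMediaCache:
    """Simplified model of a per-spider 'already downloaded' cache."""

    def __init__(self):
        self.downloaded = {}   # fingerprint -> cached result
        self.fetch_count = 0   # number of real downloads performed

    def process(self, url, dont_filter=False):
        # dont_filter is accepted but never consulted: in this model it only
        # matters to the scheduler's duplicate filter, not to the media cache.
        fp = toy_fingerprint(url)
        if fp not in self.downloaded:
            self.fetch_count += 1
            self.downloaded[fp] = "result-for-" + url
        return self.downloaded[fp]


cache = ToyMediaCache()
cache.process("https://example.com/a.jpg", dont_filter=True)
cache.process("https://example.com/a.jpg", dont_filter=True)       # served from cache
cache.process("https://example.com/a.jpg?tt=1", dont_filter=True)  # new fingerprint
print(cache.fetch_count)  # prints 2
```

In this model the second request never triggers a download no matter what `dont_filter` is set to, while a URL that differs by even one query parameter does.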

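I then thought of a workaround: append a random, meaningless parameter to each URL so that no two URLs are identical. The images still download normally. The code: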

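import random
import time


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for img_url in item['img_url']:
            # Append a meaningless parameter so ImagesPipeline does not filter out images with the same URL
            sep = '&' if '?' in img_url else '?'  # the URL may already have a query string
            img_url = '%s%stt=%s_%s' % (img_url, sep, time.time(), random.random())
            yield scrapy.Request(img_url, meta=item, dont_filter=True)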

Problem solved!
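One caveat with building the URL by string concatenation is that an image URL may already carry a query string. A minimal sketch of a helper that handles this with the standard library only; `add_cache_buster` and its `param` argument are names I made up for illustration, not part of Scrapy.

```python
import random
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, param="tt"):
    """Return url with one extra throwaway query parameter appended."""
    parts = urlsplit(url)
    # Keep the existing parameters (including blank ones) in their order.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, "%s_%s" % (time.time(), random.random())))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("https://example.com/img/a.jpg"))
print(add_cache_buster("https://example.com/img/a.jpg?size=large"))
```

Note that the query string is re-encoded, which is fine for typical image URLs but can change unusual escaping. The trick also only works when the server ignores unknown parameters; signed URLs would likely break.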

2. ImagesPipeline does not support SFTP

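I needed to save the downloaded images straight to a server over SFTP. At first I did not notice that ImagesPipeline only supports FTP, so I followed a Chinese tutorial from start to finish. After starting the crawl, the page content was scraped normally but no images showed up, and the job hung for several minutes with no response before it ended. I could not find an error message anywhere. What a trap...
Later I tried connecting from a separate script of my own and found that it had simply failed to connect to the server.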
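A quick way to run that kind of check outside Scrapy is to look at the greeting the server sends on connect. This is only a sketch with the standard library; `classify_banner` and `probe` are hypothetical helper names. It relies on SSH servers announcing themselves with an `SSH-` identification string and FTP servers greeting with a `220` reply.

```python
import socket


def classify_banner(banner):
    """Guess the protocol from the first bytes a server sends after connecting."""
    if banner.startswith(b"SSH-"):
        return "ssh"      # SSH/SFTP servers send "SSH-2.0-..." first
    if banner.startswith(b"220"):
        return "ftp"      # FTP servers greet with a 220 reply
    return "unknown"


def probe(host, port, timeout=5.0):
    """Connect, read the greeting, and report what kind of server answered."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return classify_banner(sock.recv(256))
    except OSError as exc:   # refused, unreachable, or timed out
        return "unreachable: %s" % exc


# Example call (placeholder address): probe("x.x.x.x", 22)
```

If the result for port 22 is `ssh`, an FTP client pointed at that port has nothing to talk to, which would match the silent hang described above.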

After studying the source, I wrote a custom ImagesPipeline, which solved it.

First, configure settings.py as follows:

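# Remote image storage
R_HOST = 'x.x.x.x'          # server address
R_PORT = 22                 # port
R_UN = 'xxxxx'              # username
R_PW = 'xxxx'               # password
R_BASE_DIR = '/xxxx/imgs/'  # storage path on the server

# Only the ftp:// prefix matters here and the rest can be made up.
# A correct address just makes Scrapy spend a long time trying to connect.
IMAGES_STORE = 'ftp://x.x.x:22/xxxxx/imgs'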
IMAGES_URLS_FIELD = 'img_url'

Then define a custom ImagesPipeline in pipelines.py:

class MyImagesPipeline(ImagesPipeline):
    def __init__(self, store_uri, download_func=None, settings=None):
        super().__init__(store_uri, download_func=download_func, settings=settings)
        # Set up the remote connection (the R_* values are the ones defined in settings.py)
        self.base_path = R_BASE_DIR
        self.transport = paramiko.Transport((R_HOST, R_PORT))
        self.transport.connect(username=R_UN, password=R_PW)
        self.sftp_client = paramiko.SFTPClient.from_transport(self.transport)
        myspider.logger.info('-------------Opened sftp connection-----------------')

    def image_downloaded(self, response, request, info, *, item=None):
        checksum = None
        for path, image, buf in self.get_images(response, request, info, item=item):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            width, height = image.size

            self.image_save2SFTP(path, image)
            
            # The call below is from the original Scrapy source; it is commented out
            # self.store.persist_file(
            #     path, buf, info,
            #     meta={'width': width, 'height': height},
            #     headers={'Content-Type': 'image/jpeg'})
        return checksum

    def image_save2SFTP(self, img_path, image):
        paths = img_path.split('/')

        # Create the directories in paths on the server; adapt this to your own path layout
        for i, p in enumerate(paths[:-1]):
            try:
                self.sftp_client.mkdir(self.base_path + '/'.join(paths[:i + 1]))
            except Exception:
                # The directory probably exists already, so ignore the error
                continue

        # Save the image to the server
        try:
            # The with block closes the remote file even if saving fails
            with self.sftp_client.open(self.base_path + img_path, 'w') as remote_image:
                image.save(remote_image, 'JPEG')
            myspider.logger.info('save_img_success, img_res: %s' % img_path)
        except Exception as e:
            myspider.logger.error('save_img_fail, %s, Error: %s' % (img_path, e))
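The pipeline above opens the SFTP connection in `__init__` and never closes it. One possible cleanup hook, sketched as a mixin so it can be read on its own: `SFTPCleanupMixin` is a name I made up, and it assumes the `sftp_client` and `transport` attributes set in `__init__` above. As far as I know, Scrapy calls `close_spider(spider)` on item pipelines when the spider finishes, and both paramiko objects have a `close()` method.

```python
class SFTPCleanupMixin:
    """Closes the SFTP client and transport when the spider finishes."""

    def close_spider(self, spider):
        # Let a parent pipeline run its own cleanup first, if it defines one.
        parent_close = getattr(super(), "close_spider", None)
        if parent_close is not None:
            parent_close(spider)
        # Close the client first, then the underlying transport.
        for name in ("sftp_client", "transport"):
            conn = getattr(self, name, None)
            if conn is None:
                continue
            try:
                conn.close()
            except Exception as exc:
                spider.logger.warning("closing %s failed: %s", name, exc)


# Hypothetical usage:
# class MyImagesPipeline(SFTPCleanupMixin, ImagesPipeline): ...
```

Listing the mixin before `ImagesPipeline` lets its `close_spider` run without touching the rest of the class.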

Finally, enable the custom MyImagesPipeline in settings.py.
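For reference, enabling a pipeline usually means adding it to `ITEM_PIPELINES`. In this sketch the package name `myproject` and the priority `300` are placeholders of mine, not values from the original project.

```python
# settings.py
ITEM_PIPELINES = {
    # 'myproject' is a placeholder for your own project package
    'myproject.pipelines.MyImagesPipeline': 300,
}
```

The number only controls the order relative to other pipelines; lower values run first.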
