Scrapy: subclassing FilesPipeline to build a custom pipeline, renaming downloaded files, and handling URLs that have no file extension

Latest recommended article published on 2024-05-01 19:58:31

Yew1168

Category: Python

https://www.cnblogs.com/pythonClub/p/9858830.html

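A web server usually reports the type of the requested file in the Content-Type response header, so we can read that header and act on it. To do that, we first need to find the function that calls file_path: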

    def file_downloaded(self, response, request, info):
        path = self.file_path(request, response=response, info=info)
        buf = BytesIO(response.body)
        checksum = md5sum(buf)
        buf.seek(0)
        self.store.persist_file(path, buf, info)
        return checksum

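The first line of file_downloaded calls file_path, and the names make the roles clear. All we need to do is change how path is computed there.
file_downloaded handles the downloaded file, while file_path only decides where the file is stored, so inside file_downloaded the response is available and we can read information from it.
To get the file extension after a redirect, use:
response.headers.get('Content-Disposition') or response.headers.get('Content-Type'). If that returns nothing, you can try the lowercase names content-disposition or content-type. Scrapy's header lookup is case-insensitive, though, so an empty result usually means the server did not send that header.
For example, Content-Disposition may look like this:
Content-Disposition: inline;filename=Vet%20Contract%20for%20Services.pdf
Splitting the value on '=' gives the file name: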

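    from io import BytesIO
    from urllib.parse import unquote

    from scrapy.pipelines.files import FilesPipeline
    from scrapy.utils.misc import md5sum

    class RenamingFilesPipeline(FilesPipeline):  # class name is illustrative
        # item=None keeps the override compatible with Scrapy versions that pass item
        def file_downloaded(self, response, request, info, *, item=None):
            # Default path, used when no usable Content-Disposition header is present
            path = self.file_path(request, response=response, info=info)
            disposition = response.headers.get('Content-Disposition')
            if disposition and b'=' in disposition:
                # e.g. b'inline;filename=Vet%20Contract%20for%20Services.pdf'
                name = disposition.decode('gb2312', errors='replace').split('=')[1]
                path = unquote(name.strip('" '))  # drop quotes, decode %20 and similar
            buf = BytesIO(response.body)
            checksum = md5sum(buf)
            buf.seek(0)
            self.store.persist_file(path, buf, info)
            return checksum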
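
The split('=') approach above works for the simple header shown, but it can misbehave when the value is quoted or followed by other parameters. Below is a minimal sketch of a slightly more defensive parser using only the standard library. The function name filename_from_content_disposition and the default utf-8 encoding are assumptions made for illustration, and the RFC 6266 `filename*=` form is deliberately not handled.

```python
import re
from urllib.parse import unquote


def filename_from_content_disposition(value, encoding="utf-8"):
    """Return the file name found in a Content-Disposition value, or None.

    Hypothetical helper: handles only the plain `filename=` parameter.
    """
    if not value:
        return None
    if isinstance(value, bytes):
        # Scrapy returns header values as bytes
        value = value.decode(encoding, errors="replace")
    match = re.search(r'filename\s*=\s*"?([^";]+)"?', value, re.IGNORECASE)
    if not match:
        return None
    name = unquote(match.group(1).strip())
    # Keep only the last path component so a header cannot pick the directory
    name = name.replace("\\", "/").rsplit("/", 1)[-1]
    return name or None


print(filename_from_content_disposition(
    b"inline;filename=Vet%20Contract%20for%20Services.pdf"))
```

Inside file_downloaded, the result could replace the split-based expression, falling back to self.file_path(...) when the helper returns None.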

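Content-Disposition is an extension header. After processing its value with a regular expression you can extract the file extension, so it is generally best to try it first. Not every site sends it, however.
Content-Type is sent by most sites, but the type it reports may not be directly usable as an extension, which is why Content-Disposition is the recommended first choice.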
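
When only Content-Type is available, the MIME type still has to be turned into an extension. One possible way, sketched here with Python's standard mimetypes module, is shown below. The helper name extension_from_content_type is an assumption, and the mapping depends on the MIME table of the local Python installation, so generic types such as application/octet-stream may yield nothing useful.

```python
import mimetypes


def extension_from_content_type(value, default=""):
    """Map a Content-Type header value to a file extension (hypothetical helper)."""
    if not value:
        return default
    if isinstance(value, bytes):
        # Scrapy returns header values as bytes
        value = value.decode("latin-1")
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime_type = value.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type) or default


print(extension_from_content_type(b"application/pdf; charset=utf-8"))
```

The returned extension could be appended to the path computed by file_path when the URL itself carries no extension.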

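One question remains: what if the URL of the file goes through a redirect, or has no extension at all?
The server usually reports the file type in the response headers, so reading Content-Disposition or Content-Type inside file_downloaded, the function that calls file_path, covers these cases as well.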
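
For redirected URLs there is one more thing to check: Scrapy's media pipelines ignore redirects by default, and the MEDIA_ALLOW_REDIRECTS setting turns redirect handling on. A minimal settings sketch follows. The module path myproject.pipelines.RenamingFilesPipeline and the downloads directory are hypothetical placeholders, not names taken from the original post.

```python
# settings.py (sketch)

# Hypothetical path to a FilesPipeline subclass that overrides file_downloaded
ITEM_PIPELINES = {
    "myproject.pipelines.RenamingFilesPipeline": 1,
}

# Directory where downloaded files are stored (placeholder value)
FILES_STORE = "downloads"

# Let the media pipeline follow HTTP redirects instead of failing the download
MEDIA_ALLOW_REDIRECTS = True
```

With redirects allowed, the response passed to file_downloaded is the final one, so its headers describe the file that was actually downloaded.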
