Scrapy's built-in pipeline for downloading files:
settings.py: 'scrapy.pipelines.files.FilesPipeline': 1  # FilesPipeline must come before other pipelines
items.py: file_urls = scrapy.Field()  files = scrapy.Field()  # the pipeline reads file_urls and writes results into files
Scrapy's built-in pipeline for downloading images:
settings.py: 'scrapy.pipelines.images.ImagesPipeline': 1
items.py
The item must define:
image_urls = scrapy.Field()  images = scrapy.Field()  # the pipeline reads image_urls and writes results into images
But if we use the official pipelines directly, the downloaded filenames are unreadable, because each URL is hashed into a string and that hash becomes the filename. So we define our own pipeline: reading the source shows that to rename downloads we only need to override the file_path method. Images, files, and videos can all be downloaded with one such pipeline.
Difference: downloading images and files needs no extra headers, but downloading videos requires a Referer header, otherwise the server returns 403 and nothing is downloaded.
When your spider puts a list of URLs in the item, just uncomment the `# for url in item[...]` loop below the `yield Request` line in the get_media_requests method.
The download paths must be added in settings.py (not in the spider):
FILES_STORE = "C:/Users/asus/Desktop/zip/pdf"
IMAGES_STORE = "C:/Users/asus/Desktop/zip/ddd"
Downloading a video:

    # 1. Download a video
    start_urls = [
        'https://video.eastday.com/a/200712033214444291719.html',
        'http://www.bjft.gov.cn/zfxxgk/ftq11GG01/xxgkmlxz/ftbm_list.shtml',
        'http://www.bjyq.gov.cn/yanqing/zbm/zfb80/ldbz95/index.shtml'
    ]
    # basic_url = 'https://shortmvpc.eastday.com'

    def start_requests(self):
        heads = {
            'Referer': 'https://video.eastday.com/a/200712033214444291719.html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
        }
        yield scrapy.Request(url=self.start_urls[0], headers=heads,
                             callback=self.parse, meta={"type": "video"})

    def parse(self, response):
        if response.meta['type'] == 'video':
            # dont_filter=True, otherwise the dupefilter drops this repeat request
            yield scrapy.Request(response.url, callback=self.parse_video, dont_filter=True)

    def parse_video(self, response):
        item = DpdfItem()
        # Extract the video URL, wrap it in a list, and assign it to the item's file_urls field
        urls = response.xpath("//input[@id='mp4Source']")
        for url in urls:
            # file / file_urls
            item['file'] = "video"
            item['file_urls'] = ['https:' + url.xpath("./@value").extract_first()]
            yield item
The Referer can also be set in the settings file.
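For example, a settings.py sketch (the pipeline path dpdf.pipelines.MyDownloadPipeline is an assumption; adjust names and paths to your project). DEFAULT_REQUEST_HEADERS applies the Referer to every request the spider sends:

```python
# settings.py
ITEM_PIPELINES = {
    'dpdf.pipelines.MyDownloadPipeline': 1,  # before any other pipeline
}
FILES_STORE = "C:/Users/asus/Desktop/zip/pdf"

# Headers merged into every request by default:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://video.eastday.com/a/200712033214444291719.html',
}
```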