Background: official documentation — https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html?highlight=image
Three basic steps:
1. Define the image_urls and images fields in items.py.
2. Define ITEM_PIPELINES and IMAGES_STORE in settings.py:
   ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
   IMAGES_STORE = '/path/to/valid/dir'  # storage directory, e.g. IMAGES_STORE = 'data/斗鱼主播图片/'
3. In the spider, assign a list of URLs to item['image_urls'] and yield the item.
A working example for reference: https://www.cnblogs.com/pythonClub/p/9856490.html
My own hand-written example.

items.py:
import scrapy

class Imagedemo1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()  # list of image URLs to download
    images = scrapy.Field()      # download results, filled in by ImagesPipeline
spider.py (the spider file):
import scrapy
from imagedemo1.items import Imagedemo1Item

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['bdstatic.com']  # match the host that serves the image
    start_urls = ['https://gss3.bdstatic.com/7Po3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=a9a4a01c8a94a4c20e23e0293ef41bac/b64543a98226cffc613bfea1b4014a90f603ea94.jpg']

    def parse(self, response):
        item = Imagedemo1Item()
        image_url = "https://gss3.bdstatic.com/7Po3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=a9a4a01c8a94a4c20e23e0293ef41bac/b64543a98226cffc613bfea1b4014a90f603ea94.jpg"
        # image_urls must be a list, even for a single URL
        item['image_urls'] = [image_url]
        yield item
settings.py:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'data/斗鱼主播图片/'
# Name of the Item field that holds the image URLs (the image_urls field defined in XxxItem)
IMAGES_URLS_FIELD = 'image_urls'
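Beyond the required settings above, ImagesPipeline also accepts a few optional settings; a sketch (the values below are illustrative assumptions, not from the original):

```python
# Optional ImagesPipeline settings (values here are illustrative).
IMAGES_EXPIRES = 90        # skip re-downloading images fetched within the last 90 days
IMAGES_THUMBS = {          # also generate thumbnails under thumbs/<name>/
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110    # drop images smaller than this
IMAGES_MIN_WIDTH = 110
```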
A frequent source of errors: when handing image links to the built-in pipeline from the spider file, the image_urls field is expected to be iterable (a list of URLs), so even a single URL must be wrapped in a list: item['image_urls'] = [image_url].
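One way to guard against this mistake is to normalize the value before assigning it to the item; a minimal sketch (the as_url_list helper is hypothetical, not part of Scrapy):

```python
def as_url_list(value):
    """Return a list of URLs, wrapping a bare string so image_urls is always iterable."""
    if isinstance(value, str):
        return [value]
    return list(value)

# A single URL gets wrapped; an existing list passes through unchanged.
single = as_url_list("https://example.com/a.jpg")   # -> ["https://example.com/a.jpg"]
many = as_url_list(["https://example.com/a.jpg", "https://example.com/b.jpg"])
```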
Reference for writing a custom image pipeline: https://blog.csdn.net/cnmnui/article/details/99850055
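For context when writing a custom pipeline: by default, ImagesPipeline names each downloaded file after the SHA1 hash of its URL and stores it under full/ inside IMAGES_STORE. A stdlib sketch of that naming scheme (an approximation for illustration, not Scrapy's actual code):

```python
import hashlib

def default_image_path(url: str) -> str:
    # Default scheme: SHA1 hash of the URL bytes, stored as full/<hash>.jpg
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{guid}.jpg"

path = default_image_path("https://example.com/a.jpg")
```

A custom pipeline that wants human-readable filenames would typically subclass scrapy.pipelines.images.ImagesPipeline and override its file_path() method to return a different path.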