1. As the documentation says, first create an item to hold the image data. For the ImagesPipeline to take effect, the item needs a field named image_urls:
items.py
import scrapy

class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    image_paths = scrapy.Field()
    images = scrapy.Field()
2. Subclass ImagesPipeline to write your own pipeline
pipelines.py
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImageDownloadPipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
The overridden item_completed here saves the image_paths field once the downloads are complete.
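For reference, the `results` argument passed to `item_completed` is a list of 2-tuples `(success, detail)`: on success, `detail` is a dict with `url`, `path`, and `checksum` keys; on failure it describes the error. A minimal sketch of the filtering done above, using made-up sample data (the URLs, paths, and checksum are illustrative, not real downloads):

```python
# Illustrative sample of the `results` structure that ImagesPipeline
# passes to item_completed (values here are made up).
results = [
    (True, {'url': 'http://example.com/a.jpg',
            'path': 'full/0a79653e1fc68a1eaf95b1a33e5e7be4b93b9a84.jpg',
            'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'}),
    (False, Exception('download failed')),  # a failed download
]

# Keep only the storage paths of successful downloads,
# exactly as the pipeline's list comprehension does.
image_paths = [detail['path'] for ok, detail in results if ok]
print(image_paths)
```

If every download failed, `image_paths` is empty and the pipeline raises DropItem, so items without images never reach the output file.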
3. Edit settings.py to enable MyImageDownloadPipeLine
settings.py
# coding=utf-8
BOT_NAME = 'imagedemo'
SPIDER_MODULES = ['imagedemo.spiders']
NEWSPIDER_MODULE = 'imagedemo.spiders'
# Enable the image pipeline
ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}
# Directory where downloaded image files are stored
IMAGES_STORE = 'image'
ROBOTSTXT_OBEY = True
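By default, ImagesPipeline stores each image under IMAGES_STORE as full/&lt;SHA-1 of the request URL&gt;.jpg, which is why the downloads later appear in image/full/. A sketch of that naming scheme (the URL below is just an example):

```python
import hashlib

def default_image_path(url):
    """Mimic ImagesPipeline's default file naming: full/<sha1(url)>.jpg."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

print(default_image_path('http://www.hlhua.com/example.jpg'))
```

This also means re-crawling the same URL maps to the same file name, so images are not duplicated on disk.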
4. Write the spider to implement the crawl logic
spider.py
# coding=utf-8
from scrapy.spiders import Spider
from imagedemo.items import MyItem

class ImageSpider(Spider):
    name = 'hlhua'
    start_urls = ['http://www.hlhua.com/']

    def parse(self, response):
        # inspect_response(response, self)
        images = []
        for each in response.xpath("//img[@class='goodsimg']/@src").extract():
            m = MyItem()
            # if src can be relative, use response.urljoin(each) instead
            m['image_urls'] = [each, ]
            images.append(m)
        return images
Run scrapy crawl hlhua -o images.json: the images are downloaded into image/full/, and images.json is generated to record the image metadata.