1. First, define a special Item
import scrapy

class CSDNImgItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Note that these field names are fixed: image_urls must hold a list of image URLs, and images is filled in with the download results, so you don't need to touch it.
2. Yield the item
image_urls = response.css('#cnblogs_post_body img::attr(src)').extract()
if len(image_urls) > 0:
    imageItem = CSDNImgItem()
    imageItem['image_urls'] = image_urls
    yield imageItem
That is all it takes; just remember that image_urls is a list.
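To see what the `::attr(src)` query collects without running a crawl, here is a standard-library stand-in for the Scrapy selector (a sketch only; the HTML snippet and URLs are made up):

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, a rough
    stand-in for the CSS query '#cnblogs_post_body img::attr(src)'."""
    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.image_urls.append(src)

# hypothetical post body
html = ('<div id="cnblogs_post_body"><p>text</p>'
        '<img src="http://example.com/a/123456.jpg">'
        '<img src="http://example.com/b/789.png"></div>')
parser = ImgSrcCollector()
parser.feed(html)
print(parser.image_urls)
```

The resulting list is exactly what gets assigned to imageItem['image_urls'] above.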
3. The image download pipeline
import datetime

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class CSDNImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if 'image_urls' in item.keys():
            for image_url in item['image_urls']:
                # use everything after the last '/' as the file name
                last_name = image_url[image_url.rfind('/') + 1:]
                yield Request(image_url, meta={'name': last_name})

    def file_path(self, request, response=None, info=None):
        today = datetime.datetime.now().strftime('%Y%m%d')
        # strip the original extension, if there is one
        if '.' in request.meta['name']:
            name = request.meta['name'][:request.meta['name'].rindex('.')]
        else:
            name = request.meta['name']
        return "%s/%s.jpg" % (today, name)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        # if not image_paths:
        #     raise DropItem("Item contains no images")
        return item
Here I do not raise DropItem for items without images, because some items may carry only text. I also override file_path so that files are named in the form 201804/123456.jpg.
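The naming rule in file_path() can be checked in isolation. This is a pure-function sketch of the same logic (image_file_path is a hypothetical helper, not part of Scrapy):

```python
import datetime

def image_file_path(url_basename, today=None):
    """Mirror of the overridden file_path(): strip any extension
    from the name carried in request.meta and save the file as
    <YYYYMMDD>/<name>.jpg."""
    if today is None:
        today = datetime.datetime.now().strftime('%Y%m%d')
    if '.' in url_basename:
        name = url_basename[:url_basename.rindex('.')]
    else:
        name = url_basename
    return "%s/%s.jpg" % (today, name)

print(image_file_path('123456.png', today='20180401'))  # 20180401/123456.jpg
print(image_file_path('noext', today='20180401'))       # 20180401/noext.jpg
```

Note that the extension is always forced to .jpg regardless of the original format, which matches the behavior of the pipeline above.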
4. Settings for image downloading
ITEM_PIPELINES = {
    'tutorial.pipelines.CSDNImgPipeline': 400,
}
IMAGES_STORE = '/Users/walle/PycharmProjects/imgSave/img'
Just register the pipeline and set the directory where images are saved.
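Putting the two settings together: Scrapy stores each image under IMAGES_STORE at the relative path returned by file_path(), so the final location can be sketched like this (the date is made up):

```python
import os

# values matching the settings above; the relative path is what
# the overridden file_path() would return for a hypothetical download
IMAGES_STORE = '/Users/walle/PycharmProjects/imgSave/img'
relative_path = '20180401/123456.jpg'

full_path = os.path.join(IMAGES_STORE, relative_path)
print(full_path)  # /Users/walle/PycharmProjects/imgSave/img/20180401/123456.jpg
```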
For basic scraping with Scrapy, see the article 《使用scrapy 把爬到的数据保存到mysql 防止重复》 ("Saving scraped data to MySQL with Scrapy while avoiding duplicates").