Downloading Images with the Python Crawler Framework Scrapy

As the documentation explains, first create an item to hold the image data. For the ImagesPipeline to take effect, this item must have a field named image_urls:

items.py

import scrapy

class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    image_paths = scrapy.Field()
    images = scrapy.Field()

Next, subclass ImagesPipeline to write your own pipeline:

pipeline.py

import scrapy

from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImageDownloadPipeLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

The item_completed method overridden here stores the saved file paths in the image_paths field once the downloads complete, and drops the item if nothing was downloaded.
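For reference, results is a list of two-element tuples, one per request yielded by get_media_requests. The URLs, path, and checksum below are made-up placeholders illustrating the shape of each entry:

```python
# Sketch of the `results` argument received by item_completed: a list of
# (success, info) tuples. All values below are illustrative placeholders.
results = [
    (True, {'url': 'http://www.hlhua.com/img/1.jpg',
            'path': 'full/0a1b2c3d4e5f.jpg',
            'checksum': 'd41d8cd98f00b204e9800998ecf8427e'}),
    (False, Exception('download failed')),  # a failed download carries the failure
]

# The same comprehension as in the pipeline above: keep only successful paths.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/0a1b2c3d4e5f.jpg']
```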

Edit settings.py to enable MyImageDownloadPipeLine

settings.py

# coding=utf-8

BOT_NAME = 'imagedemo'

SPIDER_MODULES = ['imagedemo.spiders']
NEWSPIDER_MODULE = 'imagedemo.spiders'

# Enable the image pipeline
ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}

# Directory where downloaded image files are saved
IMAGES_STORE = 'image'

ROBOTSTXT_OBEY = True
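Beyond IMAGES_STORE, the ImagesPipeline also honors a few optional settings for expiry, size filtering, and thumbnails; the values below are illustrative, not required:

```python
# Optional ImagesPipeline settings (values here are just examples):
IMAGES_EXPIRES = 90          # skip re-downloading files fetched within 90 days
IMAGES_MIN_WIDTH = 100       # discard images narrower than 100 px
IMAGES_MIN_HEIGHT = 100      # discard images shorter than 100 px
IMAGES_THUMBS = {            # also save thumbnails under thumbs/<name>/
    'small': (50, 50),
    'big': (270, 270),
}
```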

Finally, write the spider that implements the crawling logic:

spider.py

# coding=utf-8

from scrapy.spiders import Spider

from imagedemo.items import MyItem

class ImageSpider(Spider):

    name = 'hlhua'
    start_urls = ['http://www.hlhua.com/']

    def parse(self, response):
        # inspect_response(response, self)
        images = []
        for each in response.xpath("//img[@class='goodsimg']/@src").extract():
            m = MyItem()
            m['image_urls'] = [each]
            images.append(m)
        return images
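The XPath in parse() selects the src attribute of every img element whose class is goodsimg. A stdlib-only sketch of the same matching logic (Scrapy uses its own selectors; the HTML below is a made-up sample):

```python
# Demonstrate what //img[@class='goodsimg']/@src matches, using only the
# standard library. ElementTree supports this simple attribute predicate.
import xml.etree.ElementTree as ET

html = """<html><body>
<img class="goodsimg" src="/img/a.jpg"/>
<img class="banner" src="/img/b.jpg"/>
<img class="goodsimg" src="/img/c.jpg"/>
</body></html>"""

root = ET.fromstring(html)
srcs = [img.get('src') for img in root.findall(".//img[@class='goodsimg']")]
print(srcs)  # ['/img/a.jpg', '/img/c.jpg']
```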

Run scrapy crawl hlhua -o images.json: the images are downloaded into image/full/, and images.json records the image metadata.
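The files under image/full/ are named after the SHA1 hash of the image URL, which is how the pipeline recognizes and skips already-downloaded images. A minimal sketch of the default naming scheme, using a hypothetical URL:

```python
# Default ImagesPipeline file naming: full/<SHA1 of the image URL>.jpg.
# The URL below is a hypothetical example.
import hashlib

url = 'http://www.hlhua.com/goods/1.jpg'
path = 'full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()
print(path)
```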
