Web Scraping Basics 5-2: Downloading Images with the Scrapy Framework

scrapy startproject bmw

cd bmw

scrapy genspider bmw5 'autohome.com.cn'

Method 1: without using ImagesPipeline

bmw5.py:

import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        # Skip the first uibox, which is not an image category
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            # The src attributes are not absolute URLs; urljoin fixes them up
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, urls=urls)
            yield item
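The `response.urljoin` call is needed because the gallery's `img/@src` values are typically protocol-relative (they start with `//`). A stdlib-only sketch of the same join, using the spider's start URL as the base and a made-up image path:

```python
from urllib.parse import urljoin

# response.urljoin() delegates to urllib.parse.urljoin with response.url
# as the base. The image path below is a hypothetical protocol-relative
# src in the style the gallery page serves.
base = 'https://car.autohome.com.cn/pic/series/65.html'
src = '//car3.autohome.com.cn/cardfs/product/g30/example_t.jpg'

print(urljoin(base, src))
# -> https://car3.autohome.com.cn/cardfs/product/g30/example_t.jpg
```

A protocol-relative URL keeps the base URL's scheme (`https`) but replaces the host and path, which is exactly what the browser does with such `src` attributes.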

items.py:

import scrapy


class BmwItem(scrapy.Item):
    category = scrapy.Field()  # image category name
    urls = scrapy.Field()      # list of absolute image URLs

Relevant part of settings.py:

ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}

pipelines.py:

import os
from urllib import request


class BmwPipeline(object):
    def __init__(self):
        # Create an images/ directory next to this file
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        # One sub-directory per category
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            # Use the part after the last underscore as the file name
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
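The `image_name = url.split('_')[-1]` line keeps only the part after the last underscore. A quick illustration, with a made-up URL in the style of autohome image links:

```python
# Hypothetical URL; the real file-name convention on the site may differ.
url = 'https://car3.autohome.com.cn/cardfs/product/g30/autohomecar_ChsEe1.jpg'
image_name = url.split('_')[-1]
print(image_name)  # -> ChsEe1.jpg
```

Note that if a URL happens to contain no underscore, `split('_')[-1]` returns the whole URL, so a more defensive version might fall back to `os.path.basename(url)`.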

Method 2: saving images via ImagesPipeline

Steps:

1. Define an Item with two fields: image_urls and images. image_urls holds the URLs of the images to download and must be a list.
2. When a download finishes, Scrapy stores information about it (local path, source URL, image checksum, etc.) in the item's images field.
3. In settings.py, configure IMAGES_STORE, the directory where downloaded images are saved, and IMAGES_URLS_FIELD, the name of the item field that holds the image URLs.
(Note: this is essential; without it the image folder stays empty.)
4. Enable the pipeline by adding 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.
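If you stick to Scrapy's default field names, the four steps above need no custom pipeline class at all. A minimal sketch, assuming a hypothetical CarItem (this post instead keeps a custom urls field and points IMAGES_URLS_FIELD at it):

```python
# items.py -- the default field names ImagesPipeline looks for
import scrapy

class CarItem(scrapy.Item):      # hypothetical item name
    image_urls = scrapy.Field()  # list of image URLs to download
    images = scrapy.Field()      # populated by the pipeline after download

# settings.py
IMAGES_STORE = 'images'  # directory where downloaded files are stored
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
```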

Rewritten pipelines.py:

import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):  # subclass ImagesPipeline
    def get_media_requests(self, item, info):
        # Called before the download requests are sent; the parent method
        # builds one Request per URL. Attach the item to each request so
        # that file_path() can read the category later.
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            request_object.item = item
        return request_objects

    def file_path(self, request, response=None, info=None):
        # Called when an image is about to be stored; returns the storage path
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):  # create the category folder if needed
            os.mkdir(category_path)
        image_name = path.replace('full/', '')  # drop the default 'full/' prefix
        image_path = os.path.join(category_path, image_name)
        return image_path
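The `path.replace('full/', '')` step works because, in recent Scrapy versions, the default `ImagesPipeline.file_path()` returns a path of the form `full/<sha1-of-url>.jpg`; the subclass strips the `full/` prefix and re-roots the file under a category directory. Reproduced here with the stdlib only:

```python
import hashlib

# Made-up URL; mirrors what the default file_path() computes, assuming the
# 'full/<sha1>.jpg' scheme used by recent Scrapy releases.
url = 'https://car3.autohome.com.cn/cardfs/product/g30/example.jpg'
default_path = 'full/%s.jpg' % hashlib.sha1(url.encode()).hexdigest()
image_name = default_path.replace('full/', '')
print(image_name)  # a 40-hex-digit name ending in .jpg
```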

Rewritten pipelines.py (alternative variant):

from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request


class MyCarImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['urls']:  # the item's urls field holds a list
            yield Request(url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        path = super(MyCarImagePipeline, self).file_path(request, response, info)
        item = request.meta['item']
        category = item['category']
        image_name = path.replace('full/', '')  # default name is a hash, already ending in .jpg
        # image_name = item['img_name'][0]  # use this instead if the item carries a name field
        return './%s/%s' % (category, image_name)

 

Rewritten settings.py:

import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'imgs')
IMAGES_URLS_FIELD = 'urls'

ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1,
}
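The IMAGES_STORE expression above climbs two directory levels from bmw/settings.py to the project root, then appends 'imgs'. A small sketch with a hypothetical path standing in for `__file__`:

```python
import os

# '/home/user/bmw/bmw/settings.py' is a made-up stand-in for __file__.
settings_file = '/home/user/bmw/bmw/settings.py'
store = os.path.join(os.path.dirname(os.path.dirname(settings_file)), 'imgs')
print(store)  # -> /home/user/bmw/imgs
```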
 
 

To run Scrapy from PyCharm, create a start.py in the project directory:

from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'bmw5'])
Reposted from: https://www.cnblogs.com/min-R/p/10545408.html
