Phone Wallpapers: CrawlSpider with Image Pipeline Storage

TecFy

Published 2021-08-04 14:03:10


Article link: https://blog.csdn.net/qq_53859679/article/details/119382459

Target URL: https://www.3gbizhi.com/sjbz/index_1.html
Analysis:
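- Create the project's spider from the CrawlSpider template
- Use a link extractor to collect the URLs of all images on a single page
- LinkExtractor(restrict_xpaths='/html/body/div/ul/li/a')
- This is XPath syntax; in a CrawlSpider, restrict_xpaths only has to point at the node that holds the link (the <a> tag or its parent), and the extractor reads the URL from it
- Use a second link extractor for pagination
- link_next = LinkExtractor(restrict_xpaths='//a[text()="下一页"]')
- This matches the link by its text with XPath's text() function, so the crawl turns pages by following the "下一页" (next page) link.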

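- Two link extractors are used: one for the image URLs and one for the pagination URLs
- Use rules to parse out each image's URL and name
    - Two rules are used
        - Rule for the image pages
            - Rule(link_img, callback='parse_item'),
                - callback names the parsing method; parse_item is the one the CrawlSpider template provides
        - Rule for pagination
            - Rule(link_next, follow=True)
                - This rule only turns pages, and every page is handled in the same way, so no parsing method is specified.
                - follow=True means the next-page link keeps being followed until there are no more pages.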
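- Use an image pipeline to save the image data
    - Import the modules
        - import scrapy
        - from scrapy.pipelines.images import ImagesPipeline
    - Define a custom image pipeline class that inherits from ImagesPipeline
        - class Images3GProPipeline(ImagesPipeline):
            # Override the parent method that issues the requests
            def get_media_requests(self, item, info):
                # Build a Request for the image URL and hand it to the pipeline
                yield scrapy.Request(item['img_src'])
                # No callback is needed; just request the URL carried by the incoming item
            # Override the method that decides the save path
            def file_path(self, request, response=None, info=None, *, item=None):
                # Name the image from the item's title field; the item does not have to be passed along by hand, because this method already receives it
                img_title = item['img_title']
                # Save path
                return f'imgs3G/{img_title}.jpg'
            # To keep passing the item on, override the parent method that returns it
            def item_completed(self, results, item, info):
                return item  # return it as is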
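
The results argument of item_completed is a list of (success, info) two-tuples, one per request yielded from get_media_requests. As a minimal sketch, the hypothetical helper downloaded_paths below (not part of the original project) picks out the storage paths of the downloads that succeeded. The sample data is made up, and a plain Exception stands in for the failure object Scrapy would pass.

```python
def downloaded_paths(results):
    """Return the storage paths of the downloads that succeeded.

    `results` has the shape Scrapy passes to item_completed:
    a list of (success, info) tuples, where info is a dict with a
    'path' key on success and a failure object otherwise.
    """
    return [info["path"] for ok, info in results if ok]


# Hypothetical sample data in that shape.
sample_results = [
    (True, {"url": "https://example.com/a.jpg",
            "path": "imgs3G/a.jpg",
            "checksum": "abc"}),
    (False, Exception("download failed")),  # stand-in for a failure object
]

paths = downloaded_paths(sample_results)
```

Inside item_completed, such a helper could be used to drop or log items whose image did not download before returning the item.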

The code

01 - img3g spider (.py)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import Imgpro3GItem

class Img3gSpider(CrawlSpider):
    name = 'img3g'
    # allowed_domains = ['xxxx.com']
    start_urls = ['https://www.3gbizhi.com/sjbz/index_1.html']
    # Extract the link to each image's detail page
    link_img = LinkExtractor(restrict_xpaths='/html/body/div/ul/li/a')
    # Pagination link
    link_next = LinkExtractor(restrict_xpaths='//a[text()="下一页"]')

    rules = (
        Rule(link_img, callback='parse_item'),
        Rule(link_next, follow=True),
    )

    def parse_item(self, response):
        """Parse the image's src attribute and its title from a detail page."""
        # print(response)
        # Image URL
        img_src = response.xpath('//*[@id="showimg"]/a[4]/img/@src').get()
        # print(img_src)
        # Image title
        img_title = response.xpath('//*[@id="showimg"]/a[4]/img/@alt').get()
        # print(img_title)
        item = Imgpro3GItem()
        item['img_src'] = img_src
        item['img_title'] = img_title
        yield item
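
Note that .get() returns None when the XPath matches nothing, and a src attribute may be a relative URL. The sketch below shows one way to guard against both before building the item. The helper name build_image_record and the "untitled" fallback are assumptions for illustration, and whether the target site ever serves relative src values has not been checked.

```python
from urllib.parse import urljoin


def build_image_record(page_url, img_src, img_title):
    """Return a dict with an absolute image URL, or None if no src was found."""
    if not img_src:
        # The XPath matched nothing; skip this page instead of
        # passing None on to the image pipeline.
        return None
    return {
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        "img_src": urljoin(page_url, img_src),
        "img_title": (img_title or "untitled").strip(),
    }


record = build_image_record(
    "https://www.3gbizhi.com/sjbz/1001.html",  # hypothetical detail page
    "/uploads/a.jpg",                          # hypothetical relative src
    " sample title ",
)
```

In parse_item the same check could be done with response.urljoin(img_src), yielding the item only when img_src is not None.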

02 - items.py

import scrapy


class Imgpro3GItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_src = scrapy.Field()
    img_title = scrapy.Field()

03 - settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    # Set this according to your own session
    'cookie':'Hm_lvt_c8263f264e5db13b29b03baeb1840f60=1627806968,1627975248; __utma=176174951.306872594.1627978851.1627978851.1627978851.1; __utmc=176174951; __utmz=176174951.1627978851.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; __utmb=176174951.5.10.1627978851; wzws_cid=288eb72637132664b3aad3fdee8f42660af187c064d6c24946be01593cd27014caa9a8ff68149136a1a91fbea25639b940a6ca03de3e4bc6626e8c3731bcbb63613cec5f985deafb0af3d815e82c1702; Hm_lpvt_c8263f264e5db13b29b03baeb1840f60=1627978891'

}
ITEM_PIPELINES = {
    # 'ImgPro3G.pipelines.Imgpro3GPipeline': 300,
    'ImgPro3G.pipelines.Imgepro3GPiplines': 301,
}
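
One setting is not shown in the listing above: ImagesPipeline (and any subclass of it) only runs when IMAGES_STORE is set, and it also needs the Pillow package installed. A minimal sketch of the extra line follows; the directory value is an assumption, so choose whatever location suits the project.

```python
# settings.py (additional line)
# Root directory for downloaded images. The value "./" is an assumption
# for this sketch; the path returned by file_path ("imgs3G/<title>.jpg")
# is created underneath this directory.
IMAGES_STORE = "./"
```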

04 - pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline


# class Imgpro3GPipeline:
#     def process_item(self, item, spider):
#         print(item)
#         return item

# Image pipeline class that stores the images
class Imgepro3GPiplines(ImagesPipeline):
    # Send a request for the image URL
    def get_media_requests(self, item, info):
        # Build the Request and hand it on to the pipeline; no callback is needed
        yield scrapy.Request(item['img_src'])

    # Change the save path
    def file_path(self, request, response=None, info=None, *, item=None):
        img_title = item['img_title']
        return f'imgs3G/{img_title}.jpg'
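
file_path uses the title directly as the file name, so a title that is missing or that contains characters such as / or : would give an unexpected path. The sketch below shows one possible cleanup step. The helper safe_image_name and its fallback name are hypothetical and not part of the original project.

```python
import re


def safe_image_name(title, fallback="untitled"):
    """Build the relative save path from a title, replacing risky characters."""
    # Collapse path separators, characters Windows forbids in file names,
    # and whitespace into single underscores.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title or "").strip("_")
    return f"imgs3G/{cleaned or fallback}.jpg"


example_path = safe_image_name("a/b: c")
```

In the pipeline, file_path could then return safe_image_name(item['img_title']) instead of formatting the title directly. Two different titles can still clean to the same name, in which case the later image overwrites the earlier one.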
