Crawling Meitulu Galleries with Scrapy and Storing Them Locally by Model Name (Part 3)

1. Preface

Building on the previous two posts, this time we complete the gallery crawl with the Scrapy framework, making full use of Scrapy's built-in selectors for page parsing.

2. Create the project with Scrapy's commands

scrapy startproject meitulu
cd meitulu
scrapy genspider image www.meitulu.com
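
These two commands generate Scrapy's standard project skeleton, which should look roughly like this (only the files touched in the following steps are shown):

meitulu/
    scrapy.cfg
    meitulu/
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            image.py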

3. Since the spider is launched with scrapy crawl image, create a main.py file in the top-level meitulu directory with the following code:

from scrapy import cmdline
cmdline.execute('scrapy crawl image'.split())  # split into the argv form cmdline.execute expects
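
This is equivalent to typing scrapy crawl image in the project root; the wrapper simply makes the spider easy to launch and debug from an IDE.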

4. Define the fields in items.py as follows:

import scrapy

class MeituluItem(scrapy.Item):
    tag = scrapy.Field()      # tags
    name = scrapy.Field()     # model name
    id = scrapy.Field()       # gallery id
    img_url = scrapy.Field()  # image URLs, a list

5. Write the main spider code in image.py:

# -*- coding: utf-8 -*-
import re

import scrapy

from ..items import MeituluItem


class ImageSpider(scrapy.Spider):
    name = 'image'
    allowed_domains = ['www.meitulu.com']
    start_urls = ['https://www.meitulu.com/']

    def parse(self, response):
        url_pattern = 'https://mtl.ttsqgs.com/images/img/{0}/{1}.jpg'  # image URL template
        # Gallery ids taken from links such as
        # https://www.meitulu.com/item/14224.html -> '14224'
        ids = [re.search(r'(\d+)', url).group(1)
               for url in response.css('ul.img li > a::attr(href)').extract()]
        # Tag list of each gallery (4th <p> inside the <li>)
        tags = [p.css('a::text').extract() for p in response.css('ul.img li p:nth-child(4)')]
        # Model name (3rd <p>); fall back to the bare text and strip the '模特:' prefix when there is no <a>
        names = [p.css('a::text').extract()[0] if p.css('a::text').extract()
                 else p.css('::text').extract()[0].strip('模特:')
                 for p in response.css('ul.img li p:nth-child(3)')]
        # Number of images in each gallery (2nd <p>)
        totals = [int(re.search(r'(\d+)', text).group(1))
                  for text in response.css('ul.img li p:nth-child(2)::text').extract()]
        for id, tag, name, total in zip(ids, tags, names, totals):
            item = MeituluItem()
            item['name'] = name
            item['id'] = id
            item['tag'] = tag
            item['img_url'] = [url_pattern.format(id, str(i)) for i in range(1, total + 1)]  # build the image URLs
            yield item

The yielded item looks like this:

{'id': '17475',
 'img_url': ['https://mtl.ttsqgs.com/images/img/17475/1.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/2.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/3.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/4.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/5.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/6.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/7.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/8.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/9.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/10.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/11.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/12.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/13.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/14.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/15.jpg',
             'https://mtl.ttsqgs.com/images/img/17475/16.jpg'],
 'name': '益坂美亚',
 'tag': ['优美', '沙滩']}
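
As a quick sanity check of the CSS selectors used in parse, you can run them against a hand-written fragment that mimics the site's list markup (the HTML below is a hypothetical reconstruction, not copied from the live page):

from scrapy import Selector

# Hypothetical markup mirroring one <li> of the gallery list
html = '''
<ul class="img">
  <li>
    <a href="https://www.meitulu.com/item/17475.html"><img src="cover.jpg"/></a>
    <p>图片数量: 16 张</p>
    <p>模特:<a href="#">益坂美亚</a></p>
    <p>标签:<a href="#">优美</a><a href="#">沙滩</a></p>
  </li>
</ul>'''

sel = Selector(text=html)
print(sel.css('ul.img li > a::attr(href)').extract())         # ['https://www.meitulu.com/item/17475.html']
print(sel.css('ul.img li p:nth-child(2)::text').extract())    # ['图片数量: 16 张']
print(sel.css('ul.img li p:nth-child(3) a::text').extract())  # ['益坂美亚']
print(sel.css('ul.img li p:nth-child(4) a::text').extract())  # ['优美', '沙滩']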

6. Write the code in pipelines.py. Since we need a custom storage path, two methods of ImagesPipeline have to be overridden, as follows:

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class MeiTuLuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Values needed later to build the image folder name
        meta = {'name': item['name'], 'tag': item['tag']}
        # Be sure to pass meta by keyword here, or the requests will fail
        return [Request(url=x, meta=meta) for x in item.get('img_url', [])]

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]  # last URL segment, e.g. 1.jpg
        # Read back the values passed through meta above
        name = request.meta['name']
        tag = '、'.join(request.meta['tag'])
        dir_name = '姓名:' + name + '---' + '标签:' + tag
        # Build the relative file path: <folder>/<image name>
        filename = '{0}/{1}'.format(dir_name, image_name)
        return filename
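
To see what file_path produces, the same logic can be exercised on a stand-alone Request (a minimal sketch; the final location on disk is this relative path joined onto IMAGES_STORE from step 7):

from scrapy.http import Request

req = Request(url='https://mtl.ttsqgs.com/images/img/17475/1.jpg',
              meta={'name': '益坂美亚', 'tag': ['优美', '沙滩']})
image_name = req.url.split('/')[-1]  # '1.jpg'
dir_name = '姓名:' + req.meta['name'] + '---' + '标签:' + '、'.join(req.meta['tag'])
print('{0}/{1}'.format(dir_name, image_name))
# 姓名:益坂美亚---标签:优美、沙滩/1.jpg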

7. Edit settings.py as follows:

import os

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'meitulu.middlewares.MeituluDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'meitulu.pipelines.MeiTuLuPipeline': 300,  # the pipeline that downloads the images
}
project_dir = os.path.abspath(os.path.dirname(__file__))
# IMAGES_STORE is recognized by Scrapy itself, as the ImagesPipeline source shows
IMAGES_STORE = os.path.join(project_dir, 'images')

8. Last but not least, a crucial request header must be set in the downloader middleware, otherwise the images will not download. Change process_request in MeituluDownloaderMiddleware as follows:

def process_request(self, request, spider):
    # The site checks the Referer header; without it the image downloads fail
    request.headers['Referer'] = 'https://www.meitulu.com'
    return None
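
If you would rather not touch the middleware, an equivalent approach (a sketch, not what this post uses) is to attach the Referer to each image request directly in get_media_requests, since Request accepts a headers argument:

def get_media_requests(self, item, info):
    meta = {'name': item['name'], 'tag': item['tag']}
    # Setting the header per request replaces the middleware change above
    return [Request(url=x, meta=meta,
                    headers={'Referer': 'https://www.meitulu.com'})
            for x in item.get('img_url', [])]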

9. Now run main.py and watch the output:

When the console shows messages like "Download file from …", the downloads are succeeding, and many new per-model folders appear under the images directory in the project (screenshots removed for moderation reasons).
