1. Introduction
Building on the previous two posts, this time we use the Scrapy framework to crawl the photo galleries, making full use of Scrapy's built-in selectors for page parsing.
2. Create a new project with the Scrapy commands:
scrapy startproject meitulu
cd meitulu
scrapy genspider image www.meitulu.com
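For reference, the scaffold these two commands generate looks roughly like this (the spider file name follows the genspider argument):

meitulu/
├── scrapy.cfg
└── meitulu/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── image.py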
3. Since the spider is started with scrapy crawl image, create a main.py file in the top-level meitulu directory (next to scrapy.cfg) so the crawl can be launched from an IDE. The code is:
from scrapy import cmdline
cmdline.execute('scrapy crawl image'.split())  # split into the argv list that cmdline.execute expects
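Running main.py is now equivalent to typing scrapy crawl image in a terminal; just make sure the working directory is the project root, since Scrapy locates the project by searching upward for scrapy.cfg.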
4. Define the item fields in items.py:
import scrapy

class MeituluItem(scrapy.Item):
    tag = scrapy.Field()      # tags
    name = scrapy.Field()     # model name
    id = scrapy.Field()       # gallery id
    img_url = scrapy.Field()  # image urls, a list
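Scrapy items behave like dicts, which is why the spider below can simply assign item['name'] = name. A quick illustration, reusing values from the sample output further down:

item = MeituluItem(name='益坂美亚', id='17475')
item['tag'] = ['优美', '沙滩']
print(dict(item))  # prints the item as a plain dict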
5. Write the main spider code in image.py:
# -*- coding: utf-8 -*-
import re

import scrapy

from ..items import MeituluItem


class ImageSpider(scrapy.Spider):
    name = 'image'
    allowed_domains = ['www.meitulu.com']
    start_urls = ['https://www.meitulu.com/']

    def parse(self, response):
        url_pattern = 'https://mtl.ttsqgs.com/images/img/{0}/{1}.jpg'  # image url template
        # gallery links look like: ['https://www.meitulu.com/item/14224.html', 'https://www.meitulu.com/item/5223.html', ...]
        ids = [re.search(r'(\d+)', url).group(1)
               for url in response.css('ul.img li > a::attr(href)').extract()]
        tags = [p.css('a::text').extract() for p in response.css('ul.img li p:nth-child(4)')]
        # the model name is either a link or plain text prefixed with '模特:'
        names = [p.css('a::text').extract()[0] if p.css('a::text').extract()
                 else p.css('::text').extract()[0].strip('模特:')
                 for p in response.css('ul.img li p:nth-child(3)')]
        totals = [int(re.search(r'(\d+)', item).group(1))
                  for item in response.css('ul.img li p:nth-child(2)::text').extract()]
        for id, tag, name, total in zip(ids, tags, names, totals):
            item = MeituluItem()
            item['name'] = name
            item['id'] = id
            item['tag'] = tag
            item['img_url'] = [url_pattern.format(id, str(i)) for i in range(1, total + 1)]  # build the image urls
            yield item
A returned item looks like this:
{'img_url': ['https://mtl.ttsqgs.com/images/img/17475/1.jpg',
'https://mtl.ttsqgs.com/images/img/17475/2.jpg',
'https://mtl.ttsqgs.com/images/img/17475/3.jpg',
'https://mtl.ttsqgs.com/images/img/17475/4.jpg',
'https://mtl.ttsqgs.com/images/img/17475/5.jpg',
'https://mtl.ttsqgs.com/images/img/17475/6.jpg',
'https://mtl.ttsqgs.com/images/img/17475/7.jpg',
'https://mtl.ttsqgs.com/images/img/17475/8.jpg',
'https://mtl.ttsqgs.com/images/img/17475/9.jpg',
'https://mtl.ttsqgs.com/images/img/17475/10.jpg',
'https://mtl.ttsqgs.com/images/img/17475/11.jpg',
'https://mtl.ttsqgs.com/images/img/17475/12.jpg',
'https://mtl.ttsqgs.com/images/img/17475/13.jpg',
'https://mtl.ttsqgs.com/images/img/17475/14.jpg',
'https://mtl.ttsqgs.com/images/img/17475/15.jpg',
'https://mtl.ttsqgs.com/images/img/17475/16.jpg'],
'name': '益坂美亚',
'tag': ['优美', '沙滩'],
'id': '17475'}
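If the CSS selectors in parse() look opaque, they are easy to test interactively with scrapy shell before committing them to the spider. A quick session might look like this (live output will differ as the front page changes):

scrapy shell https://www.meitulu.com/
>>> response.css('ul.img li > a::attr(href)').extract()[:2]
['https://www.meitulu.com/item/14224.html', 'https://www.meitulu.com/item/5223.html']
>>> response.css('ul.img li p:nth-child(4)')[0].css('a::text').extract()
['优美', '沙滩']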
6. Write the code in pipelines.py. Since we need a custom storage path, two methods of ImagesPipeline have to be overridden:
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class MeiTuLuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        meta = {'name': item['name'], 'tag': item['tag']}  # values needed to build the folder name
        # pass url and meta as keyword arguments, otherwise an error is raised
        return [Request(url=x, meta=meta) for x in item.get('img_url', [])]

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]  # last url segment, e.g. 1.jpg
        # read the name and tags passed in through meta above
        name = request.meta['name']
        tag = '、'.join(request.meta['tag'])
        folder = '姓名:' + name + '---' + '标签:' + tag
        # build the relative path under IMAGES_STORE
        filename = '{0}/{1}'.format(folder, image_name)
        return filename
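If you also want to discard items whose images all failed to download, ImagesPipeline offers a third hook, item_completed, that can be overridden too. A minimal sketch, assuming it is added to the MeiTuLuPipeline class above (optional, not part of the original pipeline):

from scrapy.exceptions import DropItem

class MeiTuLuPipeline(ImagesPipeline):
    # ... get_media_requests and file_path as above ...

    def item_completed(self, results, item, info):
        # results holds one (success, detail) tuple per image request
        if not any(ok for ok, detail in results):
            raise DropItem('no image downloaded for ' + item['name'])
        return item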
7. Edit settings.py as follows:
import os  # needed for IMAGES_STORE below

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'meitulu.middlewares.MeituluDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'meitulu.pipelines.MeiTuLuPipeline': 300,  # this pipeline does the actual image downloading
}
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # this setting is recognized by Scrapy; see the ImagesPipeline source
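One practical note: ImagesPipeline needs the Pillow library for image processing and will refuse to start without it, so install it before the first run:

pip install Pillow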
8. Last but not least, one crucial header has to be set in the downloader middleware, otherwise the image server rejects the downloads: it checks the Referer header as hotlink protection. Change process_request in MeituluDownloaderMiddleware as follows:
def process_request(self, request, spider):
    # make every request, including the image requests, look like it comes from the site itself
    request.headers['Referer'] = 'https://www.meitulu.com'
    return None
9. Run main.py to start the crawl. When the console prints 'Downloaded file from …' the downloads are working, and the images folder under the project directory fills up with one subfolder per gallery (screenshots omitted here for content-moderation reasons).