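I revisited the image-crawling pipeline and, along the way, discovered a new use for the `-t crawl` template, so I'm writing it down. The site being crawled is enterdesk.
The new use is mainly about rules. I had always assumed a rule could only handle the current page and could not extract links one or two levels further down. Still too young, sometimes naive, heh.
Rules can in fact crawl lower-level pages, but I recommend putting the rule for the top-level page last, the second-level page before it, and the lowest-level page first, i.e. in reverse order, as in the spider file below.
Here we only need to extract the url field on the lowest-level page and hand it to the images pipeline, so a single parse function is enough.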
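Why the order can matter: when a link matches more than one rule, CrawlSpider uses the first matching rule in `rules`. The sketch below is a simplified model of that behaviour written with plain `re`, not Scrapy itself. The three patterns come from the spider file (with the dot before `html` escaped); the labels, the sample URL, and the extra "broad" rule are made up for illustration.

```python
import re

# Patterns taken from the spider's rules; the labels are only for this sketch.
RULES = [
    ("download", r"//www\.enterdesk\.com/download/\d+-\d+/"),
    ("detail", r"/bizhi/\d+-\d+\.html"),
    ("listing", r"https://www\.enterdesk\.com/bizhi/\d+\.html"),
]

# Hypothetical catch-all rule, used to show what happens when it comes first.
BROAD = ("broad", r"/bizhi/")


def first_matching_rule(url, rules):
    """Return the label of the first rule whose pattern is found in the URL."""
    for label, pattern in rules:
        if re.search(pattern, url):
            return label
    return None


# Hypothetical URL shaped like the detail-page pattern.
detail_url = "https://www.enterdesk.com/bizhi/12345-678901.html"

specific_first = first_matching_rule(detail_url, RULES + [BROAD])
broad_first = first_matching_rule(detail_url, [BROAD] + RULES)
print(specific_first, broad_first)
```

With the specific rules first, the link goes to the detail rule; with the broad rule first, the broad rule takes it and the detail rule never sees it. Keeping the most specific (lowest-level) patterns at the top avoids that.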
Spider file
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import Wallpaper2Item


class PrettySpider(CrawlSpider):
    name = 'pretty'
    allowed_domains = ['www.enterdesk.com']
    start_urls = ['https://www.enterdesk.com/special/wmtp/']
    rules = (
        Rule(LinkExtractor(allow=r'//www\.enterdesk\.com/download/\d+-\d+/'), callback='parse_item', follow=False),
        # Download page: the image URL is found here, so a callback is needed to parse it
        Rule(LinkExtractor(allow=r'/bizhi/\d+-\d+\.html'), follow=True),
        # Detail page: only used to find the download page, no callback needed
        Rule(LinkExtractor(allow=r'https://www\.enterdesk\.com/bizhi/\d+\.html'), follow=True),
        # Detail page, no callback needed
    )

    def parse_item(self, response):
        item = Wallpaper2Item()
        # The images pipeline reads the list of URLs from the image_urls field
        item['image_urls'] = response.xpath('//img[@id="down_main_pic"]/@src').extract()
        print(item)
        yield item
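One thing to watch in `parse_item`: the images pipeline issues a request for every entry in `image_urls`, so each entry needs to be an absolute URL with a scheme. The form of the `src` attribute on this site is not shown here; if it turns out to be relative or protocol-relative, a sketch like the following would normalize it. It uses the standard library's `urljoin` (inside a spider, `response.urljoin` does the same job against the response URL). The helper name and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(page_url, srcs):
    """Resolve each src against the page URL so it carries a scheme and host."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical page URL and src values, for illustration only.
page = "https://www.enterdesk.com/download/12345-678901/"
srcs = [
    "//img.example.com/a.jpg",        # protocol-relative
    "/static/b.jpg",                  # root-relative
    "https://img.example.com/c.jpg",  # already absolute, left unchanged
]
fixed = absolutize(page, srcs)
print(fixed)
```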
In settings, change the UA, enable the images pipeline, and set where the images are stored
LOG_LEVEL = "WARNING"
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
    # 'wallpaper2.pipelines.Wallpaper2Pipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
    # from scrapy.pipelines.images import ImagesPipeline  (one way to find the ImagesPipeline path)
}
IMAGES_STORE = 'img'
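The UA change mentioned for the settings is not shown in the listing. A minimal sketch, assuming a browser-like string is what is wanted: `USER_AGENT` is the Scrapy setting name, but the value below is only a placeholder, so substitute the string your own browser sends. Note also that ImagesPipeline needs Pillow installed.

```python
# settings.py (fragment)
# Placeholder value: copy the real string from your own browser.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```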
Enable the fields in items
import scrapy


class Wallpaper2Item(scrapy.Item):
    # define the fields for your item here like:
    image_urls = scrapy.Field()
    images = scrapy.Field()
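After a download, the pipeline fills `images` with one dict per image, including its `url` and `path`. With the default ImagesPipeline, the path is `full/<SHA-1 of the image URL>.jpg` under `IMAGES_STORE`. The sketch below reproduces that naming with `hashlib`, so a saved file can be traced back to its URL. The helper name and the sample URL are hypothetical, and a pipeline with a customized `file_path` would use a different layout.

```python
import hashlib
import os


def default_image_path(url, store="img"):
    """Mimic the default ImagesPipeline layout: <store>/full/<sha1(url)>.jpg"""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(store, "full", digest + ".jpg")


# Hypothetical image URL, for illustration only.
saved_to = default_image_path("https://img.example.com/a.jpg")
print(saved_to)
```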
No further customization is needed to start crawling. The result is as follows: