Foreword
This article explains how to use CrawlSpider to crawl all images from a site, using a sample site as the target.
I. What is CrawlSpider?
CrawlSpider is not a standalone crawling framework; it is a subclass of Spider provided by the Scrapy framework. It is used to write rule-based crawlers: you define rules, and the spider extracts data from the pages they match.
CrawlSpider can automatically discover and follow links on a site, which makes whole-site crawling largely automatic. Through Scrapy's pipelines, the scraped data can be saved to local files or a database, and Scrapy additionally offers advanced features such as asynchronous processing and distributed deployment.
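To make the link-following idea concrete, here is a minimal, hypothetical sketch (plain standard library, not Scrapy code) of what a link extractor does: scan a page's HTML for `<a href>` links and keep those matching an allowed pattern, so the crawler can follow them automatically.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Toy stand-in for Scrapy's LinkExtractor: collects hrefs matching a pattern."""
    def __init__(self, allow):
        super().__init__()
        self.allow = re.compile(allow)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and self.allow.search(value):
                    self.links.append(value)

html = '<a href="/4kmeinv/index_2.html">next</a> <a href="/about.html">about</a>'
collector = LinkCollector(allow=r"/4kmeinv/index_\d+")
collector.feed(html)
print(collector.links)  # ['/4kmeinv/index_2.html']
```

Scrapy's real LinkExtractor does much more (deduplication, domain filtering, relative-URL resolution), but the core idea is the same.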
II. Walkthrough
1. Create a new Scrapy project
Run in a terminal:
scrapy startproject netbian
2. Enter the project directory and create a Spider
Run in a terminal:
cd netbian
scrapy genspider netbian_spider www.xxx.com
(Tip: scrapy genspider -t crawl netbian_spider www.xxx.com generates a CrawlSpider-based template directly.)
3. netbian_spider.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from netbian.items import NetbianItem


class NetbianSpider(CrawlSpider):
    name = "netbian_spider"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://pic.netbian.com/4kmeinv/"]

    # Instantiate a link extractor; it is used by the rule parser below.
    # Note: do not name the callback "parse" -- CrawlSpider uses the parse
    # method internally to implement its rule logic, so use another name.
    rules = (Rule(LinkExtractor(allow=r"/4kmeinv/index_\d+"), callback="parse_item", follow=True),)

    def parse_item(self, response, **kwargs):
        print(response)
        li_list = response.xpath('//*[@id="main"]//ul[@class="clearfix"]/li')
        for li in li_list:
            item = NetbianItem()
            item['image_name'] = li.xpath('./a/b/text()').extract_first() + '.jpg'
            item['image_url'] = 'https://pic.netbian.com' + li.xpath('./a/img/@src').extract_first()
            yield item
Code notes:
name: the Spider's name.
allowed_domains: the domains the spider is allowed to crawl.
start_urls: the URLs crawling starts from.
The Rule callback: the method that handles each response matched by the rule, parsing the page and extracting data.
response.xpath: selects page elements with XPath expressions.
yield: emits an item and passes it to the Item Pipeline for processing.
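One detail worth noting: the rule's allow pattern is a regular expression matched against each URL. A quick check with plain `re` (outside Scrapy) shows which paging URLs it picks up; the start page itself does not match the pattern, which is why it appears in start_urls instead.

```python
import re

# The same pattern used in the Rule's LinkExtractor above
pattern = re.compile(r"/4kmeinv/index_\d+")

urls = [
    "https://pic.netbian.com/4kmeinv/",              # start page: no match
    "https://pic.netbian.com/4kmeinv/index.html",    # page 1: no match
    "https://pic.netbian.com/4kmeinv/index_2.html",  # page 2: match
    "https://pic.netbian.com/4kmeinv/index_10.html", # page 10: match
]
matched = [u for u in urls if pattern.search(u)]
print(matched)
```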
4. settings.py
BOT_NAME = "netbian"
SPIDER_MODULES = ["netbian.spiders"]
NEWSPIDER_MODULE = "netbian.spiders"
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
"netbian.pipelines.ImgPipeline": 300,
}
IMAGES_STORE = './images'
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
5. pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    # Request the image data from the image URL
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['image_url'])

    # Decide the file name for the image
    def file_path(self, request, response=None, info=None, *, item=None):
        # img_name = request.url.split('/')[-1]
        img_name = item['image_name']
        return img_name

    def item_completed(self, results, item, info):
        return item  # pass the item on to the next pipeline class
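ImagesPipeline (which requires the Pillow library to be installed) joins the IMAGES_STORE setting with the value returned by file_path to form the final save location. A quick sketch of the resulting path, using a hypothetical file name in place of an actual item['image_name'] value:

```python
import os

IMAGES_STORE = './images'   # from settings.py
img_name = 'sample.jpg'     # hypothetical example of item['image_name']

# ImagesPipeline stores the file at IMAGES_STORE joined with file_path()'s return value
full_path = os.path.join(IMAGES_STORE, img_name)
print(full_path)
```

Because file_path here returns only a bare file name, every image lands directly under ./images; returning a path like 'girls/sample.jpg' would create subfolders instead.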
6. items.py
import scrapy


class NetbianItem(scrapy.Item):
    image_url = scrapy.Field()
    image_name = scrapy.Field()
III. Checking the results
Run in a terminal:
scrapy crawl netbian_spider
Summary
That covers today's content: this article showed how to use the CrawlSpider class to crawl images from an entire site.