[Crawler 8]: CrawlSpider

珊珊而川

Last modified 2023-07-20 17:47:07

Category: Crawlers. Tags: crawler

First published 2023-07-20 17:46:40

Article link: https://blog.csdn.net/weixin_63681863/article/details/131827713

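I. The CrawlSpider class

CrawlSpider is a subclass of Spider, used for crawling a whole site (that is, every page of a paginated listing).

Ways to crawl a whole site:

Method 1: issue the requests manually from a plain Spider

Method 2: use CrawlSpider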

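Usage:

1. Create the project

2. cd xxx

3. Create the spider file (CrawlSpider), where xxx is the spider file name:

scrapy genspider -t crawl xxx www.xxx.com

(1) Link extractor: extracts the links that match a given rule (allow="regex")

(2) Rule: takes the links found by the link extractor and parses their responses with the given callback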
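
As a rough illustration of what a link extractor does with its allow and deny patterns, here is a plain-Python sketch. It is not Scrapy's implementation; it only assumes that each pattern is applied to the URL with `re.search`. The helper name `filter_links` and the sample URLs are made up for this example.

```python
import re


def filter_links(urls, allow=None, deny=None):
    """Keep URLs that match `allow` (if given) and do not match `deny` (if given)."""
    kept = []
    for url in urls:
        if allow and not re.search(allow, url):
            continue  # does not match the allow pattern
        if deny and re.search(deny, url):
            continue  # matches the deny pattern
        kept.append(url)
    return kept


# Hypothetical links found on one page.
urls = [
    "http://www.xxx.com/list/page-2.html",
    "http://www.xxx.com/list/page-3.html",
    "http://www.xxx.com/about.html",
]
page_links = filter_links(urls, allow=r"page-\d+\.html")
print(page_links)
```

Note the escaped dot in `\.html`: an unescaped `.` would match any character, not only a literal dot.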

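[Practice 1]: crawling books

(The Sunshine site (督办回复-阳光热线问政平台) banned my IP and I have not set up a proxy, so this example crawls a book site instead.)

Requirement: scrape each book's title and detailed description

Analysis: the data to scrape is not all on the same page

1. Use a link extractor to extract all the pagination links

2. Use a second link extractor to extract the links to all the book detail pages

So two link extractors are needed.

XPath expressions should not contain a tbody tag: browsers usually insert tbody when rendering a table, but it is often missing from the raw HTML that the spider receives.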
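
The tbody point can be checked without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a small XPath subset, on a hand-written snippet that stands in for raw server HTML; the snippet itself is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Raw markup as a server might send it: the table has no <tbody>.
raw_html = """
<html><body>
  <table>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(raw_html)

# Path as copied from browser dev tools, which often show an inserted <tbody>.
with_tbody = root.findall(".//table/tbody/tr")
# The same path with the tbody step removed.
without_tbody = root.findall(".//table/tr")

print(len(with_tbody), len(without_tbody))
```

The path containing tbody finds nothing here, while the path without it finds both rows.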

sun.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import DetailItem, Sunpro2Item


class SunSpider(CrawlSpider):
    name = "sun"
    # allowed_domains = ["www.xx.com"]
    start_urls = ["http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html"]

    # Link extractor: extracts the links that match the given rule (allow=regex).
    # This one picks up the pagination links; the dot is escaped so it matches literally.
    link = LinkExtractor(allow=r"page-\d+\.html")

    # This one drops every URL containing /books/ (category and pagination pages),
    # so the links that remain are treated as detail pages.
    link_detail = LinkExtractor(deny=r"/books/")
    # Negative-lookahead variant: ^(?!.*book).*$

    rules = (
        # Rule: sends the response of each extracted link to the given callback.
        # follow=True: the link extractor is applied again to the pages behind
        # the links it extracted, similar to recursion.
        Rule(link, callback="parse_item", follow=True),
        Rule(link_detail, callback="parse_detail", follow=False),
    )

    # The two callbacks below cannot pass data to each other through the request,
    # so their results cannot be stored in the same item.
    # Each callback yields its own item type instead.
    def parse_item(self, response):
        # Parse a listing page: one item per book title.
        all_name = response.xpath('//article[@class="product_pod"]/h3/a/@title').extract()
        for name in all_name:
            item = Sunpro2Item()
            item["name"] = name
            yield item

    def parse_detail(self, response):
        # Parse a book detail page: the description paragraphs.
        passage = response.xpath('//article[@class="product_page"]/p/text()').extract()
        item = DetailItem()
        item["passage"] = passage
        yield item
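
The effect of `follow=True` can be pictured with a toy crawl over an in-memory link graph. This is a plain-Python sketch, not Scrapy code: the `SITE` dictionary, the `crawl` function, and the page names are all invented for the illustration.

```python
import re
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "index.html": ["page-2.html", "book-a.html"],
    "page-2.html": ["page-3.html", "book-b.html"],
    "page-3.html": ["book-c.html"],
}


def crawl(start, allow, follow):
    """Return the matching pages reached from `start`, in visit order.

    With follow=False the pattern is applied to the start page only;
    with follow=True it is applied again to every matched page.
    """
    seen, queue, visited = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        if page != start and not follow:
            continue  # do not extract links from pages we were led to
        for link in SITE.get(page, []):
            if re.search(allow, link) and link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited


once = crawl("index.html", r"page-\d+\.html", follow=False)
recursive = crawl("index.html", r"page-\d+\.html", follow=True)
print(once)
print(recursive)
```

Without following, only page-2.html is found, because page-3.html is linked from page-2.html and not from the start page. With following, both pagination pages are reached.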

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class Sunpro2Pipeline:
    def process_item(self, item, spider):
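        # Tell the different item types apart inside the pipeline.
        # When writing the data to a database, how do we keep it consistent?
        # Give the two items the same id value, then do the insert.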

        if item.__class__.__name__ == 'DetailItem':
            print('passage:',item['passage'])
        else:
            print('name:',item['name'])

        return item
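
One way to act on the "same id" idea in the pipeline comment is to have both callbacks attach a shared key and let the pipeline merge the two halves before inserting. The sketch below is plain Python with dictionaries standing in for items. The `MergeBuffer` class, the key "book-1", and the sample values are assumptions for illustration; the original project defines no such key.

```python
class MergeBuffer:
    """Collect partial records by key and release a record once it is complete."""

    def __init__(self, required_fields):
        self.required = set(required_fields)
        self.partial = {}  # key -> fields collected so far

    def add(self, key, fields):
        """Merge `fields` into the record for `key`.

        Returns the merged record once all required fields are present,
        otherwise None.
        """
        record = self.partial.setdefault(key, {})
        record.update(fields)
        if self.required.issubset(record):
            return self.partial.pop(key)
        return None


buffer = MergeBuffer(required_fields=["name", "passage"])
first = buffer.add("book-1", {"name": "Example title"})
second = buffer.add("book-1", {"passage": "Example description"})
print(first)
print(second)
```

The first call returns None because the record is still incomplete; the second returns the merged record, which is the point at which a database insert could happen.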

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

# Reconstructed from the fields that sun.py and pipelines.py use (name, passage).
import scrapy


class Sunpro2Item(scrapy.Item):
    # Book title, set in SunSpider.parse_item
    name = scrapy.Field()


class DetailItem(scrapy.Item):
    # Description text, set in SunSpider.parse_detail
    passage = scrapy.Field()

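II. Distributed crawlers

Concept: set up a distributed cluster and have its machines jointly crawl one shared set of resources

Purpose: speed up data crawling

Steps:

Install scrapy-redis

I will skip the rest since I have no use for it.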

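III. Incremental crawlers

Concept: detect when a site's data has been updated and crawl only the newly added data

Analysis:

1. Specify a start URL

2. Use a Rule to request the other pagination links

3. From the source of each listing page, parse out the URL of every movie detail page

Check whether each movie detail page URL has been requested before

Store the detail page URLs that have been crawled in a Redis set

4. Request each detail page URL, then parse out the movie title and synopsis

5. Persist the data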
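
Steps 3 and 4 hinge on one check: has this detail URL been seen before? Redis's SADD command returns the number of members it actually added, so 1 means the URL is new and 0 means it was already stored. The sketch below uses an in-memory stand-in so it runs without a Redis server. `FakeRedis`, the key name `detail_urls`, and the movie URLs are hypothetical.

```python
class FakeRedis:
    """In-memory stand-in for the one Redis command used here (SADD)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        """Add values to the set at `key`; return how many were new, as SADD does."""
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before


def new_detail_urls(conn, urls, key="detail_urls"):
    """Return only the URLs not seen before, recording each one as seen."""
    return [url for url in urls if conn.sadd(key, url) == 1]


conn = FakeRedis()
first_run = new_detail_urls(
    conn, ["http://www.xxx.com/movie/1.html", "http://www.xxx.com/movie/2.html"]
)
second_run = new_detail_urls(
    conn, ["http://www.xxx.com/movie/2.html", "http://www.xxx.com/movie/3.html"]
)
print(first_run)
print(second_run)
```

On the second run only movie/3.html comes back, so only that page would be requested and parsed. In a real spider the stand-in would be replaced by a Redis client connection.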
