Python Web Crawlers (Chapter 12: CrawlSpider, a Spider That Crawls Web Pages Automatically)

Article link: https://blog.csdn.net/qq_38633279/article/details/119764696

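1. CrawlSpider

CrawlSpider is a class, a subclass of Spider.
There are two ways to crawl an entire site:

1. With Spider: build and send each follow-up request by hand
2. With CrawlSpider: declare link-extraction rules and let the framework follow the links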
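
For contrast, the manual approach (option 1) usually fills a page number into a URL template and requests each page explicitly. The following is a minimal sketch of only the URL-building part, using the list-page URL from the case below; the helper name `page_urls` and the page range are hypothetical and chosen purely for illustration.

```python
# URL template for the list pages; the page number is filled in by hand.
URL_TEMPLATE = "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page={}"


def page_urls(first, last):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [URL_TEMPLATE.format(n) for n in range(first, last + 1)]


# Hypothetical range: the real number of pages is not known in advance.
urls = page_urls(1, 3)
for url in urls:
    print(url)
```

With CrawlSpider this bookkeeping is unnecessary, because the pagination links are discovered from the pages themselves.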

Case 1: Crawling the mini-program community site (wxapp-union.com)

Step 1. scrapy startproject tengxunPro
Step 2. cd tengxunPro
Step 3. scrapy genspider -t crawl tengxun www.xxx.com

Step 4. The spider (spiders/tengxun.py)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
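# Two item classes whose names differ only in the case of the final "i":
# TengxunproItem holds list-page fields, Tengxunproitem holds detail-page fields.
from tengxunPro.items import TengxunproItem, Tengxunproitem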
class TengxunSpider(CrawlSpider):
    name = 'tengxun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']   #https://careers.tencent.com/search.html?index=1

    # Link extractor: extracts the links that match the given rule (allow=<regex>)
    link = LinkExtractor(allow=r'page=\d+')

    # Link extractor for detail pages, e.g.
    # https://www.wxapp-union.com/article-7141-1.html
    # https://www.wxapp-union.com/article-7137-1.html
    detail_link = LinkExtractor(allow=r'.+article-.+\.html')

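    rules = (
        # Rule: passes the pages behind the extracted links to the given callback for parsing.
        # follow=True keeps applying the link extractors to the pages reached through
        # the extracted links; with follow=False those pages are not scanned for more links.
        Rule(link, callback='parse_item', follow=False),

        # follow is omitted here; it defaults to False when a callback is given.
        Rule(detail_link, callback='parse_detail'),
    )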

    # Parse a list page: extract each article's title and detail-page URL
    def parse_item(self, response):
        div_list = response.xpath('//*[@id="itemContainer"]/div')
        for div in div_list:
            # Create a fresh item per entry so the yielded items do not share state
            item = TengxunproItem()
            item['detail_url'] = div.xpath('./a/@href').extract_first()
            item['title'] = div.xpath('./a/img/@alt').extract_first()
            yield item


    # Parse a detail page
    def parse_detail(self, response):
        item = Tengxunproitem()
        # Article heading
        detail_title = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[1]/h1/text()').extract_first()
        # All text nodes of the article body, as a list of strings
        contents = response.xpath('//*[@id="article_content"]/div//text()').extract()

        item['detail_title'] = detail_title
        item['contents'] = contents
        yield item
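
Before running the spider, the two allow patterns can be checked against sample URLs. The sketch below uses only the standard `re` module, relying on the fact that LinkExtractor tests each allow pattern with a regular-expression search on the absolute URL. The `classify` helper and the `forum.php` URL are hypothetical and added for illustration; the first-match-wins order mirrors how CrawlSpider hands a link that matches several rules to the first rule listed.

```python
import re

# The same patterns passed to LinkExtractor(allow=...) in the spider.
PAGE_PATTERN = re.compile(r'page=\d+')
DETAIL_PATTERN = re.compile(r'.+article-.+\.html')

candidates = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "https://www.wxapp-union.com/article-7141-1.html",
    "https://www.wxapp-union.com/forum.php",  # hypothetical unrelated link
]


def classify(url):
    """Return which rule would pick up the URL, or None if neither matches."""
    if PAGE_PATTERN.search(url):
        return "list"
    if DETAIL_PATTERN.search(url):
        return "detail"
    return None


labels = {url: classify(url) for url in candidates}
for url, label in labels.items():
    print(label, url)
```

A URL labelled `None` is simply ignored by both rules, which is how unrelated navigation links stay out of the crawl.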

Step 5. items.py

import scrapy


class TengxunproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    detail_url = scrapy.Field()

class Tengxunproitem(scrapy.Item):
    # define the fields for your item here like:
    detail_title = scrapy.Field()
    contents = scrapy.Field()

Step 6. pipelines.py

class TengxunproPipeline:
    def process_item(self, item, spider):
        # Tell the two item types apart by their class name
        if item.__class__.__name__ == 'Tengxunproitem':
            print(item['detail_title'],item['contents'])
        else:
            print(item['title'],item['detail_url'])
        return item
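
Because the two item class names differ only in letter case, a string comparison on `__class__.__name__` is easy to mistype. One possible alternative is `isinstance`. The sketch below uses plain dict subclasses as stand-ins for the scrapy.Item classes so that it runs without Scrapy installed; `describe` is a hypothetical helper and not part of the original pipeline.

```python
# Stand-ins for the two classes in items.py (assumption: plain dicts
# replace scrapy.Item here so the example runs on its own).
class TengxunproItem(dict):
    """List-page item: title and detail_url."""


class Tengxunproitem(dict):
    """Detail-page item: detail_title and contents."""


def describe(item):
    """Return a (kind, headline) pair based on the item's class."""
    if isinstance(item, Tengxunproitem):
        return ("detail", item["detail_title"])
    return ("list", item["title"])


list_item = TengxunproItem(title="Sample title", detail_url="article-1-1.html")
detail_item = Tengxunproitem(detail_title="Sample heading", contents=["text"])

print(describe(list_item))
print(describe(detail_item))
```

In a real pipeline the same check would import the two classes from items.py instead of redefining them.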

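Note: 1. Differences between CrawlSpider and Spider

With CrawlSpider:
- The spider is generated with: scrapy genspider -t crawl <name> www.xxx.com
- Links are collected by a link extractor: link = LinkExtractor(allow=r'<regex>')
- Rule(link, callback='parse_item', follow=False): the callback names the method, here parse_item(), that parses each extracted page
- Instead of passing an item between requests as a request parameter, each callback such as parse_item() gets its own item class, defined separately in items.py
- pipelines.py still has to handle storing the data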
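
The effect of the follow argument in the Rule above can be illustrated with a toy model. This is not Scrapy's scheduler: the `SITE` dictionary is a hypothetical in-memory site in which each list page links only to its neighbouring pages, and `crawl` is an illustrative helper. It assumes, as in CrawlSpider, that the start page is always scanned for links, while pages reached through extracted links are scanned only when follow is true.

```python
import re
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "list?page=1": ["list?page=2", "list?page=3"],
    "list?page=2": ["list?page=1", "list?page=3", "list?page=4"],
    "list?page=3": ["list?page=2", "list?page=4"],
    "list?page=4": ["list?page=3", "list?page=5"],
    "list?page=5": ["list?page=4"],
}
PATTERN = re.compile(r"page=\d+")


def crawl(start, follow):
    """Return the pages visited, in order, for a given follow setting."""
    visited = [start]
    queue = deque([(start, True)])  # the start page is always scanned
    while queue:
        url, scan = queue.popleft()
        if not scan:
            continue
        for link in SITE.get(url, []):
            if PATTERN.search(link) and link not in visited:
                visited.append(link)
                queue.append((link, follow))
    return visited


print(crawl("list?page=1", follow=False))
print(crawl("list?page=1", follow=True))
```

In this model follow=False stops at the pages linked directly from the start page, while follow=True eventually reaches every page of the pagination.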