(30) Web Crawlers: Using CrawlSpider to Automatically Extract the Links to Crawl

Latest recommended article published on 2023-03-09 14:33:01

小蜗笔记

Category: 爬虫实战模块 (Crawler in Practice)

Article link: https://blog.csdn.net/qq_42830971/article/details/107931496

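This article presents an example spider built with the Scrapy framework that scrapes the chapter titles and chapter text of a novel from zwdu.com. It defines a CrawlSpider subclass, sets the start URL and the allowed domain, and uses LinkExtractor and Rule to pick the links to follow out of each page and extract the required data.

Summary generated automatically by CSDN.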

Generate the spider skeleton from the crawl template:

scrapy genspider -t crawl zwr zwdu.com

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


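class ZwrSpider(CrawlSpider):
    name = 'zwr'
    allowed_domains = ['zwdu.com']
    # Start from the book's index page, where the chapter links are listed.
    start_urls = ['https://www.zwdu.com/book/10304/']

    rules = (
        # Follow every chapter link found inside a <dd> element.
        Rule(LinkExtractor(restrict_xpaths="//dd/a"), callback='parse_item', follow=True),
        # Follow the third link in the 'bottem1' navigation bar of a chapter page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='bottem1']/a[3]"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # The chapter title is the text of the page's <h1>.
        title = response.xpath('//h1/text()').get()
        # Join all text nodes of the content div, then put each indented paragraph on its own line.
        content = ''.join(response.xpath("//div[@id='content']/text()").getall()).replace('    ', '\n    ')
        yield {'title': title, 'content': content}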
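The join-and-replace step in parse_item can be hard to picture, so here is a minimal sketch of it in plain Python, without Scrapy. The function name rebuild_paragraphs and the sample text nodes are hypothetical; the sketch assumes each paragraph arrives as a separate text node that starts with a four-space indent, as the replace call in the spider suggests.

```python
def rebuild_paragraphs(text_nodes, indent="    "):
    """Join text nodes and start every indented paragraph on a new line."""
    # Text nodes come back as separate strings; glue them into one string first.
    joined = "".join(text_nodes)
    # Each paragraph begins with the indent, so prefixing it with a newline
    # restores the paragraph breaks that were lost in the join.
    return joined.replace(indent, "\n" + indent)


if __name__ == "__main__":
    # Hypothetical text nodes, standing in for the result of the XPath query.
    nodes = ["    First paragraph.", "    Second paragraph."]
    print(rebuild_paragraphs(nodes))
```

One consequence of this approach is that any run of four spaces inside a paragraph is also treated as a paragraph start, so it only suits pages whose text uses that indent exclusively at the beginning of paragraphs.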