CrawlSpider explained, and using CrawlSpider to scrape a novel from 81zw (八一中文网)

Latest recommended article published on 2022-12-09 14:55:41

weixin_46837101

Category: Crawler series. Tags: python

Article link: https://blog.csdn.net/weixin_46837101/article/details/106961122

How to write a CrawlSpider class

1. CrawlSpiders

How it works (diagram)

```
sequenceDiagram
    start_urls->>Scheduler: initial URLs
    Scheduler->>Downloader: request
    Downloader->>rules: response
    rules->>DataExtraction: response
    rules->>Scheduler: new URLs
```

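The following command quickly generates the boilerplate code for a CrawlSpider:

scrapy genspider -t crawl <spider_name> <allowed_domain>

First, a word about Spider: it is the base class of all spiders, and CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider is the better fit when you need to extract links from the crawled pages and keep crawling them.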

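2. Rule objects

Both the Rule class and the CrawlSpider class live in the scrapy.spiders module (old Scrapy releases exposed them under scrapy.contrib.spiders, which has since been removed).

class scrapy.spiders.Rule(link_extractor,
    callback=None, cb_kwargs=None,
    follow=None, process_links=None,
    process_request=None)

Parameters:

link_extractor: a LinkExtractor that defines which links to extract
callback: the function called with the response of each link extracted by link_extractor

Note on callback: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will stop working.
follow: whether links extracted from a response by this rule should be followed. When callback is None, follow defaults to True; otherwise it defaults to False.
process_links: mainly used to filter the links obtained by link_extractor
process_request: mainly used to filter the requests generated by the rule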

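3. LinkExtractors

3.1 Concept

As the name suggests, a link extractor.

3.2 Purpose

A LinkExtractor pulls links out of a response object, and those links are then crawled. Each LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.

3.3 Usage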

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)

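Main parameters:

allow: URLs matching the regular expression (or list of regular expressions) are extracted; if empty, every URL matches.
deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: an XPath expression that limits the regions of the page where links are looked for; it works together with allow to filter links (select the node only, not its attribute).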

3.3.1 Checking the result (verify in the shell)

First run:

scrapy shell http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml

Then import the relevant module:

from scrapy.linkextractors import LinkExtractor

Build an extractor for the links on the current page:

link = LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]'))

Call the extract_links() method of the LinkExtractor instance to see what matches:

link.extract_links(response)

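3.3.2 Checking the result, CrawlSpider version

The function name passed to callback is written as a quoted string
The function name must not be parse
Pay attention to formatting, for example rules must be an iterable of Rule objects

Scraping a novel from 81zw (八一中文网) with the CrawlSpider class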

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# CrawlSpider subclasses Spider and adds the rules attribute; everything else works as in Spider
class Zzw2Spider(CrawlSpider):
    name = 'zzw2'
    allowed_domains = ['81zw.com']
    start_urls = ['https://www.81zw.com/book/13205/']

    rules = (
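        # restrict_xpaths should select the <a> node itself; its href is read automatically. Appending the attribute (e.g. /@href) raises an error saying str has no iter attribute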
        Rule(LinkExtractor(restrict_xpaths='//div[@id="list"]/dl/dd[2]/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="bottem1"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('    ', '\n')

        yield {
            'title': title,
            'content': content
        }
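The spider above joins all text nodes and turns runs of four spaces into line breaks. An alternative sketch follows, assuming each text node under the content div is one paragraph indented with ordinary or non-breaking spaces; this is a guess about the page markup, and clean_content is a hypothetical helper, not part of the original code.

```python
def clean_content(fragments):
    """Join XPath text fragments into paragraphs, one per line."""
    paragraphs = []
    for fragment in fragments:
        # Treat non-breaking spaces like ordinary whitespace, then trim.
        text = fragment.replace("\xa0", " ").strip()
        if text:  # skip fragments that were only whitespace
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Sample fragments shaped like the output of .extract() on text nodes.
sample = [
    "\xa0\xa0\xa0\xa0First paragraph.",
    "\r\n",
    "\xa0\xa0\xa0\xa0Second paragraph.\r\n",
]
cleaned = clean_content(sample)
```

Inside parse_item this could be used as clean_content(response.xpath('//div[@id="content"]/text()').extract()), which avoids depending on the exact number of indent characters.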

Saving the data in pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class XiaoshuoPipeline:
    def open_spider(self,spider):
        self.filename=open('lwcs.txt','w',encoding='utf-8')

    def process_item(self, item, spider):

        title=item['title']
        content=item['content']
        # "TypeError: can only concatenate list (not "str") to list" is raised when the XPath results do not have a consistent format (a list in one place, a str in another)
        #info = title + '\n' + content + '\n'
        info = title + '\n'
        self.filename.write(info)

        # If the file looks empty, the writes are still buffered; a few dozen items may not show up until the buffer is flushed
        self.filename.flush()
        return item

    def close_spider(self,spider):
        self.filename.close()
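The pipeline above writes only the title. A possible variant that also writes the chapter body is sketched below; it is not the author's code. NovelTextPipeline and its path argument are hypothetical, and the guards assume that title may be None (extract_first() returns None when nothing matches) and that content may arrive either as a str or as a list of strings.

```python
class NovelTextPipeline:
    """Hypothetical variant of the pipeline above that also saves the content."""

    def __init__(self, path="lwcs.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Fall back to an empty string so concatenation never sees None.
        title = item.get("title") or ""
        content = item.get("content") or ""
        # Normalise a list of fragments to a single string.
        if isinstance(content, (list, tuple)):
            content = "".join(content)
        self.file.write(f"{title}\n{content}\n")
        # Flush so the text is visible in the file while the crawl is running.
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
```

As the comment at the top of pipelines.py says, a pipeline only runs once it is listed in ITEM_PIPELINES in settings.py, for example under a key such as "xiaoshuo.pipelines.NovelTextPipeline" with a priority like 300, where the package name xiaoshuo is an assumption about the project layout.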
