Learning CrawlSpider

Latest recommended article published on 2023-06-28 10:42:36

阿旺不会飞

Category: python  Tags: python

Article link: https://blog.csdn.net/weixin_43604442/article/details/104017361

CrawlSpider

Using CrawlSpider
Example

Using CrawlSpider

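1. scrapy startproject <project name>
2. cd into the project directory
3. scrapy genspider -t crawl <spider name> <allowed domain>
4. Set start_urls; the responses for these URLs are passed through rules to extract URLs
5. Complete rules by adding Rule objects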

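Notes:
LinkExtractors: a link extractor exists to extract links from a response; the method that gets called is extract_links(). It provides filters so that only links matching the given regular expressions are extracted. The filters are configured through the following constructor parameter:

allow (a regular expression (or list of)) – a URL must match this regular expression (or one of the regular expressions in the list) to be extracted. If it is not given (or is empty), all links match.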
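
The filtering behaviour of allow can be sketched with the standard library alone. This is a minimal illustration, not Scrapy's implementation: filter_links is a hypothetical helper, and the info4155.htm and about.htm URLs are made up for the example.

```python
import re


def filter_links(urls, allow=()):
    """Keep the URLs that match at least one pattern in `allow`.

    An empty `allow` keeps every URL, mirroring the documented behaviour.
    """
    if isinstance(allow, str):
        allow = [allow]
    patterns = [re.compile(p) for p in allow]
    if not patterns:
        return list(urls)
    # search() matches anywhere in the URL, so a path fragment is enough
    return [u for u in urls if any(p.search(u) for p in patterns)]


# Hypothetical URLs, shaped like the ones used in the example further down
urls = [
    "http://circ.gov.cn/web/site0/tab5240/info4155.htm",
    "http://circ.gov.cn/web/site0/tab5240/module14430/page2.htm",
    "http://circ.gov.cn/about.htm",
]

detail = filter_links(urls, allow=r"/web/site0/tab5240/info\d+\.htm")
print(detail)              # only the info page survives
print(filter_links(urls))  # no allow given: everything is kept
```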


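Rules: rules contains one or more Rule objects, and each Rule defines a specific behavior for crawling the site. If several rules match the same link, the first one, in the order they are defined in rules, is used.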
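
The "first matching rule wins" behaviour can be pictured as follows. This is a simplified sketch under stated assumptions: RULES is a hypothetical list of (pattern, label) pairs standing in for Rule objects, and first_matching_rule is not a Scrapy function.

```python
import re

# Hypothetical rule table: the order matters, as it does in `rules`
RULES = [
    (r"/info\d+\.htm", "detail"),
    (r"\.htm$", "any-page"),
]


def first_matching_rule(url, rules=RULES):
    """Return (index, label) of the first rule whose pattern matches url."""
    for index, (pattern, label) in enumerate(rules):
        if re.search(pattern, url):
            return index, label
    return None  # no rule matches, so the link is ignored


# Both patterns match this URL, but the rule defined first is the one used
print(first_matching_rule("http://circ.gov.cn/web/site0/tab5240/info4155.htm"))
```

Putting the most specific pattern first keeps a broad rule from claiming links meant for a narrower one.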

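callback: the callable (or the name of a spider method, given as a string) to be called for each link extracted by link_extractor. The callback receives a response as its first argument. Note: when writing crawl spider rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will no longer work.
follow: a boolean that specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.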
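
The default for follow is easy to get wrong, so here is a small sketch of the rule as stated above. resolve_follow is a hypothetical helper written for illustration, not part of Scrapy's API.

```python
def resolve_follow(callback=None, follow=None):
    """Return the effective follow value for a rule.

    An explicit follow always wins; otherwise links are followed
    only when the rule has no callback.
    """
    if follow is not None:
        return bool(follow)
    return callback is None


print(resolve_follow(callback="parse_item"))               # False
print(resolve_follow())                                    # True
print(resolve_follow(callback="parse_item", follow=True))  # True
```

In practice this means a rule that both parses a page and should keep crawling from it needs follow=True set explicitly.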

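If an extracted URL is incomplete (relative), CrawlSpider completes it automatically.
Do not define a parse method; it has a special meaning and is defined inside the framework.
callback: handles the responses for the URLs extracted by the link extractor.
follow: whether the responses for the URLs extracted by the link extractor are passed through rules again for further extraction.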
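
Completing a relative URL means resolving it against the URL of the page it was found on. The standard library's urljoin shows the idea; the href values below are assumptions made up for the example.

```python
from urllib.parse import urljoin

# The page the links were found on (the start URL used in the example below)
page_url = "http://circ.gov.cn/web/site0/tab5240/module14430/page1.htm"

# Hypothetical href values as they might appear in the page source
hrefs = ["/web/site0/tab5240/info4155.htm", "page2.htm"]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# http://circ.gov.cn/web/site0/tab5240/info4155.htm
# http://circ.gov.cn/web/site0/tab5240/module14430/page2.htm
```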

Example

scrapy startproject circ: create the Scrapy project
cd circ: enter the project directory
scrapy genspider -t crawl cf circ.gov.cn: generate the spider

Open cf.py in an IDE and edit it:

# -*- coding: utf-8 -*-
import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from circ.items import CircItem


class CfSpider(CrawlSpider):
    name = 'cf'
    allowed_domains = ['circ.gov.cn']
    start_urls = ['http://circ.gov.cn/web/site0/tab5240/module14430/page1.htm']

    rules = (
        # Detail pages: handled by parse_item (follow defaults to False here)
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+\.htm'), callback='parse_item'),
        # List pages: no callback, only keep following the pagination links
        # (\d+ so that page numbers with more than one digit also match)
        Rule(LinkExtractor(allow=r'/web/site0/tab5240/module14430/page\d+\.htm'), follow=True),
    )

    def parse_item(self, response):
        html = response.text
        item = CircItem()
        # Raw strings keep \d from being read as a string escape;
        # [0] raises IndexError if a page lacks the expected markers
        item["title"] = re.findall(r"<!--TitleStart-->(.*?)<!--TitleEnd-->", html)[0]
        item["publish_date"] = re.findall(r"发布时间：(20\d{2}-\d{2}-\d{2})", html)[0]
        yield item
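
The two regular expressions can be tried out without running a crawl. The sketch below applies them to a made-up HTML fragment; sample_html and the first_match helper are assumptions for illustration, and re.S is added here so a title spanning several lines would still match.

```python
import re

TITLE_RE = re.compile(r"<!--TitleStart-->(.*?)<!--TitleEnd-->", re.S)
DATE_RE = re.compile(r"发布时间：(20\d{2}-\d{2}-\d{2})")


def first_match(pattern, text, default=None):
    """Return the first captured group, or default when nothing matches."""
    match = pattern.search(text)
    return match.group(1).strip() if match else default


# Hypothetical page fragment using the same markers the spider looks for
sample_html = """
<div><!--TitleStart-->Sample notice title<!--TitleEnd--></div>
<span>发布时间：2020-01-17</span>
"""

print(first_match(TITLE_RE, sample_html))        # Sample notice title
print(first_match(DATE_RE, sample_html))         # 2020-01-17
print(first_match(DATE_RE, "<p>no date</p>"))    # None
```

Returning a default instead of indexing with [0] lets a page without the expected markers produce an empty field rather than an exception.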

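Open settings.py and uncomment the pipeline configuration (the original screenshot is unavailable; given the pipeline used here, this is presumably the ITEM_PIPELINES setting).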
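
Since the screenshot is missing, here is what that setting typically looks like once uncommented. This is an assumption based on the project name circ and the CircPipeline class in this post, not a copy of the author's file; 300 is the priority value the project template usually generates.

```python
# settings.py (fragment)
# Enable the pipeline; lower numbers run earlier when several are listed
ITEM_PIPELINES = {
    'circ.pipelines.CircPipeline': 300,
}
```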
Write the items to a file in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

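import json


class CircPipeline(object):
    def process_item(self, item, spider):
        # Append each item to data.json; the file ends up holding a sequence
        # of JSON objects, one after another, not a single JSON document
        with open("data.json", "a", encoding="utf-8") as f:
            json.dump(dict(item), f, ensure_ascii=False, indent=2)
            f.write("\n")
        return item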
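
Reopening the file for every item works but is wasteful on larger crawls. A possible variation, sketched under assumptions, opens the file once per crawl using the open_spider and close_spider hooks and writes one JSON object per line. JsonLinesPipeline and the data.jsonl file name are hypothetical, not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical alternative to CircPipeline: one JSON object per line."""

    def __init__(self, path="data.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

Each line of the resulting file can be parsed on its own with json.loads, which makes the output easier to read back than a run of concatenated objects. To use it, the class would also have to be listed in ITEM_PIPELINES.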

The contents of items.py are as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CircItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    publish_date = scrapy.Field()

Switch to the command line
Run scrapy crawl cf to start the spider
The result is shown below; the data was scraped successfully:
![ ](https://img-blog.csdnimg.cn/20200117122505912.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzYwNDQ0Mg==,size_16,color_FFFFFF,t_70)
