How to Use CrawlSpider

BRUIN.

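First, create the project from the command line (cmd). Once it has been created, update start_urls in the generated spider, and then you can start writing the rules.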

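cd target_folder    change to the target folder
scrapy startproject project_name    create the Scrapy project
cd project_name    enter the project folder
scrapy genspider -t crawl spider_name allowed_domain    create the spider file from the crawl (CrawlSpider) template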

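1. LinkExtractor
A LinkExtractor picks out the URLs you want so that requests can be sent to them. It searches every crawled page for URLs that satisfy its rules, which is what makes the crawl automatic.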

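Main parameters:
allow: the URLs to extract, given as regular expressions.
deny: the URLs to skip. Any URL that matches one of these regular expressions is not extracted.
allow_domains: only URLs under the domains listed here are extracted.
deny_domains: URLs under the domains listed here are never extracted.
restrict_xpaths: XPath expressions that select the regions of the page from which links are extracted. Works together with allow to filter links.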
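
To make the interaction of these filters concrete, here is a rough standard-library sketch of the filtering logic. It is not Scrapy's implementation; the helper name keep_url and the example.com URLs are made up for illustration, and it only assumes that patterns are matched anywhere in the URL and that deny wins over allow.

```python
import re
from urllib.parse import urlparse


def keep_url(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Return True if the URL passes the four filters (illustrative only)."""
    host = urlparse(url).hostname or ""

    def in_domains(domains):
        # A domain matches itself and any of its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    if allow_domains and not in_domains(allow_domains):
        return False
    if deny_domains and in_domains(deny_domains):
        return False
    # An empty allow list means "allow everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny takes precedence over allow.
    if any(re.search(p, url) for p in deny):
        return False
    return True


urls = [
    "http://example.com/list?page=2",
    "http://example.com/item/42.html",
    "http://example.com/login",
    "http://other.example.org/item/7.html",
]
kept = [
    u for u in urls
    if keep_url(
        u,
        allow=[r"/item/\d+\.html", r"page=\d+"],
        deny=[r"/login"],
        allow_domains=["example.com"],
    )
]
print(kept)
```

In a real spider the same arguments would be passed straight to LinkExtractor, which also handles extracting the links from the page.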

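2. Rule

link_extractor: a LinkExtractor object that defines which URLs to crawl.
callback: the callback to run for URLs that satisfy this rule. CrawlSpider uses parse internally to implement its own logic, so do not override parse or use it as your callback.
follow: whether the links found in responses matched by this rule should keep being extracted and followed. If omitted, it defaults to True when no callback is given and to False otherwise.
process_links: the links obtained from the link extractor are passed to this function, which can filter out the ones that should not be crawled.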
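
As a small illustration of process_links, the sketch below shows a filter function of the expected shape: it receives a list of link objects and returns the ones to keep. The function name drop_print_links and the "print=1" condition are hypothetical, and SimpleNamespace objects stand in for Scrapy's Link objects, relying only on their url attribute.

```python
from types import SimpleNamespace


def drop_print_links(links):
    """Keep only links whose URL does not contain 'print=1' (hypothetical filter)."""
    return [link for link in links if "print=1" not in link.url]


# Stand-ins for scrapy.link.Link objects, which expose a .url attribute.
links = [
    SimpleNamespace(url="http://example.com/item/1.html"),
    SimpleNamespace(url="http://example.com/item/1.html?print=1"),
]
filtered = drop_print_links(links)
print([link.url for link in filtered])
```

In a spider, such a function would be handed to the rule as Rule(LinkExtractor(...), callback='parse_item', process_links=drop_print_links).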

The code for scraping data from the Sunshine government-affairs platform (wz.sun0769.com) is as follows:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import MyspiderItem  # assumes MyspiderItem is defined in the project's items.py

class Yg2Spider(CrawlSpider):
    name = 'yg2'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type']

    rules = (
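        # Listing pages: keep following them. The literal '?' and '.' are escaped
        # because allow takes regular expressions.
        Rule(LinkExtractor(allow=r'wz\.sun0769\.com/index\.php/question/report\?page=\d+'), follow=True),

        # Detail pages: hand them to parse_item.
        Rule(LinkExtractor(allow=r'wz\.sun0769\.com/html/question/202003/\d+\.shtml'), callback='parse_item'),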
    )

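    def parse_item(self, response):
        # MyspiderItem must declare every field assigned below.
        item = MyspiderItem()
        item['name'] = response.xpath('//div[@class="wzy3_2"]/span/text()').get()

        # Placeholders from the `scrapy genspider -t crawl` template, left commented out:
        # the 'name' line would overwrite the value extracted above.
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item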
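
Because allow takes regular expressions, a literal ? or . in a URL has to be escaped, otherwise the pattern may silently match nothing. A quick check with Python's re module, using a hypothetical listing URL of the shape the first rule targets (the page number 30 is made up):

```python
import re

# Hypothetical listing-page URL; only its shape matters here.
url = "http://wz.sun0769.com/index.php/question/report?page=30"

# Unescaped: '?' makes the preceding 't' optional instead of matching a literal '?'.
unescaped = r"wz.sun0769.com/index.php/question/report?page=\d+"
# Escaped: '\?' and '\.' match the literal characters.
escaped = r"wz\.sun0769\.com/index\.php/question/report\?page=\d+"

unescaped_matches = bool(re.search(unescaped, url))
escaped_matches = bool(re.search(escaped, url))
print(unescaped_matches, escaped_matches)
```

Running a candidate pattern against a few real URLs from the site like this is a cheap way to confirm a rule before starting the crawl.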