Creating a Scrapy spider with the crawl template and link extractors

Python And Go

Published 2023-05-23 15:36:50


Article link: https://blog.csdn.net/weixin_43474835/article/details/130828334


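This article describes how to create a spider with Scrapy's CrawlSpider template, with emphasis on the role of Rule and LinkExtractor in defining crawl rules. Example code shows how to extract detail-page and pagination URLs, and how the follow attribute decides whether extracted links are crawled further. Data processing is done in the parse_item method.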


Notes on creating a spider with Scrapy's crawl template:

By default, a new Scrapy spider is generated from the basic template:

scrapy genspider <spider_name> <allowed_domain>

To generate it from the crawl template instead:

scrapy genspider -t crawl <spider_name> <allowed_domain>

The generated spider looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CheSpider(CrawlSpider):
    name = "che"
    allowed_domains = ["xxxx.com"]
    start_urls = ["https://xxxx.com"]

    # rules must be an iterable of Rule objects, hence the trailing comma
    rules = (
        Rule(LinkExtractor(allow=r""), callback="parse_item", follow=False),
    )

    def parse_item(self, resp, **kwargs):
        item = {}
        return item

Here is the modified version:

class CheSpider(CrawlSpider):
    name = "che"
    allowed_domains = ["che168.com"]
    start_urls = ["https://www.che168.com/chengdu/8_10/a1_3ms8dgscncgpi1ltocspexx0/"]
    # Extract the URLs of the detail pages
    lk1 = LinkExtractor(restrict_xpaths='//ul[@class="viewlist_ul"]/li/a')
    # Extract the pagination URLs
    lk2 = LinkExtractor(restrict_xpaths='//div[@id="listpagination"]/a')
    # Crawling rules
    rules = (
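        # How callback works: when a response comes back, the link extractor pulls
        # the matching URLs out of it and Scrapy sends requests for them automatically.
        # Once the response for one of those detail-page URLs arrives, callback is run on it.
        # follow controls whether the response fetched for an extracted link is itself
        # run through all of the rules again.
        Rule(lk1, callback="parse_item", follow=False),  # detail pages
        Rule(lk2, follow=True),  # pagination; no callback is needed here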
    )

    def parse_item(self, resp, **kwargs):
        print(resp.url)
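
The effect of follow can be hard to picture from the rules alone. What follows is a simplified, pure-Python model of the behaviour described in the comments, not Scrapy's actual implementation: `SITE`, `crawl`, and the `/list/...` and `/car/...` URLs are all made up for illustration, and each rule is reduced to a `(link_kind, has_callback, follow)` tuple.

```python
from collections import deque

# Hypothetical site: each URL maps to the links on that page, tagged with the
# kind of link extractor that would match them ("detail" or "page").
SITE = {
    "/list/1": [("detail", "/car/a"), ("detail", "/car/b"), ("page", "/list/2")],
    "/list/2": [("detail", "/car/c"), ("page", "/list/1"), ("page", "/list/3")],
    "/list/3": [("detail", "/car/d"), ("page", "/list/2")],
    "/car/a": [("detail", "/car/z")],  # a "related car" link on a detail page
}


def crawl(start_url, rules):
    """Return the URLs that would reach the callback, in visiting order.

    rules is a list of (link_kind, has_callback, follow) tuples.
    """
    parsed = []
    seen = {start_url}  # stand-in for a duplicate-request filter
    # The start page has no callback, but the rules are applied to it.
    queue = deque([(start_url, False, True)])
    while queue:
        url, has_callback, follow = queue.popleft()
        if has_callback:
            parsed.append(url)
        if not follow:
            continue  # links on this page are never extracted
        for kind, rule_callback, rule_follow in rules:
            for link_kind, link in SITE.get(url, []):
                if link_kind == kind and link not in seen:
                    seen.add(link)
                    queue.append((link, rule_callback, rule_follow))
    return parsed


# Same shape as the spider's rules: detail pages parsed, pagination followed.
as_in_spider = crawl("/list/1", [("detail", True, False), ("page", False, True)])
# Pagination not followed: only the first listing page contributes detail pages.
no_paging = crawl("/list/1", [("detail", True, False), ("page", False, False)])

print(as_in_spider)
print(no_paging)
```

In this model the spider's rule set reaches `/car/a` through `/car/d`, while switching the pagination rule to `follow=False` stops after `/car/a` and `/car/b`. The link to `/car/z` is never visited in either case, because the detail rule has `follow=False` and so links found on detail pages are not extracted.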