Using CrawlSpider in the Scrapy Framework

水痕01

Category: Crawlers. Tags: scrapy, python

Article link: https://blog.csdn.net/kuangshp128/article/details/80304982

I. Getting to know `CrawlSpider`

1. Create a project
```
scrapy startproject <project_name>
```
2. List the available spider templates
```
scrapy genspider -l
```

3. Create a spider from the `crawl` template

```
scrapy genspider -t crawl <spider_name> <domain>
```

4. The automatically generated template looks like this

```
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WeisuenSpider(CrawlSpider):
    name = 'weisuen'
    allowed_domains = ['sohu.com']
    start_urls = ['http://sohu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
```

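II. About the parameters

1. A spider created from the `crawl` template inherits from `CrawlSpider`, not from `scrapy.Spider` as a spider created from the default template does.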

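2. A `rules` attribute is added

```
# Extract links whose URL contains the string `.shtml`.
# `allow` is a regular expression, so the dot is escaped to match a literal ".".
rules = (
    Rule(LinkExtractor(allow=r'\.shtml'), callback='parse_item', follow=True),
)
```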
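
Because `allow` is interpreted as a regular expression, an unescaped dot matches any single character, not only a literal ".". The sketch below uses only the standard `re` module to show the difference. It assumes the pattern is searched anywhere in the URL, and the sample URLs are made up for illustration.

```python
import re

# Hypothetical URLs, used only for illustration.
urls = [
    "http://www.sohu.com/a/123.shtml",   # contains a literal ".shtml"
    "http://www.sohu.com/a/xshtml/456",  # contains "xshtml" but no ".shtml"
]

loose = re.compile(r".shtml")    # "." matches any single character
strict = re.compile(r"\.shtml")  # "\." matches a literal dot only

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print(len(loose_hits))   # 2
print(len(strict_hits))  # 1
```

The unescaped pattern accepts both URLs, while the escaped one accepts only the URL that actually contains `.shtml`.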

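3. The `LinkExtractor` parameters

| Parameter | Meaning |
| --- | --- |
| `allow` | Extract only links whose URL matches the given regular expression(s) |
| `deny` | Do not extract links whose URL matches the given regular expression(s) |
| `restrict_xpaths` | XPath expression(s) that work together with `allow`: only links that are inside the selected regions and also match the regular expression are extracted |
| `allow_domains` | Domains from which links may be extracted, for example when you only want links under a particular domain |
| `deny_domains` | Domains from which links must never be extracted, for example when you need to exclude links under a particular domain |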

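4. Example usage

```
# Extract sohu.com links whose URL contains `/n` followed later by `shtml`.
# The pattern is not anchored; append `$` to require the URL to end with `shtml`.
rules = (
    Rule(LinkExtractor(allow=r'.*?/n.*?shtml', allow_domains=('sohu.com',)), callback='parse_item', follow=True),
)
```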