crawlspider structure

wtftx

Category: scrapy framework. Tags: scrapy, crawlspider, crawl

Article link: https://blog.csdn.net/wtftx/article/details/89841227

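Notes on the relevant parameters, kept here for quick reference.

A Scrapy spider can be built by subclassing one of four classes: Spider, CrawlSpider, CSVFeedSpider, and XMLFeedSpider.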

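scrapy genspider -t modulewewant filename domain.com
# modulewewant is the template name (basic, crawl, csvfeed, or xmlfeed); filename is the spider name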

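For fairly regular sites, the Spider class is enough for simple automated crawling. For more complex sites, or sites whose links are laid out irregularly, the CrawlSpider class can be used instead; it also automates both the extraction of links and the crawling of the linked pages.

A spider generated from the crawl template inherits from CrawlSpider and has a rules attribute, which is the core of the spider.
The rules attribute consists of several Rule objects; each Rule defines how links are extracted and what is done with them.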

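A Rule object takes six parameters:

LinkExtractor(…): extracts links from the response
callback='str': name of the callback applied to the pages behind the extracted links, used to extract data and fill the item
cb_kwargs: a dict of keyword arguments passed to the callback
follow=True/False: whether links found on the extracted pages should be followed further
process_links: a function that filters the extracted links
process_request: a function that filters the Request built for each link

All of these except the LinkExtractor are optional. When callback is None, the rule acts as a "springboard": the page is downloaded and its links are followed, but no data is extracted from it. This is typically used for pagination.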

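LinkExtractor takes the following parameters, which define how links are extracted:

allow='re_str': a regular expression string; only links in the response that match it are extracted.
deny='re_str': links matching this regular expression are excluded.
restrict_xpaths='xpath_str': only links inside the regions selected by the XPath expression are extracted.
restrict_css='css_str': only links inside the regions selected by the CSS expression are extracted.
allow_domains='domain_str': domains that are allowed.
deny_domains='domain_str': domains that are excluded.
tags='tag' / ['tag1', 'tag2', …]: tags to extract links from; by default links are taken from a and area tags.
attrs=['href', 'src', …]: attributes to read links from.
unique=True/False: whether extracted links are deduplicated.
process_value: a function applied to each extracted value; it runs before the allow filter.
canonicalize=True/False: defaults to False; canonicalizes each URL for duplicate checking, and is best left at the default.
deny_extensions: file extensions to ignore when extracting links.

These parameters can be combined, in which case only links that satisfy all of the conditions are extracted.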
The follow parameter:
A Boolean that controls whether links are followed. When callback is None, follow defaults to True; when a callback is set, it defaults to False. Set it explicitly as needed.

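How Rules work
For each link a Rule extracts, CrawlSpider issues a request through its own built-in parse logic. Once the page is downloaded, the response is handed to the rule's callback, which parses it and fills the item. Because CrawlSpider uses parse internally, the callback should not be named parse.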

CrawlSpider also has a parse_start_url() method for parsing the pages listed in start_urls. It is typically used in spiders with springboard rules, to parse the first page.

A small example: crawling the scrapinghub blog

Create the project

scrapy startproject shubpro
cd shubpro
scrapy genspider -t crawl shub "https://blog.scrapinghub.com/page/1"
# The URL must be wrapped in "" or left unwrapped; '' does not work, and a URL with parameters must use ""

Define items

import scrapy
class ShubproItem(scrapy.Item):
    post_name = scrapy.Field()
# For testing only; define Field attributes as needed

Define pipelines (optional)

In pipelines.py:

import json

class ShubproPipeline(object):
    def __init__(self):
        # Text mode with an explicit encoding, so str lines can be written directly
        self.file = open('shub_postname.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        # Return the item so later pipelines and feed exports still receive it
        return item

    def close_spider(self, spider):
        self.file.close()
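
The handling of Chinese output in the pipeline comes down to how json.dumps escapes non-ASCII characters. The standalone sketch below uses a made-up record to show the difference that the ensure_ascii parameter makes; it is only an illustration of the standard library behaviour.

```python
import json

record = {"post_name": "爬虫"}  # hypothetical item content

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
# ensure_ascii=False: characters are written as-is, so the file stays readable.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data with json.loads; only the on-disk representation differs.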

Write the spider file

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from shubpro.items import ShubproItem

class ShubSpider(CrawlSpider):
    name = 'shub'
    allowed_domains = ['scrapinghub.com']
    start_urls = ['https://blog.scrapinghub.com/page/1']

    rules = (
        Rule(LinkExtractor(allow=r'https://blog.scrapinghub.com/page/\d+',
                           restrict_xpaths='//div[@class="blog-pagination"]/a[@class="next-posts-link"]'), follow=True),
        Rule(LinkExtractor(allow=r'https://blog.scrapinghub.com/\w+',
                           restrict_xpaths='//div[@class="post-header"]/h2/a'), callback='parse_item', follow=False),
    )
    # The first Rule handles pagination and the second parses the posts.
    # Note: restrict_xpaths only needs to select the a node, not its @href attribute.
    def parse_item(self, response):
        item = ShubproItem()
        item['post_name'] = response.xpath('//h1/span/text()').get(default='No result')
        return item

Modify the settings.py file


BOT_NAME = 'shubpro'

SPIDER_MODULES = ['shubpro.spiders']
NEWSPIDER_MODULE = 'shubpro.spiders'

#USER_AGENT = 'shubpro (+http://www.yourdomain.com)'
# Can be changed to avoid detection by anti-crawling measures; see the earlier post

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Do not obey the robots.txt protocol
#DOWNLOAD_DELAY = 1
# Crawl delay in seconds; this site sets no limit, so it is left unset

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Whether to use cookies; can be set to False to help avoid anti-crawling measures

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'shubpro.pipelines.ShubproPipeline': 300,
#}
# Enables pipelines.py; uncomment this if the optional pipeline step above was used

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'shubpro.middlewares.ShubproDownloaderMiddleware': 543,
#}
# Uncomment this if the User-Agent is set through a downloader middleware

Output
Without a pipeline, write the output file with:

scrapy crawl shub -o shub.json [-t json]
# If the output contains Chinese, add -s FEED_EXPORT_ENCODING='utf-8'
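
The same export encoding can also be set once in settings.py instead of on every command line. A minimal fragment, assuming the project's settings.py from the previous step:

```python
# settings.py (fragment)
# Write feed exports (-o shub.json) as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```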

[Figures: output, output2]
As the figures show, 132 items were scraped in total.
