Scrapy: General-Purpose Crawlers

wwxxee

Category: Crawlers

Article link: https://blog.csdn.net/weixin_42633229/article/details/103770333

#1.CrawlSpider

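CrawlSpider is a general-purpose Spider provided by Scrapy. In it, we can specify crawl rules that control how pages are extracted; each rule is represented by a dedicated data structure, Rule. A Rule holds the configuration for extracting and following links, and the Spider uses its Rules to decide which links on the current page should be crawled further and which method parses the responses of the pages reached through them.
CrawlSpider inherits from the Spider class. It has one particularly important attribute and one particularly important method:
rules: the crawl rules, a list (or tuple) containing one or more Rule objects. Each Rule defines one crawling behavior for the site; CrawlSpider reads every Rule in rules and applies it.
parse_start_url(): an overridable method. It is called when the Request for a URL in start_urls receives its Response; it parses that Response and must return an Item object, a Request object, or an iterable of them.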

Rule is defined with the following parameters:

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

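Parameters:

link_extractor: a Link Extractor object. It determines which links are extracted from each crawled page. A Request is generated automatically for every extracted link.
LxmlLinkExtractor(
allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), restrict_css=())
– Main parameters:
– allow: a regular expression (or a list of them); only URLs that match are extracted. If empty, every link matches.
– deny: the opposite of allow; URLs that match are excluded.
– allow_domains: the domains whose links are extracted; only links on these domains are followed and turned into new Requests.
– deny_domains: the opposite of allow_domains.
– restrict_xpaths: XPath expressions that define the regions of the current page from which links are extracted; it works together with allow to filter links.
**callback:** the callback function. It is called with the Response of each link extracted by link_extractor and returns a list containing Item or Request objects. (Note: avoid using parse() as the callback. CrawlSpider uses parse() to implement its own logic, so overriding parse() will break the spider.)
**cb_kwargs:** a dict of keyword arguments passed to the callback.
**follow:** True or False. Specifies whether links extracted by this Rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
**process_links:** a processing function that is called with the list of links extracted by this Rule; mainly used to filter links.
**process_request:** called for every Request generated by this Rule so that the Request can be processed. It must return a Request or None.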
#2.Item Loader
Rule defines the crawl logic for pages, but it says nothing about how Items are extracted. That is what Item Loader is for.

class scrapy.loader.ItemLoader([item, selector, response,] **kwargs)

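Parameters:
item: an Item object, populated by calling methods such as add_xpath(), add_css(), and add_value().
selector: a Selector object, used to extract the data that fills the Item.
response: a Response object, used to construct the Selector.
Below is a comparison of conventional extraction and configurable extraction with Item Loader.
Conventional extraction: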

# Conventional extraction
        item = ChinanewsItem()
        item['title'] = response.xpath("//h1[@id='chan_newsTitle']/text()").extract_first()
        item['url'] = response.url
        item['text'] = "".join(response.xpath("//div[@id='chan_newsDetail']//text()").extract()).strip()
        # Raw string so the regex backslashes are not treated as string escapes
        item['datetime'] = response.xpath("//div[@id='chan_newsInfo']/text()").re_first(r'\d+-\d+-\d+\s\d+:\d+:\d+')
        # '来源：' is the literal "Source:" label on the page
        item['source'] = response.xpath("//div[@id='chan_newsInfo']/text()").re_first('来源：(.*)')
        # '中华网' is the site name (China.com)
        item['website'] = '中华网'
        yield item

Configurable extraction:

# Configurable extraction
        loader = ChinaLoader(item=ChinanewsItem(), response=response)
        loader.add_xpath('title', "//h1[@id='chan_newsTitle']/text()")
        loader.add_value('url', response.url)
        loader.add_xpath('text', "//div[@id='chan_newsDetail']//text()")
        # Raw string so the regex backslashes are not treated as string escapes
        loader.add_xpath('datetime', "//div[@id='chan_newsInfo']/text()", re=r'\d+-\d+-\d+\s\d+:\d+:\d+')
        # '来源：' is the literal "Source:" label on the page
        loader.add_xpath('source', "//div[@id='chan_newsInfo']/text()", re='来源：(.*)')
        loader.add_value('website', '中华网')
        yield loader.load_item()

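First, an ItemLoader object is declared and instantiated with an Item object and the Response object. ChinanewsItem() is the Item class defined in items.py, which declares the Item fields.
[Figure: items.py]

ChinaLoader() is the class defined in loaders.py. ChinaLoader inherits from NewsLoader, and NewsLoader inherits from ItemLoader, so ChinaLoader is a subclass of ItemLoader. Its implementation is shown in the figure (loaders.py).

Next, the data matched by each XPath is assigned to the title, url, text, datetime, source, and website fields, using add_xpath() or add_value() as appropriate. Finally, load_item() is called to build the Item. Because this approach is so regular, the parameters and rules can be pulled out into a separate configuration file, which makes the spider configurable.
In addition, every ItemLoader field has an Input Processor and an Output Processor. The Input Processor processes data as soon as it is received; its results are collected and kept inside the ItemLoader, but are not yet assigned to the Item. Once all data has been collected, load_item() is called: it runs each field's Output Processor on the collected values and assigns the results to the Item. That is how the Item is produced.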

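Built-in Processors:

Identity
Performs no processing and returns the data unchanged.
TakeFirst
Returns the first non-null, non-empty value in the list, similar to extract_first(). Commonly used as an Output Processor.
Join
Similar to the string join() method: concatenates the list into a single string.
Compose
A Processor built from a chain of functions. The input value is passed to the first function, its output is passed to the second, and so on; the return value of the last function is the output of the whole Processor.
MapCompose
Like Compose, but the input is an iterable: MapCompose iterates over it and passes each element through the functions in turn.
SelectJmes
Requires the jmespath library:
pip3 install jmespath
Queries JSON: given a key (a JMESPath expression), it returns the corresponding value.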

#3. Example:
#####scrapy startproject chinaNews
#####scrapy genspider -t crawl china tech.china.com
######Initial spiders/china.py:


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

Compared with a basic Spider, the generated Spider additionally defines rules, and its parsing method is named parse_item().

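###3.1 Define the Rules
Change start_urls to the starting URL.
The Spider then crawls every URL in start_urls. After receiving a Response, it applies each Rule to extract the links on that page and generates Requests for them.

The first Rule extracts news detail pages: it uses restrict_xpaths to limit the region from which links are extracted, sets parse_item() as the callback, and does not follow links further.
The second Rule handles pagination. The URLs of the different pages differ only in the number after index, so a corresponding regular expression is used to match them; this Rule has no callback and keeps following links.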
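
A quick way to check a pagination pattern before putting it into a Rule is to try it with the standard re module. The sketch below assumes the pattern is applied with search semantics (a match anywhere in the URL counts); the sample URLs are invented. It also shows why escaping the dot before html is worthwhile.

```python
import re

# Unescaped dot: "." matches any character, not just a literal dot.
loose = re.compile(r"articles/index_\d+.html")
# Escaped dot: only a literal ".html" suffix is accepted.
strict = re.compile(r"articles/index_\d+\.html")

urls = [
    "https://tech.china.com/articles/index_2.html",
    "https://tech.china.com/articles/index_12.html",
    "https://tech.china.com/articles/index.html",    # no page number
    "https://tech.china.com/articles/index_3xhtml",  # malformed on purpose
]

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print(len(loose_hits), "URLs match the loose pattern")
print(len(strict_hits), "URLs match the strict pattern")
```

Neither pattern matches the first list page (index.html), because both require an underscore and a number after index; that page is reached through start_urls instead.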

###3.2 Define the Item (figure: items.py)
###3.3 Define the ItemLoader (figure: loaders.py)
###3.4 Define the parsing method (configurable extraction)
###3.5 Define the pipelines
###3.6 Define the middlewares
###3.7 Define the settings

######china.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# Import the Item
from chinaNews.items import ChinanewsItem
# Import the ItemLoader
from chinaNews.loaders import ChinaLoader

class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['https://tech.china.com/articles/index.html']

    rules = (
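        # Detail pages: extract from the article list only, parse, do not follow
        Rule(LinkExtractor(allow=r'article/.*\.html', restrict_xpaths='//div[@class="item_con"]'), callback='parse_item', follow=False),
        # Pagination: no callback, keep following (dot escaped to match a literal ".html")
        Rule(LinkExtractor(allow=r'articles/index_\d+\.html', restrict_xpaths='//div[@class="pages"]'), follow=True),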
    )

    def parse_item(self, response):
        # Conventional extraction, kept for comparison:
        # item = ChinanewsItem()
        # item['title'] = response.xpath("//h1[@id='chan_newsTitle']/text()").extract_first()
        # item['url'] = response.url
        # item['text'] = "".join(response.xpath("//div[@id='chan_newsDetail']//text()").extract()).strip()
        # item['datetime'] = response.xpath("//div[@id='chan_newsInfo']/text()").re_first(r'\d+-\d+-\d+\s\d+:\d+:\d+')
        # item['source'] = response.xpath("//div[@id='chan_newsInfo']/text()").re_first('来源：(.*)')
        # item['website'] = '中华网'
        # yield item

        # Configurable extraction
        loader = ChinaLoader(item=ChinanewsItem(), response=response)
        loader.add_xpath('title', "//h1[@id='chan_newsTitle']/text()")
        loader.add_value('url', response.url)
        loader.add_xpath('text', "//div[@id='chan_newsDetail']//text()")
        # Raw string so the regex backslashes are not treated as string escapes
        loader.add_xpath('datetime', "//div[@id='chan_newsInfo']/text()", re=r'\d+-\d+-\d+\s\d+:\d+:\d+')
        # '来源：' is the literal "Source:" label on the page
        loader.add_xpath('source', "//div[@id='chan_newsInfo']/text()", re='来源：(.*)')
        loader.add_value('website', '中华网')

        yield loader.load_item()

[Figure: project structure]

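The above makes the crawler semi-generic.
A fully generic setup additionally requires extracting the common configuration (variables and attributes) so that it lives outside the spider code.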
