Scrapy CrawlSpider Notes
------------------------------------ Problem ------------------------------------
- In the earlier code we spent a large part of the time hunting for the next-page URL or the detail-page URLs. Can that step be made simpler?
- Idea:
  - 1. Extract the URLs of all the matching tags from the response.
  - 2. Automatically build the Request objects and send them to the engine.
- Goal:
  - Learn how to use CrawlSpider by writing a spider with it.
- Command to generate a CrawlSpider:
  - scrapy genspider -t crawl <spider_name> <domain>
  - Run it in a terminal inside a Scrapy project to generate the CrawlSpider (a sketch of the generated skeleton follows).
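For orientation, the file produced by `scrapy genspider -t crawl` looks roughly like the sketch below; the spider name `myspider` and domain `example.com` are placeholders, and the exact template varies slightly between Scrapy versions.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyspiderSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # One Rule per kind of link to extract; allow=r'Items/' is just the
    # template's placeholder regex.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Fill this in with the actual extraction logic.
        item = {}
        return item
```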
--------------------- Functions used in a Scrapy CrawlSpider project ---------------------
- LinkExtractor (link extractor)
  - With LinkExtractor the programmer no longer has to pick out the desired URLs and send the requests by hand.
  - That work is handed to LinkExtractor: it finds every URL on the crawled pages that matches the rules, so the crawl proceeds automatically.
Parameters (a usage sketch follows this list):
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: denied URLs. Any URL matching this regular expression is skipped.
- allow_domains: allowed domains. Only URLs belonging to the domains listed here are extracted.
- deny_domains: denied domains. URLs belonging to the domains listed here are never extracted.
- restrict_xpaths: restricting XPaths. Limits extraction to the matched regions and filters links together with allow.
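A minimal sketch of a LinkExtractor used on its own, assuming an illustrative detail-page regex, domain, and container XPath (none of them taken from a real page); extract_links() returns Link objects with .url and .text:

```python
from scrapy.linkextractors import LinkExtractor

# The regex, domain, and XPath below are illustrative assumptions.
extractor = LinkExtractor(
    allow=r'shiwenv_\w+\.aspx',                 # keep URLs matching this regex
    deny=r'/user/',                             # drop anything under /user/
    allow_domains=['so.gushiwen.cn'],           # only this domain
    restrict_xpaths=['//div[@class="left"]'],   # only look inside this region
)

def show_links(response):
    # extract_links() returns scrapy.link.Link objects
    for link in extractor.extract_links(response):
        print(link.url, link.text)
```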
- Rule class
Defines one crawling rule for the spider.
Main parameters (a short sketch follows this list):
- link_extractor: a LinkExtractor object that defines which links this rule extracts.
- callback: the callback to run for URLs matching this rule. CrawlSpider itself uses parse internally, so do not name your own callback parse.
- follow: whether links extracted from the response by this rule should be followed in turn.
- process_links: the links found by link_extractor are passed to this function, which can filter out links that should not be crawled.
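A minimal sketch of a Rule combining these parameters; the drop_draft_links helper and the default_\d+\.aspx pattern are illustrative assumptions, not part of the project below. In a real spider this object would sit inside the rules tuple of a CrawlSpider subclass.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def drop_draft_links(links):
    # process_links hook: receives the Link objects the extractor found and
    # returns only the ones that should actually be requested.
    return [link for link in links if 'draft' not in link.url]

rule = Rule(
    LinkExtractor(allow=r'default_\d+\.aspx'),
    callback='parse_item',           # name of a method on the spider (not 'parse')
    follow=True,                     # keep extracting links from matched pages
    process_links=drop_draft_links,  # filter before the requests are scheduled
)
```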
- Example
------------------------------ Case study: code ------------------------------
Requirements:
- 1. Crawl the gushiwen.org classical-poetry site.
- 2. Handle pagination.
- 3. From each detail page extract the author, the dynasty, the translation of the poem, and so on.
- Step 1: create the Scrapy project
  - scrapy startproject gs
- Step 2: change into the newly created project folder
  - cd gs
- Step 3: create the crawl spider
  - scrapy genspider -t crawl cgs gushiwen.org
- Step 4: write the spider code and configure the Scrapy settings
- --------- Spider code ---------
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CgsSpider(CrawlSpider):
    name = 'cgs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/default_1.aspx']

    # Rule is a class that defines how URLs are extracted.
    # LinkExtractor is the link extractor.
    # allow=r'...' holds the URL regex (the key part); callback is the callback
    # function; follow=True keeps following (e.g. on to the next page).
    rules = (
        # List pages: switching the pattern to the .cn domain keeps the data coming
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        content = response.xpath('//div[@class="contyishang"]/p/text()').extract()
        detail = ''.join(content).strip()
        item['detail_content'] = detail
        print(item)
        return item
```
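The parse_item above only collects the annotation/translation block. A hedged sketch of a richer version that also pulls the title, author, and dynasty, meant as a drop-in method for CgsSpider; the //p[@class="source"] selector and the positions of the author and dynasty links are assumptions about the detail-page markup and would need to be checked against the real HTML.

```python
# Hedged sketch of a richer parse_item for the spider above; the
# '//p[@class="source"]' selector is an assumption, not a verified selector.
def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//h1/text()').get()
    source = response.xpath('//p[@class="source"]//a/text()').getall()
    item['author'] = source[0] if source else None            # assumed position
    item['dynasty'] = source[1] if len(source) > 1 else None  # assumed position
    content = response.xpath('//div[@class="contyishang"]/p/text()').getall()
    item['detail_content'] = ''.join(content).strip()
    return item
```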
- --------- Scrapy CrawlSpider settings code ---------
```python
# Scrapy settings for gs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gs'

SPIDER_MODULES = ['gs.spiders']
NEWSPIDER_MODULE = 'gs.spiders'

LOG_LEVEL = 'WARNING'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'gs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'gs.middlewares.GsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'gs.middlewares.GsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'gs.pipelines.GsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
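Relative to the default settings template, only three things are changed here: LOG_LEVEL = 'WARNING' quiets the log output, ROBOTSTXT_OBEY = False stops Scrapy from honouring robots.txt, and DEFAULT_REQUEST_HEADERS supplies a browser-like User-Agent so the requests look like ordinary browser traffic. Everything else is left commented out, as generated by scrapy startproject.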
- --------- Scrapy CrawlSpider launch code ---------
```python
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'cgs'])
```
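This snippet is commonly saved as a small script in the project root (for example a file named start.py, a name chosen here only for illustration), so the spider can be launched with python start.py or run directly from an IDE instead of typing scrapy crawl cgs in the terminal.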
- Summary:
```python
rules = (
    # List pages: switching the pattern to the .cn domain keeps the data coming
    Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
    # Detail pages
    Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
)
```
- The allow keyword argument takes a regular expression for the URLs; the spider crawls every URL that matches the pattern.
- follow=True keeps following the extracted links, so the spider pages through the list to the end; it is typically used on the list-page rule.
- callback names a callback function; the content extraction is done inside that callback.
- The two Rules split the work: one extracts the list pages, the other extracts the detail-page content.