Scrapy CrawlSpider Notes
------------------------------------ Problem ------------------------------------
- In the earlier code we spent a large part of the time hunting for the next-page URL or the detail-page URLs. Can that step be made simpler?
- Idea:
  - 1. Extract the URLs of all the matching tags from the response.
  - 2. Automatically build the Request objects and send them to the engine.
- Goal:
  - Learn how to use CrawlSpider by writing a spider with it.
- Command to generate a CrawlSpider:
  - scrapy genspider -t crawl <spider_name> <domain>
  - Run it in a terminal inside a Scrapy project to generate the CrawlSpider (a sketch of the generated skeleton follows).
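For orientation, the file produced by `scrapy genspider -t crawl` looks roughly like the sketch below; the spider name `myspider` and domain `example.com` are placeholders, and the exact template varies slightly between Scrapy versions.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyspiderSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # One Rule per kind of link to extract; allow=r'Items/' is just the
    # template's placeholder regex.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Fill this in with the actual extraction logic.
        item = {}
        return item
```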
--------------------- Functions used in a Scrapy CrawlSpider project ---------------------
- LinkExtractor (link extractor)
  - With LinkExtractor the programmer no longer has to pick out the desired URLs and send the requests by hand.
  - That work is handed to LinkExtractor: it finds every URL on the crawled pages that matches the rules, so the crawl proceeds automatically.
Parameters (a usage sketch follows this list):
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: denied URLs. Any URL matching this regular expression is skipped.
- allow_domains: allowed domains. Only URLs belonging to the domains listed here are extracted.
- deny_domains: denied domains. URLs belonging to the domains listed here are never extracted.
- restrict_xpaths: restricting XPaths. Limits extraction to the matched regions and filters links together with allow.
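A minimal sketch of a LinkExtractor used on its own, assuming an illustrative detail-page regex, domain, and container XPath (none of them taken from a real page); extract_links() returns Link objects with .url and .text:

```python
from scrapy.linkextractors import LinkExtractor

# The regex, domain, and XPath below are illustrative assumptions.
extractor = LinkExtractor(
    allow=r'shiwenv_\w+\.aspx',                 # keep URLs matching this regex
    deny=r'/user/',                             # drop anything under /user/
    allow_domains=['so.gushiwen.cn'],           # only this domain
    restrict_xpaths=['//div[@class="left"]'],   # only look inside this region
)

def show_links(response):
    # extract_links() returns scrapy.link.Link objects
    for link in extractor.extract_links(response):
        print(link.url, link.text)
```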
- Rule class
Defines one crawling rule for the spider.
Main parameters (a short sketch follows this list):
- link_extractor: a LinkExtractor object that defines which links this rule extracts.
- callback: the callback to run for URLs matching this rule. CrawlSpider itself uses parse internally, so do not name your own callback parse.
- follow: whether links extracted from the response by this rule should be followed in turn.
- process_links: the links found by link_extractor are passed to this function, which can filter out links that should not be crawled.
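A minimal sketch of a Rule combining these parameters; the drop_draft_links helper and the default_\d+\.aspx pattern are illustrative assumptions, not part of the project below. In a real spider this object would sit inside the rules tuple of a CrawlSpider subclass.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def drop_draft_links(links):
    # process_links hook: receives the Link objects the extractor found and
    # returns only the ones that should actually be requested.
    return [link for link in links if 'draft' not in link.url]

rule = Rule(
    LinkExtractor(allow=r'default_\d+\.aspx'),
    callback='parse_item',           # name of a method on the spider (not 'parse')
    follow=True,                     # keep extracting links from matched pages
    process_links=drop_draft_links,  # filter before the requests are scheduled
)
```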
- Example
------------------------------ Case study: code ------------------------------
Requirements:
- 1. Crawl the gushiwen.org classical-poetry site.
- 2. Handle pagination.
- 3. From each detail page extract the author, the dynasty, the translation of the poem, and so on.
- Step 1: create the Scrapy project
  - scrapy startproject gs
- Step 2: change into the newly created project folder
  - cd gs
- Step 3: create the crawl spider
  - scrapy genspider -t crawl cgs gushiwen.org
- Step 4: write the spider code and configure the Scrapy settings
- --------- Spider code ---------
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CgsSpider(CrawlSpider):
    name = 'cgs'
    allowed_domains = ['gushiwen.org', 'gushiwen.cn']
    start_urls = ['http://gushiwen.org/default_1.aspx']

    # Rule is a class that defines how URLs are extracted.
    # LinkExtractor is the link extractor.
    # allow=r'...' holds the URL regex (the key part); callback is the callback
    # function; follow=True keeps following (e.g. on to the next page).
    rules = (
        # List pages: switching the pattern to the .cn domain keeps the data coming
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        content = response.xpath('//div[@class="contyishang"]/p/text()').extract()
        detail = ''.join(content).strip()
        item['detail_content'] = detail
        print(item)
        return item
```
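The parse_item above only collects the annotation/translation block. A hedged sketch of a richer version that also pulls the title, author, and dynasty, meant as a drop-in method for CgsSpider; the //p[@class="source"] selector and the positions of the author and dynasty links are assumptions about the detail-page markup and would need to be checked against the real HTML.

```python
# Hedged sketch of a richer parse_item for the spider above; the
# '//p[@class="source"]' selector is an assumption, not a verified selector.
def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//h1/text()').get()
    source = response.xpath('//p[@class="source"]//a/text()').getall()
    item['author'] = source[0] if source else None            # assumed position
    item['dynasty'] = source[1] if len(source) > 1 else None  # assumed position
    content = response.xpath('//div[@class="contyishang"]/p/text()').getall()
    item['detail_content'] = ''.join(content).strip()
    return item
```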
- --------- Scrapy CrawlSpider settings code ---------
```python
# Scrapy settings for gs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gs'

SPIDER_MODULES = ['gs.spiders']
NEWSPIDER_MODULE = 'gs.spiders'

LOG_LEVEL = 'WARNING'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'gs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'gs.middlewares.GsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'gs.middlewares.GsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'gs.pipelines.GsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
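Relative to the default settings template, only three things are changed here: LOG_LEVEL = 'WARNING' quiets the log output, ROBOTSTXT_OBEY = False stops Scrapy from honouring robots.txt, and DEFAULT_REQUEST_HEADERS supplies a browser-like User-Agent so the requests look like ordinary browser traffic. Everything else is left commented out, as generated by scrapy startproject.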
- --------- Scrapy CrawlSpider launch code ---------
```python
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'cgs'])
```
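This snippet is commonly saved as a small script in the project root (for example a file named start.py, a name chosen here only for illustration), so the spider can be launched with python start.py or run directly from an IDE instead of typing scrapy crawl cgs in the terminal.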
- Summary:
```python
rules = (
    # List pages: switching the pattern to the .cn domain keeps the data coming
    Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_\d+.aspx'), follow=True),
    # Detail pages
    Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+.aspx'), callback='parse_item'),
)
```
- The allow keyword argument takes a regular expression for the URLs; the spider crawls every URL that matches the pattern.
- follow=True keeps following the extracted links, so the spider pages through the list to the end; it is typically used on the list-page rule.
- callback names a callback function; the content extraction is done inside that callback.
- The two Rules split the work: one extracts the list pages, the other extracts the detail-page content.