Scrapy Crawler Framework (6) --- Scrapy Incremental Crawler

baoding4359

Tags: python, crawler, database

Original link: http://www.cnblogs.com/TMMM/p/11370982.html

Incremental crawler

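Scrapy ships several spider templates, all of which extend the basic template (for example the crawl template and the feed templates xmlfeed and csvfeed). The most commonly used is crawl, i.e. CrawlSpider, which this article calls an incremental crawler.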

The basic spider is designed to take its starting URLs from start_urls and use them to drive the engine.

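Incremental crawler: starting from the URLs in start_urls, it requests each page, extracts the URLs in the response that match its rules, and puts them back into the scheduler's queue. It then extracts further URLs from those new pages and schedules them in turn, and so on until no unvisited matching URLs remain.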
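The loop described above can be sketched with the standard library alone. This is a simplified illustration, not how Scrapy is implemented: the `PAGES` dictionary is a made-up in-memory site that stands in for real downloads, and Scrapy's actual scheduler, duplicate filter, and asynchronous downloader are considerably more involved.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> HTML body (stands in for real downloads).
PAGES = {
    "/book/1002.html": '<a href="/book/1002_2.html">2</a> <a href="/about.html">about</a>',
    "/book/1002_2.html": '<a href="/book/1002_3.html">3</a> <a href="/book/1002.html">1</a>',
    "/book/1002_3.html": '<a href="/book/1002_2.html">2</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')            # naive link extraction
ALLOW_RE = re.compile(r"/book/1002_[1-6]\.html")   # same pattern as the article's rule


def crawl(start_urls):
    """Fetch a page, extract allowed links, and queue the ones not seen before."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")
        visited.append(url)
        for link in LINK_RE.findall(html):
            # Only links matching the rule and not yet scheduled are queued.
            if ALLOW_RE.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl(["/book/1002.html"])
print(visited)
```

The `seen` set is what makes the crawl terminate: pages link back to each other, but each URL is scheduled only once.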

Spider file spiders/dushu.py: introducing the incremental crawler

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor extracts new URLs from a page according to a set of rules
from scrapy.linkextractors import LinkExtractor
# CrawlSpider: the rule-driven ("incremental") spider.
# Rule: a rule object that requests the URLs matched by its LinkExtractor and names the callback
from scrapy.spiders import CrawlSpider, Rule


class BookSpider(CrawlSpider):  # inherits from CrawlSpider
    name = 'book'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1002.html']

    # rules: a tuple of Rule objects; each Rule matches and requests a certain set of URLs
    #   callback: the callback for the responses, written as a string
    #   Matching options of LinkExtractor:
    #       allow: a regular expression the URL must match
    #       restrict_xpaths: only look for links inside the regions selected by this XPath
    #       restrict_css: only look for links inside the regions selected by this CSS selector
    rules = (  # requests are issued for the URLs the extractor matches
        # match by regular expression
        Rule(LinkExtractor(allow=r'/book/1002_[1-6]\.html'), callback='parse_item', follow=True),
        # match by XPath
        # Rule(LinkExtractor(restrict_xpaths='//div[@class="pages"]/a'), callback='parse_item', follow=True),
        # match by CSS
        # Rule(LinkExtractor(restrict_css='.pages a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
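To see which links the `allow` pattern above would accept, the regular expression can be tried on its own. LinkExtractor searches its `allow` patterns against each absolute URL, so `re.search` is a reasonable approximation here; the candidate URLs below are made up for illustration.

```python
import re

# The same pattern the Rule passes to LinkExtractor(allow=...).
ALLOW = re.compile(r"/book/1002_[1-6]\.html")

# Hypothetical URLs a listing page might link to.
candidates = [
    "https://www.dushu.com/book/1002.html",      # the start URL itself
    "https://www.dushu.com/book/1002_1.html",
    "https://www.dushu.com/book/1002_6.html",
    "https://www.dushu.com/book/1002_7.html",    # page number outside [1-6]
    "https://www.dushu.com/book/1002_10.html",   # two digits: "\." fails after "1"
    "https://www.dushu.com/book/1003_2.html",    # different category
]

matched = [url for url in candidates if ALLOW.search(url)]
print(matched)
```

Note that the start URL does not match the pattern. In a CrawlSpider, responses for start_urls are handled by `parse_start_url` (which returns nothing by default) rather than by the rule's callback, so `parse_item` only sees the paginated pages unless `parse_start_url` is overridden.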

Parsing

items.py defines the model for the data to scrape

import scrapy


class DushuproItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()
    content = scrapy.Field()      # book summary
    author_info = scrapy.Field()  # author bio
    mulu = scrapy.Field()         # table of contents

spiders/dushu.py parses the data

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor extracts new URLs from a page according to a set of rules
from scrapy.linkextractors import LinkExtractor
# CrawlSpider: the rule-driven ("incremental") spider.
# Rule: a rule object that requests the URLs matched by its LinkExtractor and names the callback
from scrapy.spiders import CrawlSpider, Rule
from DushuPro.items import DushuproItem


class BookSpider(CrawlSpider):  # inherits from CrawlSpider
    name = 'book'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1002.html']

    # rules: a tuple of Rule objects; each Rule matches and requests a certain set of URLs
    #   callback: the callback for the responses, written as a string
    #   Matching options of LinkExtractor:
    #       allow: a regular expression the URL must match
    #       restrict_xpaths: only look for links inside the regions selected by this XPath
    #       restrict_css: only look for links inside the regions selected by this CSS selector
    rules = (  # requests are issued for the URLs the extractor matches
        # match by regular expression
        Rule(LinkExtractor(allow=r'/book/1002_[1-6]\.html'), callback='parse_item', follow=True),
        # match by XPath
        # Rule(LinkExtractor(restrict_xpaths='//div[@class="pages"]/a'), callback='parse_item', follow=True),
        # match by CSS
        # Rule(LinkExtractor(restrict_css='.pages a'), callback='parse_item', follow=True),
    )

    # Parse the first-level (listing) page
    def parse_item(self, response):
        # Find every book on the listing page
        book_list = response.xpath('//div[@class="bookslist"]/ul/li')
        for book in book_list:
            item = DushuproItem()
            item['title'] = book.xpath('.//h3//text()').extract_first()  # first match only
            item['author'] = "".join(book.xpath('.//p[1]//text()').extract())  # join all matches
            # The item is not complete yet; the remaining fields are on the book's detail page

            # Link to the second-level (detail) page; skip the book if there is none
            href = book.xpath('.//h3/a/@href').extract_first()
            if not href:
                continue
            next_url = response.urljoin(href)
            # Schedule a new request for the detail page and pass the partial item along in meta
            yield scrapy.Request(url=next_url, callback=self.parse_info, meta={"item": item})
            # meta is a dict attached to the Request. When the downloader has fetched the URL,
            # the same dict is available as response.meta in the callback, so parse_info
            # can pick up the item started here.

    # Parse the detail page and fill in the rest of the item from the listing page
    def parse_info(self, response):
        # print(response.meta)         # response.meta carries the item from the listing page
        item = response.meta["item"]   # continue filling the item passed from parse_item
        item["price"] = response.xpath("//span[@class='num']/text()").extract_first()
        publisher = response.xpath("//tr[2]//td[2]/a/text()").extract()
        item["publisher"] = publisher[0] if publisher else ""
        # Guard the indexing so that a page with missing sections does not raise IndexError
        texts = response.xpath("//div[@class='text txtsummary']//text()").extract()
        item["content"] = texts[0] if len(texts) > 0 else ""
        item["author_info"] = texts[1] if len(texts) > 1 else ""
        sections = response.xpath("//div[contains(@class,'text txtsummary')]")
        item["mulu"] = sections[2].xpath(".//text()").extract() if len(sections) > 2 else []

        yield item
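The detail-page URL is built from an `href` taken out of the listing page. A small standard-library sketch shows how such links resolve against the page they were found on; Scrapy's `response.urljoin` builds on the same `urljoin` function. The `href` values below are hypothetical and only illustrate the cases a real page might contain.

```python
from urllib.parse import urljoin

# The listing page the links were extracted from.
page_url = "https://www.dushu.com/book/1002_2.html"

# Hypothetical href values: root-relative, page-relative, absolute, and missing.
hrefs = ["/book/123456/", "detail/7.html", "https://www.dushu.com/book/99/", None]


def absolute_links(base, hrefs):
    """Resolve each href against the page URL, skipping missing values."""
    return [urljoin(base, href) for href in hrefs if href]


links = absolute_links(page_url, hrefs)
for link in links:
    print(link)
```

Plain string concatenation with the site root only handles the root-relative case and raises a `TypeError` when the XPath finds no `href`, which is why resolving against the response URL and skipping empty values is the safer choice.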

Storage

pipelines.py stores the data

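import json

import redis


class DushuproPipeline(object):

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.rds = redis.StrictRedis(host="www.fanjianbo.com", port=6379, db=3)

    def process_item(self, item, spider):
        # Redis stores strings/bytes, so serialize the item to JSON before pushing it
        self.rds.lpush("books", json.dumps(dict(item), ensure_ascii=False))
        return item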
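A Redis list holds strings or bytes, and recent versions of redis-py reject values that are not bytes, str, int, or float, so an item has to be serialized before `lpush`. The sketch below shows one way to do that with JSON and how a consumer could read the records back. `FakeListStore` is a hypothetical in-memory stand-in, not a real Redis client, and the sample book is made-up data.

```python
import json


class FakeListStore:
    """Hypothetical in-memory stand-in for a Redis list (not a real Redis client)."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Mimic the client's type restriction on stored values.
        if not isinstance(value, (bytes, str, int, float)):
            raise TypeError("value must be bytes, str, int or float")
        self.lists.setdefault(key, []).insert(0, value)

    def lrange(self, key, start, end):
        # Redis ranges are inclusive; -1 means "through the last element".
        items = self.lists.get(key, [])
        stop = None if end == -1 else end + 1
        return items[start:stop]


def serialize_item(item):
    """Turn a dict-like item into a JSON string, keeping non-ASCII text readable."""
    return json.dumps(dict(item), ensure_ascii=False)


store = FakeListStore()
book = {"title": "示例书名", "author": "作者", "price": "39.00"}  # made-up sample item
store.lpush("books", serialize_item(book))

# A consumer reads the raw strings back and decodes them.
restored = [json.loads(raw) for raw in store.lrange("books", 0, -1)]
print(restored)
```

The pipeline only runs if it is enabled in settings.py. Assuming the project package is DushuPro, as the spider's import suggests, that would be an `ITEM_PIPELINES` entry such as `'DushuPro.pipelines.DushuproPipeline': 300`.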

Reposted from: https://www.cnblogs.com/TMMM/p/11370982.html
