3.1 The Scrapy Framework -- Spiders in Detail


1. The Spider's Responsibilities
# A spider is mainly responsible for two things:
  1. Defining the actions for crawling a site
  2. Parsing response data: extracting target data and passing it to a pipeline for persistent storage, or collecting target URLs for the next round of requests

# Boilerplate code of a newly created spider file:
import scrapy

class DddSpider(scrapy.Spider):
    name = 'ddd'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass
13 
# Notes on the boilerplate spider code above, relating to the two responsibilities:
  1. By default, Scrapy handles the start_urls requests itself: once the project starts, the framework automatically makes the first request to each address in start_urls. This is what we call the action of crawling a site. If several custom requests are needed, you can comment out start_urls and define your own start_requests() method; a detailed example follows later.
  2. The response to each start_urls request is automatically passed to the parse() method as the response argument. parse() is where the response data is parsed; any parsing library can be called here, but usually Scrapy's Selector is used together with XPath expressions or CSS selectors.
  3. The data parsed out in parse() generally falls into two cases (see the sketch below):
    (1). Target data: instantiate an item object in parse() and return it, so that a pipeline can persist it.
    (2). Target URLs: return a Request object with a parsing callback bound to it, continuing the request-parse loop.
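A minimal sketch of a parse() method that handles both cases at once, written against the public practice site quotes.toscrape.com; the spider name, field names, and XPath expressions below are illustrative and not part of the original example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Case (1): target data -- yield one dict per quote on the page;
        # Scrapy passes yielded dicts/items on to the pipelines
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
            }
        # Case (2): target URL -- yield a Request for the next page,
        # bound to this same parse() callback, continuing the loop
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)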



2. The Spider Class in Detail
(1). Basic attributes of the Spider class
(1). name: the spider's name. It identifies the spider, and no two spiders may share a name, so it is usually named after the site's domain. (Important)

(2). allowed_domains: the domains allowed to be crawled. Optional; links outside this range will not be followed.

(3). start_urls: the starting URLs. When no custom start_requests() is defined, the framework starts crawling from start_urls by default. (Important)

(4). custom_settings: a dict of settings specific to this spider. These settings override the project-wide settings and must be defined as a class variable.

(5). crawler: set by the from_crawler() method; it is the Crawler object this spider class is bound to, through which project information such as the settings can be accessed.
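A minimal sketch of these last two attributes in use; the spider name, URL, and setting values are hypothetical:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com/']

    # Class variable: these settings apply to this spider only and
    # override the project-wide settings
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 4,
    }

    def parse(self, response):
        # self.crawler is bound by from_crawler(); its settings attribute
        # exposes the merged configuration (also reachable as self.settings)
        delay = self.crawler.settings.get('DOWNLOAD_DELAY')
        self.logger.info('effective DOWNLOAD_DELAY: %s', delay)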


(2). Common Spider methods
(1). start_requests(): generates the initial requests and must return an iterable. By default it builds a GET Request for each URL in start_urls. If you want to access a site with a POST request at startup, override this method and send the POST with FormRequest (see the sketch below).

(2). parse(): the default callback when a Response has no callback specified. It processes the Response, extracts the desired data and the next requests from it, and returns them. It must return an iterable of Request and/or item objects.

(3). closed(): called when the spider closes; this is where resource-releasing and other cleanup operations are usually defined.
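A minimal sketch of overriding start_requests() to send a POST at startup with FormRequest; the login URL and form fields are hypothetical:

import scrapy
from scrapy import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        # FormRequest sends a POST request carrying the given form data
        yield FormRequest(
            url='http://www.example.com/login',           # hypothetical endpoint
            formdata={'user': 'john', 'pass': 'secret'},  # hypothetical fields
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('login response status: %s', response.status)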



3. A Simple Example
# Requirement: crawl the detailed information of the newest listings under the "hot-selling models" section of the Guazi used-car homepage, including the listing title, registration date, displayed mileage, engine displacement, and the seller's asking price, and store the data in a MongoDB database.

# Approach:
  1. Implement the start_requests() method manually. Be sure to build complete headers including the cookie, otherwise the target data cannot be requested. The first request fetches the listing page, so its callback must parse out the detail-page URLs.
  2. In the parse() method, after obtaining the detail-page URLs, build a second round of Requests for the detail pages and specify a second callback. That callback extracts the title, mileage, displacement, and other fields, builds an item object, stores the data in it, and passes it to the pipeline.
  3. In the pipeline, store the received item objects.


# The spider file defines the crawling actions: two rounds of requests,
# each with its own parsing callback
import scrapy
from scrapy import Request
from guazicar.items import GuazicarItem

class GzcSpider(scrapy.Spider):
    name = 'gzc'

    # Headers needed for the requests; the cookie is mandatory and must be
    # placed inside the headers
    headers = {
        "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        "Host": "www.guazi.com",
        "Pragma": "no-cache",
        "Referer": "https://www.guazi.com/bj/sell/?clueS=01",
        "Upgrade-Insecure-Requests": "1",
        'Cookie': 'uuid=06d6e503-7ae0-4a0b-c210-3ff1abee2e70; ganji_uuid=9858880605988539539234; lg=1; antipas=8246994L24y6400058190652V56; clueSourceCode=10103000312%2300; user_city_id=12; sessionid=28f2424e-eddf-4d6c-a175-f0a9bff05a99; guazi_tracker_session_pageload_snapshot=%7B%22common%22%3A%7B%22line%22%3A%22c2c%22%2C%22platform%22%3A%22web%22%2C%22pagetype%22%3A%22sell%22%7D%2C%22trackings%22%3A%5B%7B%22guid%22%3A%2206d6e503-7ae0-4a0b-c210-3ff1abee2e70%22%2C%22userid%22%3A%22-%22%2C%22sessionid%22%3A%2228f2424e-eddf-4d6c-a175-f0a9bff05a99%22%2C%22page%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fsell%2F%3FclueS%3D01%22%2C%22referer%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fhuracan%2Fr19%2F%22%2C%22referrer%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fhuracan%2Fr19%2F%22%2C%22city%22%3A%22bj%22%2C%22landing%22%3A%220%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22tbmkbturl%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%2210103000312%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22version%22%3A1%2C%22client_ab%22%3A%22-%22%2C%22os%22%3A%22Windows%2010%22%2C%22screen_resolution%22%3A%221536%20x%20864%22%2C%22view_port%22%3A%221519%20x%20722%22%2C%22is_native%22%3A%220%22%2C%22tracker_version%22%3A%221.6.11%22%2C%22clientTime%22%3A%221560386906116%22%2C%22tracking_type%22%3A%22pageload%22%2C%22pageid%22%3A%2203ed421b-71d9-40c0-cb7d-c6388945fa48%22%7D%5D%7D; cainfo=%7B%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22tbmkbturl%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%2210103000312%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%2206d6e503-7ae0-4a0b-c210-3ff1abee2e70%22%2C%22sessionid%22%3A%2228f2424e-eddf-4d6c-a175-f0a9bff05a99%22%2C%22ca_b%22%3A%22-%22%2C%22ca_a%22%3A%22-%22%7D; preTime=%7B%22last%22%3A1560386825%2C%22this%22%3A1560256593%2C%22pre%22%3A1560256593%7D; _gl_tracker=%7B%22ca_source%22%3A%22-%22%2C%22ca_name%22%3A%22-%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_id%22%3A%22-%22%2C%22ca_s%22%3A%22self%22%2C%22ca_n%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22sid%22%3A32099126900%7D; cityDomain=bj',
    }

    # Implement the first request manually, fetching the listing page
    def start_requests(self):
        yield Request(url='https://www.guazi.com/bj/', callback=self.parse, method='GET', headers=self.headers)

    # Callback for the second request: extract the target data from the detail
    # page, build an item object, and pass it to the pipeline
    def detail_parse(self, response):
        title = response.selector.xpath('//div[@class="product-textbox"]/h2[@class="titlebox"]/text()')
        time = response.selector.xpath('//ul[@class="assort clearfix"]/li[1]/span/text()')
        mileage = response.selector.xpath('//ul[@class="assort clearfix"]/li[2]/span/text()')
        swept_volume = response.selector.xpath('//ul[@class="assort clearfix"]/li[3]/span/text()')
        price = response.selector.xpath('//div[@class="pricebox js-disprice"]/span[@class="pricestype"]/text()')
        item = GuazicarItem()
        item["title"] = title.extract_first()
        item["time"] = time.extract_first()
        item["mileage"] = mileage.extract_first()
        item["swept_volume"] = swept_volume.extract_first()
        item["price"] = price.extract_first()
        return item

    # Callback for the first request: parse out the detail-page URLs
    # and issue the second round of requests
    def parse(self, response):
        li_list = response.selector.xpath('//ul[@class="carlist clearfix"]/li')
        for li in li_list:
            url = li.xpath('./a/@href').extract_first()
            full_url = 'https://www.guazi.com' + url
            yield Request(url=full_url, callback=self.detail_parse, headers=self.headers)

# The items file defines the data structure the target data is stored in
import scrapy

class GuazicarItem(scrapy.Item):
    title = scrapy.Field()         # listing title
    time = scrapy.Field()          # registration date
    mileage = scrapy.Field()       # displayed mileage
    swept_volume = scrapy.Field()  # engine displacement
    price = scrapy.Field()         # asking price


# The pipeline file persists the data into a MongoDB database
import pymongo

class GuazicarPipeline(object):

    mongo_client = pymongo.MongoClient('127.0.0.1:27017')
    mongo_db = mongo_client['guazicai']

    def process_item(self, item, spider):
        # Use the pipeline class name as the collection name
        name = self.__class__.__name__
        self.mongo_db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.mongo_client.close()
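Hardcoding the MongoDB address as class attributes works, but a common variant reads it from the project settings through from_crawler(), tying back to the crawler attribute described earlier. A minimal sketch, assuming hypothetical MONGO_URI and MONGO_DB entries in settings.py:

import pymongo

class GuazicarPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db_name = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection info from the project settings,
        # with fallbacks matching the hardcoded values above
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', '127.0.0.1:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'guazicai'),
        )

    def open_spider(self, spider):
        self.mongo_client = pymongo.MongoClient(self.mongo_uri)
        self.mongo_db = self.mongo_client[self.mongo_db_name]

    def process_item(self, item, spider):
        self.mongo_db[self.__class__.__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.mongo_client.close()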
# Settings changed in the settings file
# COOKIES_ENABLED is disabled because the cookie is sent manually via headers
COOKIES_ENABLED = False
ITEM_PIPELINES = {
    'guazicar.pipelines.GuazicarPipeline': 300,
}


Reposted from: https://www.cnblogs.com/Jermy/articles/11019827.html
