1. What the Spider Is Responsible For
# A spider is mainly responsible for two things:
1. Defining the actions used to crawl a site
2. Parsing response data: extracting the target data and handing it to a pipeline for persistent storage, or collecting target URLs for the next round of requests

# The spider file as initially generated:
import scrapy


class DddSpider(scrapy.Spider):
    name = 'ddd'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass
# Notes on the two points above, in terms of the generated spider code:
1. By default the framework handles the start_urls requests itself: once the project starts, Scrapy automatically makes the first request to every address listed in start_urls. That is what "defining the crawling actions" means. If several URLs need to be requested, you can comment out start_urls and define a start_requests() method of your own; a detailed example appears later.
2. The response to each start_urls request is automatically passed to the parse() method as its response argument. parse() is where the response is parsed; any parsing library can be used here, but the usual choice is Scrapy's Selector with XPath expressions or CSS selectors.
3. The data extracted in parse() generally falls into two cases:
(1). Target data: instantiate an item object in parse() and return it, so that a pipeline can persist it.
(2). Target URLs: return a Request object with a callback bound to it, driving the next request/parse cycle.
A minimal sketch covering these points follows below.
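As a minimal sketch of the points above (the target site, XPath expressions, and field names here are illustrative assumptions, not part of the generated template), a spider that builds its own start requests and yields both items and follow-up Requests could look like this:

import scrapy
from scrapy import Request


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']   # hypothetical demo target

    # Replaces start_urls: build the initial requests by hand
    def start_requests(self):
        for page in range(1, 4):
            url = 'http://quotes.toscrape.com/page/%d/' % page
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Case (1): target data -> yield a dict-like item for the pipelines
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
            }
        # Case (2): a target url -> yield a Request bound to a callback
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse)

Yielding a plain dict works just like yielding an Item instance; in a real project you would normally define an Item class, as the example in section 3 does.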
2. The Spider Class in Detail
(1). Basic attributes of the Spider class
(1). name: the spider's name. It identifies the spider and must be unique among all spiders in the project, so it is usually derived from the site's domain name (important).

(2). allowed_domains: the domains the spider is allowed to crawl. Optional; links outside this range will not be followed.

(3). start_urls: the starting URLs. When start_requests() is not overridden, the framework begins crawling from these URLs by default (important).

(4). custom_settings: a dict of settings that apply only to this spider and override the project-wide settings; it must be declared as a class variable.

(5). crawler: set by the from_crawler() method, it is the Crawler object this spider is bound to; through it you can reach project-level information such as the settings. A small sketch of these attributes follows below.
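As a small, hypothetical sketch of how these attributes sit in a spider (the name, domain, and setting values are made up for illustration): custom_settings is a class variable, and self.crawler, populated by from_crawler(), exposes the effective settings.

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'                          # must be unique across the project
    allowed_domains = ['example.com']      # off-domain links are not followed
    start_urls = ['http://example.com/']   # used when start_requests() is not overridden

    # Class variable: these settings override the project-wide settings,
    # but only for this spider
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 4,
    }

    def parse(self, response):
        # self.crawler is set through from_crawler(); its .settings attribute
        # exposes the merged project + spider configuration
        delay = self.crawler.settings.get('DOWNLOAD_DELAY')
        self.logger.info('DOWNLOAD_DELAY in effect: %s', delay)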
(2). Commonly used Spider methods
(1). start_requests(): generates the initial requests and must return an iterable. By default it builds a GET Request for each URL in start_urls. If you want to hit a site with a POST request at startup instead, override this method and send the POST with FormRequest (see the sketch after this list).

(2). parse(): called by default when a Response has no callback assigned to it. It processes the Response, extracts the wanted data and/or the follow-up requests, and returns them; it must return an iterable of Request objects and/or items.

(3). closed(): called when the spider closes; this is the usual place for releasing resources and other clean-up work.
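A hedged sketch of that pattern (the login URL, form fields, and follow-up page below are placeholders, not a real site): the spider overrides start_requests() to send a POST via FormRequest and also implements closed() for clean-up.

import scrapy
from scrapy import FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    # Override the default GET requests built from start_urls with a POST login
    def start_requests(self):
        yield FormRequest(
            url='http://example.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Continue crawling with the now-authenticated session
        yield scrapy.Request('http://example.com/profile', callback=self.parse)

    def parse(self, response):
        pass

    def closed(self, reason):
        # Runs once when the spider shuts down; release resources here
        self.logger.info('Spider closed: %s', reason)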
3. A Simple Example
# Requirement: crawl the newest listings under the "hot models" section of the Guazi used-car homepage, collect each car's details (listing title, registration date, displayed mileage, engine displacement, seller's asking price, etc.), and store them in a MongoDB database.

# Approach:
1. Implement start_requests() by hand. Make sure to build a complete headers dict that includes the Cookie, otherwise the target data cannot be fetched. The first request fetches the listing page, so its callback must extract the detail-page URLs.
2. In parse(), after extracting a detail-page URL, build another Request for the detail page and point it at a second callback. That callback extracts the title, mileage, displacement and the other fields, builds an item object, stores the data in it, and passes it on to the pipeline.
3. Store the received item objects in the pipeline.
# The spider file defines the crawling actions: two requests in total, each with its own parse callback
import scrapy
from scrapy import Request
from guazicar.items import GuazicarItem


class GzcSpider(scrapy.Spider):
    name = 'gzc'

    # Headers used for every request. The Cookie is mandatory and must sit inside the headers
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Host": "www.guazi.com",
        "Pragma": "no-cache",
        "Referer": "https://www.guazi.com/bj/sell/?clueS=01",
        "Upgrade-Insecure-Requests": "1",
        'Cookie': 'uuid=06d6e503-7ae0-4a0b-c210-3ff1abee2e70; ganji_uuid=9858880605988539539234; lg=1; antipas=8246994L24y6400058190652V56; clueSourceCode=10103000312%2300; user_city_id=12; sessionid=28f2424e-eddf-4d6c-a175-f0a9bff05a99; guazi_tracker_session_pageload_snapshot=%7B%22common%22%3A%7B%22line%22%3A%22c2c%22%2C%22platform%22%3A%22web%22%2C%22pagetype%22%3A%22sell%22%7D%2C%22trackings%22%3A%5B%7B%22guid%22%3A%2206d6e503-7ae0-4a0b-c210-3ff1abee2e70%22%2C%22userid%22%3A%22-%22%2C%22sessionid%22%3A%2228f2424e-eddf-4d6c-a175-f0a9bff05a99%22%2C%22page%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fsell%2F%3FclueS%3D01%22%2C%22referer%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fhuracan%2Fr19%2F%22%2C%22referrer%22%3A%22https%3A%2F%2Fwww.guazi.com%2Fbj%2Fhuracan%2Fr19%2F%22%2C%22city%22%3A%22bj%22%2C%22landing%22%3A%220%22%2C%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22tbmkbturl%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%2210103000312%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22version%22%3A1%2C%22client_ab%22%3A%22-%22%2C%22os%22%3A%22Windows%2010%22%2C%22screen_resolution%22%3A%221536%20x%20864%22%2C%22view_port%22%3A%221519%20x%20722%22%2C%22is_native%22%3A%220%22%2C%22tracker_version%22%3A%221.6.11%22%2C%22clientTime%22%3A%221560386906116%22%2C%22tracking_type%22%3A%22pageload%22%2C%22pageid%22%3A%2203ed421b-71d9-40c0-cb7d-c6388945fa48%22%7D%5D%7D; cainfo=%7B%22ca_s%22%3A%22pz_baidu%22%2C%22ca_n%22%3A%22tbmkbturl%22%2C%22ca_medium%22%3A%22-%22%2C%22ca_term%22%3A%22-%22%2C%22ca_content%22%3A%22%22%2C%22ca_campaign%22%3A%22%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22scode%22%3A%2210103000312%22%2C%22keyword%22%3A%22-%22%2C%22ca_keywordid%22%3A%22-%22%2C%22ca_transid%22%3A%22%22%2C%22platform%22%3A%221%22%2C%22version%22%3A1%2C%22display_finance_flag%22%3A%22-%22%2C%22client_ab%22%3A%22-%22%2C%22guid%22%3A%2206d6e503-7ae0-4a0b-c210-3ff1abee2e70%22%2C%22sessionid%22%3A%2228f2424e-eddf-4d6c-a175-f0a9bff05a99%22%2C%22ca_b%22%3A%22-%22%2C%22ca_a%22%3A%22-%22%7D; preTime=%7B%22last%22%3A1560386825%2C%22this%22%3A1560256593%2C%22pre%22%3A1560256593%7D; _gl_tracker=%7B%22ca_source%22%3A%22-%22%2C%22ca_name%22%3A%22-%22%2C%22ca_kw%22%3A%22-%22%2C%22ca_id%22%3A%22-%22%2C%22ca_s%22%3A%22self%22%2C%22ca_n%22%3A%22-%22%2C%22ca_i%22%3A%22-%22%2C%22sid%22%3A32099126900%7D; cityDomain=bj'
    }

    # Issue the first request by hand to fetch the listing page
    def start_requests(self):
        yield Request(url='https://www.guazi.com/bj/', callback=self.parse, method='GET', headers=self.headers)

    # Callback for the second request: extract the target data from the detail page,
    # build an item object and pass it to the pipeline
    def detail_parse(self, response):
        title = response.selector.xpath('//div[@class="product-textbox"]/h2[@class="titlebox"]/text()')
        time = response.selector.xpath('//ul[@class="assort clearfix"]/li[1]/span/text()')
        mileage = response.selector.xpath('//ul[@class="assort clearfix"]/li[2]/span/text()')
        swept_volume = response.selector.xpath('//ul[@class="assort clearfix"]/li[3]/span/text()')
        price = response.selector.xpath('//div[@class="pricebox js-disprice"]/span[@class="pricestype"]/text()')
        item = GuazicarItem()
        item["title"] = title.extract_first()
        item["time"] = time.extract_first()
        item["mileage"] = mileage.extract_first()
        item["swept_volume"] = swept_volume.extract_first()
        item["price"] = price.extract_first()
        return item

    # Callback for the first request: extract each detail-page url and issue the second request
    def parse(self, response):
        li_list = response.selector.xpath('//ul[@class="carlist clearfix"]/li')
        for li in li_list:
            url = li.xpath('./a/@href').extract_first()
            full_url = 'https://www.guazi.com' + url
            yield Request(url=full_url, callback=self.detail_parse, headers=self.headers)
# The item file defines the structure in which the target data is stored
import scrapy


class GuazicarItem(scrapy.Item):
    title = scrapy.Field()         # listing title
    time = scrapy.Field()          # registration date
    mileage = scrapy.Field()       # displayed mileage
    swept_volume = scrapy.Field()  # engine displacement
    price = scrapy.Field()         # asking price
# The pipeline file persists the data into MongoDB
import pymongo


class GuazicarPipeline(object):

    mongo_client = pymongo.MongoClient('127.0.0.1:27017')
    mongo_db = mongo_client['guazicai']

    def process_item(self, item, spider):
        # Use the pipeline class name as the collection name
        name = self.__class__.__name__
        self.mongo_db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.mongo_client.close()
# Settings changed in the settings file
COOKIES_ENABLED = False   # disable Scrapy's cookie middleware so the Cookie set in the spider's headers is sent as-is
ITEM_PIPELINES = {
    'guazicar.pipelines.GuazicarPipeline': 300,
}
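With these settings in place, the example can be run from the project root with scrapy crawl gzc; each scraped car then ends up as a document in the GuazicarPipeline collection of the local guazicai MongoDB database (the collection name comes from the class name used in process_item()).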