1. Scrapy Overview
Introduction: Scrapy is an application framework written for crawling websites and extracting structured data.
Application areas: data mining, data analysis, and similar fields.
Installation: pip install scrapy
1.1 Common Commands
scrapy -h  # list the available commands
Create a project: scrapy startproject <project_name>
Create a spider: scrapy genspider <spider_name (a unique identifier)> <site domain>
Run a spider: scrapy crawl <spider_name>

Alternatively, write a script that imports cmdline and calls execute() to run the spider:

from scrapy import cmdline
cmdline.execute("scrapy crawl <spider_name>".split())

Open the Scrapy interactive shell: scrapy shell

1. Request a URL directly:
scrapy shell <url>
# after the request completes:
response.status
response.url
response.text

2. Request a URL with custom settings: scrapy shell -s name=value
For example: scrapy shell -s USER_AGENT='Mozilla/5.0' <url>

3. Enter the shell first, then issue the request from inside it:
scrapy shell
from scrapy import Request
req = Request(url, headers={})  # build a Request object
fetch(req)  # fetch the request and update the shell's local objects (response, etc.)
response.status
response.url
response.text
2. Selecting Nodes (Node Sets) with XPath
XPath selector: response.xpath("<xpath expression>") returns a SelectorList. For the basic syntax, see https://blog.csdn.net/weixin_42569562/article/details/84670604?from=singlemessag
response.xpath("<xpath expression>").extract(): extracts the matched results as a list of strings.
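These calls can be tried outside a spider by building a Selector from an HTML string; the snippet below is a minimal sketch using made-up HTML:

from scrapy.selector import Selector

html = '''
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
'''

sel = Selector(text=html)
links = sel.xpath("//li/a")  # returns a SelectorList
print(links.extract())       # list of matched HTML fragments
print(sel.xpath("//li/a/text()").extract())        # ['First', 'Second']
print(sel.xpath("//li/a/text()").extract_first())  # 'First'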
2.1 Common XPath Functions
text(): extracts the text content of a node
last(): selects the last node, e.g. //*[last()]
starts-with(@attr, strh): nodes whose attr attribute value starts with strh
contains(@attr, substr): nodes whose attr attribute value contains substr
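A short sketch of these functions, again on made-up HTML:

from scrapy.selector import Selector

html = '''
<div>
  <p class="intro-short">one</p>
  <p class="intro-long">two</p>
  <p class="summary">three</p>
</div>
'''

sel = Selector(text=html)
print(sel.xpath("//p/text()").extract())                               # ['one', 'two', 'three']
print(sel.xpath("//p[last()]/text()").extract())                       # ['three']
print(sel.xpath("//p[starts-with(@class,'intro')]/text()").extract())  # ['one', 'two']
print(sel.xpath("//p[contains(@class,'long')]/text()").extract())      # ['two']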
Douban Movie Top 250 data:

1. Extract the movie names:
response.xpath("//div[@class='hd']/a/span[1]/text()").extract()
Take the name at index 5 from the extracted list:
response.xpath("//div[@class='hd']/a/span[1]/text()").extract()[5]

2. Extract the movie ratings:
response.xpath("//span[@class='rating_num']/text()").extract()

3. Extract the movie image URLs:
response.xpath("//div[@class='pic']/a/img/@src").extract()

4. Extract the number of raters:
response.xpath("//div[@class='star']/span[last()]/text()").extract()
2.2 CSS Selectors: response.css("<css selector>")
response.css("<css selector>").extract(): extracts the matched content as a list.

Selector syntax:
.classvalue: elements whose class is "classvalue"
#idvalue: the element whose id is "idvalue"
div~p: every p element preceded by a sibling div element
li span: span elements nested inside li elements
::text: extracts text content, e.g. response.css("a>span::text").extract()
::attr(src): extracts an attribute's value, e.g. response.css("li a::attr('href')").extract()

Attribute selectors:
[attr]: all elements that carry the attr attribute
[attr=val]: all elements whose attr value is val
[attr^=val]: elements whose attr value starts with val
[attr$=val]: elements whose attr value ends with val
[attr*=val]: elements whose attr value contains val
For example: response.css("[class='<class name>']::text").extract()

Douban Top 250:
1. Select the movie names:
response.css("div.hd > a > span.title:first-child::text").extract()
2. Select the image URLs:
response.css("div.pic>a>img::attr('src')").extract()
3. Select the text of tags whose class attribute is "inq":
response.css("[class='inq']::text").extract()
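The same kind of spot-checking works for CSS selectors; a minimal sketch with made-up HTML:

from scrapy.selector import Selector

html = '''
<ul>
  <li><a href="/x"><span class="title">Alpha</span></a></li>
  <li><a href="/y"><span class="title">Beta</span></a></li>
</ul>
'''

sel = Selector(text=html)
print(sel.css("a > span::text").extract())            # ['Alpha', 'Beta']
print(sel.css("li a::attr(href)").extract())          # ['/x', '/y']
print(sel.css("[class='title']::text").extract())     # ['Alpha', 'Beta']
print(sel.css("[href^='/x']::attr(href)").extract())  # ['/x']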
2.3 Code Demo
1. http://httpbin.org/get

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import HtmlResponse


class InfoSpider(scrapy.Spider):
    name = 'info'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response: HtmlResponse):
        print("*" * 100)
        print("Response body: " + response.text)
        print("Response status code:", response.status)
        print("URL the response was received from:", response.url)
        print("Request URL:", response.request.url)
        print("*" * 100)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl info".split())
2. https://movie.douban.com/top250
# -*- coding: utf-8 -*-
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print("*" * 100)
        movies = response.xpath("//div[@class='item']")
        pics = []
        names = []
        for movie_item in movies:
            # movie_item is a Selector; Selector objects also support xpath()
            pic_src = movie_item.xpath("div[@class='pic']//img//@src").extract()[0]
            movie_name = movie_item.xpath("div[@class='info']/div[@class='hd']/a/span[1]/text()").extract()[0]
            pics.append(pic_src)
            names.append(movie_name)
        print("Movie image list:", pics)
        print("Movie name list:", names)
        print("*" * 100)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl douban".split())
3. https://www.qiushibaike.com/

# -*- coding: utf-8 -*-
import scrapy


class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        print("*" * 100)
        articles = response.xpath("//div[starts-with(@class,'article ')]")
        pics = []
        nicknames = []
        for article in articles:
            pic_src = "https:" + article.xpath("div[starts-with(@class,'author')]//img/@src").extract()[0]
            nickname = article.xpath("div[starts-with(@class,'author')]//h2/text()").extract()[0]
            pics.append(pic_src)
            nicknames.append(nickname)
        print("Avatar URL list:", pics)
        print("Author name list:", nicknames)
        print("*" * 100)
4. https://music.163.com/discover
# -*- coding: utf-8 -*-
import scrapy


class MusicSpider(scrapy.Spider):
    name = 'music'
    allowed_domains = ['music.163.com']
    start_urls = ['https://music.163.com/discover']

    def parse(self, response):
        oldimglist = response.xpath("//div[@class='u-cover u-cover-1']/img/@src").extract()
        imgneedlist = []
        for imgsrc in oldimglist:
            print("imgsrc:", imgsrc)
            if imgsrc.startswith('http'):  # keep only absolute image URLs
                imgneedlist.append(imgsrc)
        print("*" * 100)
        for img in imgneedlist:
            print(img)
        print("*" * 100)
3. Scrapy Workflow
1. The engine obtains the initial requests and starts crawling.
2. The engine hands those requests to the scheduler and gets ready to fetch the next request.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, which fetches the page data through the downloader middleware.
5. Once the downloader finishes downloading the page, it returns the result to the engine.
6. The engine passes the downloader's response through the middleware to the spider for processing.
7. The spider processes the response and returns extracted items, plus any new requests, to the engine through the middleware.
8. The engine sends the processed items to the item pipeline and hands the new requests to the scheduler, which schedules the next fetch.
9. The process repeats (from step 1) until every URL request has been crawled. The sketch below shows where these steps surface in spider code.
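To tie the steps above to code, here is a minimal sketch of where a spider sits in this loop; the URL and selectors are placeholders, not taken from the examples in this article:

import scrapy


class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'
    start_urls = ['https://example.com/']  # step 1: the initial requests handed to the engine

    def parse(self, response):
        # steps 6-7: the engine delivers the downloaded response here
        for href in response.css("a::attr(href)").extract():
            # yielded Requests travel back through the engine to the scheduler (steps 2-3)
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        # yielded items are routed by the engine to the item pipeline (step 8)
        yield {"url": response.url}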
4. Issuing Follow-up Requests from a Spider
Yield a Request object:

yield Request(url, callback=<callback>)  # follow-up request with a specified callback

Data can also be passed to the callback through meta:

yield Request(url, meta=<data to pass>, callback=<callback name>)

For example:

param_dict = {
    "author_name": author_name
}
# follow-up request with a callback; the data is passed along via meta
yield Request(author_url, meta=param_dict, callback=self.personal_space)
4.1 Code Demo
1. https://movie.douban.com/top250

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class MultiRequestsSpider(scrapy.Spider):
    name = 'multi'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        base_url = 'https://movie.douban.com/top250'
        lista = response.xpath("//div[@class='hd']/a")
        movie_links = []  # collect the detail-page URLs on the current page
        for a in lista:
            movie_link = a.xpath("@href").extract()[0]  # each movie's detail URL
            movie_links.append(movie_link)
            movie_name = a.xpath("span[1]/text()").extract()[0]
            print(movie_name)
        for movie_url in movie_links:
            yield Request(movie_url)  # no callback given, so parse() is used by default
        # get the href of the "next page" link
        next_page = response.xpath("//span[@class='next']/a/@href").extract()
        if next_page:
            next_url = base_url + next_page[0]
            yield Request(next_url, callback=self.parse)  # callback for the next listing page
2. https://www.qiushibaike.com/
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/']

    def parse(self, response):
        authors = response.xpath("//div[starts-with(@class,'author ')]")
        base_url = 'https://www.qiushibaike.com'
        for author in authors:
            links = author.xpath("a[1]/@href").extract()
            if links:
                link = links[0]
                author_url = base_url + link  # build the full URL of the author's personal page
                author_name = author.xpath("a[2]/h2/text()").extract()[0]
                param_dict = {
                    "author_name": author_name
                }
                # follow-up request with a callback; the data is passed along via meta
                yield Request(author_url, meta=param_dict, callback=self.personal_space)

    def personal_space(self, response):
        data = response.meta  # receive the data passed via meta
        print("Received author name: " + data["author_name"])
        print("*" * 100)
5. Items in Detail
1. Purpose: an Item stores the extracted data and behaves like a dictionary (see the short sketch after this list).
2. Storing the extracted content. Supported export formats: CSV, JSON, XML.
   JSON: scrapy crawl <spider_name> -t json -o xx.json
   JSON with an explicit encoding: scrapy crawl <spider_name> -t json -o xx.json -s FEED_EXPORT_ENCODING='utf-8'
   CSV: scrapy crawl <spider_name> -t csv -o xx.csv
   XML: scrapy crawl <spider_name> -t xml -o xx.xml
3. Field types: a Field can hold a value of any type.
4. Assigning to an item: item['name'] = 'xxx'
5. In the spider, yield the Item object so the framework processes it.
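A quick sketch of the dictionary-like behavior, using a hypothetical NewsItem with a single title field:

import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()  # Field() places no constraint on the value's type


item = NewsItem()
item["title"] = "hello"  # dict-style assignment
print(item["title"])     # hello
print(dict(item))        # {'title': 'hello'}
try:
    item["author"] = "x"  # only declared Fields may be assigned
except KeyError as err:
    print("undeclared field rejected:", err)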
5.1 Code Demo
# qiubai_item.py
# -*- coding: utf-8 -*-
import scrapy
from ScrapyDay2.items import QiubaiItem


class QiubaiItemSpider(scrapy.Spider):
    name = 'qiubai_item'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/']

    def parse(self, response):
        imgs = response.xpath("//div[starts-with(@class,'author ')]/a/img/@src").extract()
        for img in imgs:
            # instantiate an Item and store the scraped data in it
            qbitem = QiubaiItem(figure_path="https:" + img)
            yield qbitem
        print("*" * 100)
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiItem(scrapy.Item):
    figure_path = scrapy.Field()
Note: run the export from the terminal:
JSON: scrapy crawl <spider_name> -t json -o xx.json -s FEED_EXPORT_ENCODING='utf-8'
CSV: scrapy crawl <spider_name> -t csv -o xx.csv
XML: scrapy crawl <spider_name> -t xml -o xx.xml
6. Item Pipeline
1. Purpose: process the Items a spider yields.
2. Implementation: implement the relevant methods (a full skeleton with all four hooks follows this list):
   open_spider: called when the spider opens
   close_spider: called when the spider closes
   process_item: processes each item
   from_crawler (classmethod): creates the pipeline instance
   Note: the from_crawler classmethod can read settings defined in settings.py, for example:

   @classmethod
   def from_crawler(cls, crawler):
       return cls(savepath=crawler.settings.get("JSON_PATH"))

3. Register the pipeline in settings.py via ITEM_PIPELINES.
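Putting the four hooks together, here is a minimal skeleton; the JSON_PATH setting follows the example above, and the pipeline name is made up:

import json


class JsonWriterPipeline(object):
    def __init__(self, savepath):
        self.savepath = savepath
        self.savefile = None

    @classmethod
    def from_crawler(cls, crawler):
        # build the instance, reading the save path from settings.py
        return cls(savepath=crawler.settings.get("JSON_PATH"))

    def open_spider(self, spider):
        # called once when the spider opens: acquire resources here
        self.savefile = open(self.savepath, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider closes: release resources here
        self.savefile.close()

    def process_item(self, item, spider):
        # called for every item; return the item so later pipelines see it too
        self.savefile.write(json.dumps(dict(item)) + "\n")
        return item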
6.1 Code Demo
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class QiuBaiPipeline(object):
    def __init__(self, savepath):
        self.savefile = open(savepath, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.savefile.close()

    def process_item(self, item, spider):
        data = json.dumps(dict(item)) + "\n"
        self.savefile.write(data)
        return item

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the pipeline, passing the save path read from settings.py
        return cls(savepath=crawler.settings.get("SAVE_PATH1"))


class QiuBaiPipeline2(object):
    def __init__(self, savepath):
        self.savefile = open(savepath, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.savefile.close()

    def process_item(self, item, spider):
        figure_path = item["figure_path"] + "\n"
        self.savefile.write(figure_path)
        return item

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the pipeline, passing the save path read from settings.py
        return cls(savepath=crawler.settings.get("SAVE_PATH2"))
# settings.py
ITEM_PIPELINES = {
    'ScrapyDay2.pipelines.QiuBaiPipeline': 300,
    'ScrapyDay2.pipelines.QiuBaiPipeline2': 301,
}
SAVE_PATH1 = 'qb_path1.txt'
SAVE_PATH2 = 'qb_path2.txt'