The Scrapy Framework

王家——王炎

Article link: https://blog.csdn.net/weixin_45126952/article/details/98759848

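Introduction to the Scrapy Framework with Simple Examples
2018-01-09 11:06:06, gxh_apologize
Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/GXH_APOLOGIZE/article/details/79010585
I. About Scrapy
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used for a wide range of purposes.
Scrapy handles network communication with the Twisted asynchronous networking framework (whose main rival is Tornado). This speeds up downloads without our having to implement an asynchronous framework ourselves, and the middleware interfaces it includes let us meet a wide range of requirements flexibly.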
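To see why an asynchronous network layer speeds up downloading, here is a minimal sketch. It uses the standard library's asyncio rather than Twisted, and fake_download with its 0.1-second delay is a hypothetical stand-in for a real HTTP request; none of it is Scrapy's own code.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    # Stand-in for a network request: while one task waits, others can run.
    await asyncio.sleep(delay)
    return url


async def download_all(urls):
    # Start every download at once and wait for all of them together.
    return await asyncio.gather(*(fake_download(u) for u in urls))


urls = ["page-%d" % i for i in range(5)]
start = time.perf_counter()
results = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start
# The five waits overlap, so the total is close to one delay, not five.
print(results, round(elapsed, 2))
```

Run sequentially, the same five waits would take about 0.5 seconds; overlapping them is the benefit the asynchronous framework provides.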
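Official documentation: http://doc.scrapy.org/en/latest
Chinese-language documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
II. Installing Scrapy
Reference: https://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes

I installed it on Ubuntu 16.04.

Install the non-Python dependencies (python-dev and python-pip are the Python 2 packages; for Python 3, which current Scrapy releases require, use python3-dev and python3-pip):

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Install Scrapy with pip:

sudo pip install scrapy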
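III. Scrapy Project Layout
First create a Scrapy project named teshi:

scrapy startproject teshi

The generated project contains the following files and directories:

scrapy.cfg: the project's configuration file
teshi/: the project's Python module; your code is imported from here
teshi/items.py: the project's item definitions
teshi/pipelines.py: the project's item pipelines
teshi/settings.py: the project's settings
teshi/spiders/: the directory that holds the spider code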
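IV. Project Workflow and a First Example
This example crawls teacher information from a training institution's website. Working through it gives a basic feel for how the Scrapy framework is used.
1. Create the project mySpider

scrapy startproject mySpider

2. Edit the project settings file mySpider/settings.py

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}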
3. Edit the item file mySpider/items.py to define the structured data fields that will hold the scraped data

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
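An item behaves like a dictionary that only accepts the fields declared on its class. The sketch below mimics that behaviour with the standard library alone; SimpleItem and TeacherItem are hypothetical names used for illustration, and this is not how Scrapy implements scrapy.Item internally.

```python
class SimpleItem(dict):
    """Dict that only accepts keys listed in `fields` (illustrative only)."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class TeacherItem(SimpleItem):
    fields = ("name", "title", "info")


item = TeacherItem()
item["name"] = "Alice"  # declared field: accepted
try:
    item["age"] = 30  # undeclared field: rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(item), rejected)
```

Declaring the fields up front means a misspelled key fails immediately instead of silently producing an incomplete record.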
4. Write the spider. Create itcastSpider.py in the mySpider/mySpider/spiders directory (it can also be generated with a command, which is covered later; here we create it by hand)

# coding:utf-8

import scrapy
from mySpider.items import MyspiderItem


class ItcastSpider(scrapy.Spider):
    # Name that identifies the spider; it must be unique across spiders
    name = "itcast"
    # Domains the spider is allowed to crawl; URLs outside them are ignored
    allowed_domains = ["itcast.cn"]
    # URLs the spider starts from
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#"]

    # Parsing method, called once each start URL has been downloaded
    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # List that collects the extracted items
        teacherItem = []
        for each in teacher_list:
            teacher_name = each.xpath('./h3/text()').extract()[0]
            teacher_title = each.xpath('./h4/text()').extract()[0]
            teacher_info = each.xpath('./p/text()').extract()[0]

            item = MyspiderItem()
            item['name'] = teacher_name
            item['title'] = teacher_title
            item['info'] = teacher_info

            teacherItem.append(item)

        return teacherItem
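The extraction logic in parse can be tried without crawling anything. The sketch below applies the same three relative paths to a hand-written fragment using the standard library's xml.etree.ElementTree, which supports only a small subset of XPath (Scrapy's own selectors are more capable). The SAMPLE markup is hypothetical and merely shaped like the page the spider expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the page the spider expects.
SAMPLE = """
<html><body>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Ten years of Java.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Focus on Python.</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
teachers = []
# Select every teacher block, then read its children relative to that block.
for each in root.findall(".//div[@class='li_txt']"):
    teachers.append({
        "name": each.find("./h3").text,
        "title": each.find("./h4").text,
        "info": each.find("./p").text,
    })

print(teachers)
```

The pattern is the same as in the spider: one query selects the repeated container, and relative queries starting with ./ pull each field out of it.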
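5. Run the program
scrapy crawl <spider name> -o <output file name with extension>

JSON format; by default non-ASCII characters are written as Unicode escapes:

scrapy crawl itcast -o teachers.json

JSON Lines format; by default non-ASCII characters are written as Unicode escapes:

scrapy crawl itcast -o teachers.jsonl

CSV (comma-separated values), which can be opened in Excel:

scrapy crawl itcast -o teachers.csv

XML format:

scrapy crawl itcast -o teachers.xml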
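The difference between the JSON and JSON Lines outputs, and what the Unicode escapes look like, can be sketched with the standard json module. The records below are hypothetical, and this only illustrates the two file shapes; it is not Scrapy's exporter code.

```python
import json

# Hypothetical scraped records.
items = [
    {"name": "Teacher A", "title": "Lecturer"},
    {"name": "Teacher B", "title": "Senior Lecturer"},
]

# JSON: a single array holding every item.
as_json = json.dumps(items)

# JSON Lines: one standalone JSON object per line.
as_jsonl = "\n".join(json.dumps(item) for item in items)

# With the default ensure_ascii=True, non-ASCII text becomes \uXXXX escapes.
escaped = json.dumps({"name": "王"})

print(as_json)
print(as_jsonl)
print(escaped)
```

JSON Lines is convenient for large crawls because each line can be read and processed on its own, without loading the whole file.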
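Note:
parse(self, response) is the parsing method. It is called once each start URL has finished downloading, and the Response object returned for that URL is passed in as its only argument. It has two main jobs:

(1) Parse the returned page data (response.body) and extract structured data (produce items).
(2) Generate the request for the next page's URL, if one is needed.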

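V. Extending the First Example with an Item Pipeline
First, a quick review of what yield does:

It pauses the function and returns a value.
On the next call, execution resumes from where it last stopped.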
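The pause-and-resume behaviour can be seen in a plain generator, independent of Scrapy. In this minimal sketch, count_up is a hypothetical function written only for illustration.

```python
def count_up(limit):
    n = 0
    while n < limit:
        print("producing", n)
        yield n  # pause here and hand n to the caller
        n += 1   # execution resumes here on the next request


gen = count_up(3)
first = next(gen)    # runs until the first yield
second = next(gen)   # resumes after the yield, then pauses again
rest = list(gen)     # drains whatever is left
print(first, second, rest)
```

Nothing is computed until a value is requested, which is why a spider that yields items can pass them on one at a time instead of building the whole list first.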
The previous example collected the results in a list and returned them all at once; here we hand them to an item pipeline instead. This example builds on the code from the previous one:

1. Modify the spider to use yield

# coding:utf-8

import scrapy
from mySpider.items import MyspiderItem


class ItcastSpider(scrapy.Spider):
    # Name that identifies the spider
    name = "itcast"
    # Domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # URLs the spider starts from
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#"]

    # Parsing method, called once each start URL has been downloaded
    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        for each in teacher_list:
            teacher_name = each.xpath('./h3/text()').extract()[0]
            teacher_title = each.xpath('./h4/text()').extract()[0]
            teacher_info = each.xpath('./p/text()').extract()[0]

            item = MyspiderItem()
            item['name'] = teacher_name
            item['title'] = teacher_title
            item['info'] = teacher_info

            # Hand each item to the pipeline as soon as it is built
            yield item
2. Edit the project settings file

ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
}
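The number attached to each pipeline is its priority: Scrapy runs the enabled pipelines in ascending order of this value, and values are conventionally chosen in the 0 to 1000 range. The sketch below shows the resulting order for a setting with two entries; CleanupPipeline is a hypothetical second pipeline that does not exist in this project.

```python
# Hypothetical setting with two pipelines; only ItcastPipeline appears in this article.
ITEM_PIPELINES = {
    "mySpider.pipelines.ItcastPipeline": 300,
    "mySpider.pipelines.CleanupPipeline": 100,
}

# Lower numbers run first.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

With a single pipeline the exact number does not matter; it only becomes significant once several pipelines have to run in a particular sequence.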
3. Write the pipeline file mySpider/mySpider/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class ItcastPipeline(object):
    # Initializer; optional, define it when needed
    def __init__(self):
        # Open in text mode as UTF-8 so non-ASCII text is written as-is
        self.fileName = open("teacher.json", "w", encoding="utf-8")

    # Processes each item; this method is required
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.fileName.write(jsontext)
        return item

    # Called when the spider closes; optional
    def close_spider(self, spider):
        self.fileName.close()
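To check what such a pipeline writes without running a crawl, it can be driven by hand. The sketch below uses a stand-in class with the same three methods but a configurable output path; JsonLinesPipeline, the temporary directory, and the sample items are all assumptions made for illustration, and passing spider=None is only acceptable because this stand-in never uses the argument.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Stand-in with the same methods as the pipeline, but a configurable path."""

    def __init__(self, path):
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand: create it, feed it items, then close it.
path = os.path.join(tempfile.mkdtemp(), "teacher.json")
pipeline = JsonLinesPipeline(path)
for item in [{"name": "王老师"}, {"name": "Teacher B"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Because ensure_ascii=False is used and the file is opened as UTF-8, the Chinese name is stored as readable text rather than as escape sequences.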
4. Run the program
The itcast argument is the value of name in the spider code:

scrapy crawl itcast
VI. Tencent Recruitment Example
1. Create the project tencent

scrapy startproject tencent

2. Write the items.py file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Position name
    positionName = scrapy.Field()
    # Link to the position details
    positionLink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()
3. Write the spider. This time the code template can be generated with a command:
scrapy genspider <spider name> <domain>

scrapy genspider tencentSpider "tencent.com"

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class TencentspiderSpider(scrapy.Spider):
    name = "tencentSpider"
    allowed_domains = ["tencent.com"]

    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [
        url + str(offset)
    ]

    def handleResult(self, result):
        # Return the first match, or a single space when nothing matched
        if len(result) >= 1:
            return result.extract()[0]
        else:
            return " "

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):

            # Instantiate the item model
            item = TencentItem()
            item['positionName'] = self.handleResult(each.xpath("./td[1]/a/text()"))
            item['positionLink'] = self.handleResult(each.xpath("./td[1]/a/@href"))
            item['positionType'] = self.handleResult(each.xpath("./td[2]/text()"))
            item['peopleNum'] = self.handleResult(each.xpath("./td[3]/text()"))
            item['workLocation'] = self.handleResult(each.xpath("./td[4]/text()"))
            item['publishTime'] = self.handleResult(each.xpath("./td[5]/text()"))

            # Hand the data to the pipeline
            yield item

        if self.offset < 500:
            self.offset += 10

            # Hand the next-page request to the scheduler: it is queued,
            # dequeued, and then passed to the downloader
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
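The pagination in parse boils down to a counter that grows by 10 until it reaches 500. The sketch below lists the URLs this produces, without making any requests; page_urls and its limit and step parameters are hypothetical names introduced for illustration.

```python
BASE_URL = "http://hr.tencent.com/position.php?&start="


def page_urls(limit=500, step=10):
    # Mirror the spider's counter: start at 0, then advance by `step`
    # for as long as the current offset is still below `limit`.
    offset = 0
    yield BASE_URL + str(offset)
    while offset < limit:
        offset += step
        yield BASE_URL + str(offset)


urls = list(page_urls())
print(len(urls), urls[0], urls[-1])
```

The spider reaches the same pages one at a time: each call to parse requests only the next page, and that page's response triggers the call after it.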
4. Write the pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):

    def __init__(self):
        # Open in text mode as UTF-8 so non-ASCII text is written as-is
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
5. Edit the project settings file

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',