Introduction to the Scrapy Framework with Simple Examples

gxh_apologize

Category: Python Study Notes (Python学习笔记)

Article link: https://blog.csdn.net/GXH_APOLOGIZE/article/details/79010585

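I. Introduction to Scrapy

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used for a wide range of tasks.
Scrapy uses the Twisted asynchronous networking framework (whose main rival is Tornado) to handle network communication. This speeds up downloads without requiring you to implement asynchronous handling yourself, and the framework exposes a set of middleware interfaces so it can be adapted flexibly to different needs.
Official documentation: http://doc.scrapy.org/en/latest
Chinese-language documentation site: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html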

II. Installing Scrapy

Reference: https://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes

I installed it on Ubuntu 16.04:
- Install the non-Python dependencies:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

- Install Scrapy with pip:

sudo pip install scrapy

III. Scrapy Project Directory Layout

First, create a Scrapy project named teshi:

scrapy startproject teshi

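The project directory structure is as follows:

scrapy.cfg: the project's configuration file
teshi/: the project's Python module; your code is imported from here
teshi/items.py: the project's item definitions (the data you intend to scrape)
teshi/pipelines.py: the project's pipeline file
teshi/settings.py: the project's settings file
teshi/spiders/: the directory that holds the spider code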

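IV. Project Workflow and a First Introductory Example

This example scrapes teacher information from a training institution's official website. Working through it gives a basic understanding of how the Scrapy framework is used.
1. Create a new project, mySpider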

scrapy startproject mySpider

2. Edit the project settings file mySpider/settings.py

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

3. Edit the project item file mySpider/items.py to define the structured data fields that will hold the scraped data

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:

    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

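4. Write the spider code: create itcastSpider.py in the mySpider/mySpider/spiders directory (it can also be generated with a command, which is covered later; here we create it by hand)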

# coding:utf-8

import scrapy
from mySpider.items import MyspiderItem


class ItcastSpider(scrapy.Spider):
    # Identifying name of the spider; it must be unique, so different spiders need different names
    name = "itcast"
    # Domain scope of the spider: only pages under these domains are crawled, and URLs outside them are ignored
    allowed_domains = ["itcast.cn"]
    # Start URL(s) of the crawl
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml#",)

    # Parse method, called once each start URL has been downloaded
    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # List that collects the scraped items
        teacherItem = []
        for each in teacher_list:
            teacher_name = each.xpath('./h3/text()').extract()[0]
            teacher_title = each.xpath('./h4/text()').extract()[0]
            teacher_info = each.xpath('./p/text()').extract()[0]

            item = MyspiderItem()
            item['name'] = teacher_name
            item['title'] = teacher_title
            item['info'] = teacher_info

            teacherItem.append(item)

        return teacherItem

5. Run the program
scrapy crawl <spider name> -o <output file name with extension>

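# JSON format; by default non-ASCII text is written as \uXXXX Unicode escapes
scrapy crawl itcast -o teachers.json

# JSON Lines format; by default non-ASCII text is written as \uXXXX Unicode escapes
scrapy crawl itcast -o teachers.jsonl

# CSV (comma-separated values); can be opened in Excel
scrapy crawl itcast -o teachers.csv

# XML format
scrapy crawl itcast -o teachers.xml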
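
The difference between the .json and .jsonl outputs can be seen without running a crawl. The sketch below uses only the standard json module and two made-up records (the names and titles are hypothetical, not scraped data). It assumes nothing about Scrapy's exporters beyond the file shapes: a JSON feed is one array holding every item, while a JSON Lines feed has one standalone object per line.

```python
import json

# Hypothetical records standing in for scraped items.
items = [
    {"name": "Teacher A", "title": "Senior Lecturer"},
    {"name": "Teacher B", "title": "Lecturer"},
]

# JSON: the whole feed is a single array.
as_json = json.dumps(items)

# JSON Lines: one independent JSON object per line.
as_jsonl = "\n".join(json.dumps(item) for item in items)

# A JSON feed has to be parsed as a whole ...
from_json = json.loads(as_json)
# ... whereas a JSON Lines feed can be parsed one line at a time.
from_jsonl = [json.loads(line) for line in as_jsonl.splitlines()]

print(from_json == from_jsonl)
```

Because each line is complete on its own, a JSON Lines file tends to be the more convenient choice for large crawls or for appending across runs.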

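Note:
parse(self, response): the parse method is called once each start URL has been downloaded. The Response object returned for that URL is passed in as its only argument. Its main jobs are:

(1) Parse the returned page data (response.body) and extract structured data (build items).
(2) Generate the request for the next page's URL.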
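
The extraction half of parse can be tried offline. The sketch below is only a stand-in: it uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and the HTML fragment, the teacher names, and the helper name extract_teachers are all made up for illustration. It assumes the page nests h3, h4 and p elements inside div elements with class li_txt, as the XPath expressions in the spider imply.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the markup the spider expects.
SAMPLE = """
<html><body>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Ten years of Java.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Python and data analysis.</p></div>
</body></html>
"""


def extract_teachers(markup):
    """Yield one dict per teacher block, mirroring the fields of MyspiderItem."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a limited XPath subset, which is enough for this lookup.
    for block in root.findall(".//div[@class='li_txt']"):
        yield {
            # default="" avoids an error when a child element is missing
            "name": block.findtext("h3", default=""),
            "title": block.findtext("h4", default=""),
            "info": block.findtext("p", default=""),
        }


teachers = list(extract_teachers(SAMPLE))
print(teachers)
```

In the spider, extract()[0] raises IndexError when an XPath matches nothing, so supplying a default for missing fields, as this sketch does, is one way to keep a single malformed block from stopping the crawl.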

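V. Extending the Introductory Example with a Pipeline File

First, a quick review of what yield does:

It pauses the function and returns a value
On the next call, execution resumes from where it last stopped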
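
A minimal sketch of those two points in plain Python. The function name count_up and the trace list are made up for illustration; the trace records when the function body actually runs.

```python
def count_up(limit):
    """Yield 1..limit, recording when the function body actually runs."""
    n = 0
    while n < limit:
        n += 1
        trace.append("about to yield %d" % n)
        yield n  # pause here and hand n to the caller
        trace.append("resumed after %d" % n)


trace = []
gen = count_up(2)    # nothing runs yet; this only creates the generator
first = next(gen)    # runs up to the first yield
second = next(gen)   # resumes right after the first yield, runs to the second
print(first, second)
print(trace)
```

The "resumed after 1" entry only appears once next is called a second time, which is the pause-and-resume behaviour that lets a spider hand over items one at a time.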

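The previous example collected the results in a list and returned them all at once. Below we process them with a pipeline file instead. This example builds on the code from the previous one: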

1. Modify the spider code to use yield

# coding:utf-8

import scrapy
from mySpider.items import MyspiderItem


class ItcastSpider(scrapy.Spider):
    # Identifying name of the spider
    name = "itcast"
    # Domain scope of the spider
    allowed_domains = ["itcast.cn"]
    # Start URL(s) of the crawl
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml#",)

    # Parse method, called once each start URL has been downloaded
    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        for each in teacher_list:
            teacher_name = each.xpath('./h3/text()').extract()[0]
            teacher_title = each.xpath('./h4/text()').extract()[0]
            teacher_info = each.xpath('./p/text()').extract()[0]

            item = MyspiderItem()
            item['name'] = teacher_name
            item['title'] = teacher_title
            item['info'] = teacher_info

            # Hand each item to the pipeline as soon as it is built
            yield item

2. Edit the project settings file

ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
}
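
The integer in ITEM_PIPELINES sets the order in which enabled pipelines run: items pass through pipelines from lower values to higher ones, and values are conventionally chosen in the 0-1000 range. The sketch below only illustrates that ordering with plain Python. The second entry (mySpider.pipelines.DedupPipeline) is hypothetical and does not exist in this project.

```python
# Hypothetical setting with two pipelines; only ItcastPipeline exists in this article.
ITEM_PIPELINES = {
    "mySpider.pipelines.ItcastPipeline": 300,
    "mySpider.pipelines.DedupPipeline": 100,  # hypothetical second pipeline
}

# Lower numbers run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```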

3. Write the pipeline file mySpider/mySpider/pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class ItcastPipeline(object):
    # Initialization method; optional, define it as needed
    def __init__(self):
        # Text mode with an explicit UTF-8 encoding (Python 3), so str can be written directly
        self.fileName = open("teacher.json", "w", encoding="utf-8")

    # Processes each item; this method is required
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.fileName.write(jsontext)
        return item

    # Optional; called when the spider closes
    def close_spider(self, spider):
        self.fileName.close()
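
A pipeline of this shape is an ordinary Python class, so it can be checked by hand without starting a crawl. The sketch below is an illustration under stated assumptions, not the article's pipeline: the class name JsonLinesPipeline, the path argument, and the sample record are made up, the file is opened in open_spider rather than __init__, and a plain dict stands in for a Scrapy item. It targets Python 3.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline with the same shape as ItcastPipeline."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the output.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand; spider=None is enough because it is never used.
out_path = os.path.join(tempfile.mkdtemp(), "teacher.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "老师甲", "title": "讲师"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Returning the item from process_item matters: whatever is returned is what any later pipeline receives.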

4. Run the program
Here itcast is the value of name in the spider code

scrapy crawl itcast

VI. Tencent Recruitment Example

1. Create a new project, tencent

scrapy startproject tencent

2. Write the items.py file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Position name
    positionName = scrapy.Field()
    # Link to the detail page
    positionLink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()

3. Write the spider code; here a command can generate the code template
scrapy genspider <spider name> <domain>

scrapy genspider tencentSpider "tencent.com"

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class TencentspiderSpider(scrapy.Spider):
    name = "tencentSpider"
    allowed_domains = ["tencent.com"]

    # Listing URL; the start parameter is the paging offset
    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [
        url + str(offset)
    ]

    def handleResult(self, result):
        # Return the first extracted value, or a single space if nothing matched
        if len(result) >= 1:
            return result.extract()[0]
        else:
            return " "

    def parse(self, response):
        # print('----------')
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):

            # Instantiate the item model
            item = TencentItem()
            item['positionName'] = self.handleResult(each.xpath("./td[1]/a/text()"))
            item['positionLink'] = self.handleResult(each.xpath("./td[1]/a/@href"))
            item['positionType'] = self.handleResult(each.xpath("./td[2]/text()"))
            item['peopleNum'] = self.handleResult(each.xpath("./td[3]/text()"))
            item['workLocation'] = self.handleResult(each.xpath("./td[4]/text()"))
            item['publishTime'] = self.handleResult(each.xpath("./td[5]/text()"))

            # Hand the item over to the pipeline file for processing
            yield item
            # print(item['positionName'] + '----------------')

        # Move on to the next page until the offset reaches 500
        if self.offset < 500:
            self.offset += 10

            # Send a new request back to the scheduler, which queues it and
            # later hands it to the downloader
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
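
The paging logic in the spider amounts to stepping the start parameter by 10 from 0 up to 500. The sketch below reproduces only that URL sequence in plain Python so it can be inspected without sending any requests. The helper name page_urls and its parameters are made up for illustration; the base URL is the one used in the spider.

```python
BASE_URL = "http://hr.tencent.com/position.php?&start="  # URL used in the spider


def page_urls(base_url, step=10, last_offset=500):
    """Yield the listing URLs for offsets 0, step, ..., last_offset."""
    for offset in range(0, last_offset + 1, step):
        yield base_url + str(offset)


urls = list(page_urls(BASE_URL))
print(len(urls))
print(urls[0])
print(urls[-1])
```

The spider itself does not precompute this list. It requests one page, and only while parsing that page does it schedule the next one, so the pages are fetched in sequence.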

4. Write the pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):

    def __init__(self):
        # Text mode with an explicit UTF-8 encoding (Python 3), so str can be written directly
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()

5. Edit the project settings file

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# ... (omitted)
DOWNLOAD_DELAY = 3
# ... (omitted)
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

6. Run the code