Python Web Crawling 4 - Getting Started with scrapy

This post was first published at www.litreily.top

scrapy is a powerful crawling framework and well worth learning properly. This post is a summary of my experience learning and using scrapy. The content is fairly basic, essentially beginner's notes, covering scrapy's core concepts and basic usage.

scrapy framework

First, the classic scrapy architecture diagram:

The scrapy framework consists of the following components:

  1. Scrapy Engine
  2. Spiders
  3. Scheduler
  4. Downloader
  5. Item Pipeline
  6. Downloader Middlewares
  7. Spider Middlewares

spider process

The crawling process can be summarized as follows:

  1. The Engine gets the first URL to crawl from the Spider and passes it to the Scheduler.
  2. The Scheduler stores the URL in its queue.
  3. The Engine asks the Scheduler for the next URL to crawl and passes the resulting request through the Downloader Middlewares to the Downloader.
  4. The Downloader fetches the page from the web and passes the downloaded page back to the Engine through the Downloader Middlewares.
  5. The Engine passes the page to the Spider through the Spider Middlewares.
  6. The Spider parses the page and hands the resulting Items and new URLs to the Engine through the Spider Middlewares.
  7. The Engine sends the Items received from the Spider to the Item Pipeline and sends the new requests to the Scheduler.
  8. Steps 2-7 repeat until there are no URLs left to crawl.

A few notes: the Item Pipeline is mainly responsible for data cleaning, validation and persistent storage; the Downloader Middlewares act as hooks between the Downloader and the Engine, used to inspect or modify outgoing requests and downloaded pages, for example rewriting request headers; the Spider Middlewares act as hooks between the Spiders and the Engine, handling the Spider's input and output, i.e. page responses on the way in and the Items and requests produced by the Spider on the way out.
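
As a rough illustration of the downloader-middleware hook, a minimal middleware that rewrites the User-Agent header of every outgoing request might look like the sketch below (the class name and header value are made up for this example; to take effect, the class would need to be registered under DOWNLOADER_MIDDLEWARES in settings.py):

# A minimal downloader middleware sketch: rewrite the User-Agent header
# of every request before it reaches the Downloader. Class name and
# header value are illustrative only.
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0 (demo spider)'
        return None  # returning None lets processing continue as usual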

Items

As for what Items are: I think of an Item as one unit of data produced by the Spider's parsing, holding a group of fields. For example, if you crawl product information from a website, each page may yield several products, and each product has a name, a price, a production date, a style, and so on. We can then define an Item like this:

from scrapy.item import Item
from scrapy.item import Field

class GoodsItem(Item):
    name = Field()
    price = Field()
    date = Field()
    types = Field()

Field() is essentially an extension of Python's dict type. As the code above shows, one Item corresponds to one product, and a single page may contain one or more products. All Item fields are populated inside the Spider and then handed to the Item Pipeline via the Engine. Concrete examples will appear in later posts; this one only covers scrapy's basic concepts and usage.
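
As a minimal sketch (the XPath expressions are hypothetical placeholders for a real product page), a spider's parse callback might populate a GoodsItem roughly like this:

# Sketch: populating GoodsItem inside a spider's parse callback.
# The XPath expressions are placeholders, not taken from any real site.
def parse(self, response):
    for product in response.xpath('//div[@class="product"]'):
        item = GoodsItem()
        item['name'] = product.xpath('./h2/text()').extract_first()
        item['price'] = product.xpath('./span[@class="price"]/text()').extract_first()
        item['date'] = product.xpath('./span[@class="date"]/text()').extract_first()
        item['types'] = product.xpath('./span[@class="type"]/text()').extract_first()
        yield item  # the engine forwards each item to the Item Pipeline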

Install

with pip

pip install scrapy

or conda

conda install -c conda-forge scrapy

The basic commands are:

D:\WorkSpace>scrapy --help
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

If you want to work inside a virtual environment, install virtualenv first:

pip install virtualenv
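
A typical workflow then looks roughly like this (the environment name venv is arbitrary; the activate path shown is for Windows):

# create a virtual environment named "venv" (the name is arbitrary)
virtualenv venv

# activate it on Windows; on Linux/macOS use: source venv/bin/activate
venv\Scripts\activate

# install scrapy inside the environment
pip install scrapy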

scrapy startproject

scrapy startproject <project-name> [project-dir]

This command creates a new scrapy project. Taking demo as an example:

$ scrapy startproject demo
...
You can start your first spider with:
    cd demo
    scrapy genspider example example.com

$ cd demo
$ tree
.
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

As you can see, startproject automatically generates a set of folders and files:

  1. scrapy.cfg: project configuration file; usually there is no need to modify it
  2. items.py: where Items are defined, e.g. the GoodsItem above
  3. middlewares.py: middleware code; by default it contains a downloader middleware and a spider middleware
  4. pipelines.py: the item pipeline, which processes the items returned by spiders, including cleaning, validation and persistence
  5. settings.py: the global configuration file, holding all kinds of global settings (see the sketch after this list)
  6. spiders: the folder that holds all spider files; note that one project can contain multiple spiders
  7. __init__.py: marks the containing folder as a Python package
  8. __pycache__: stores the .pyc files (a cross-platform byte code) generated by the interpreter; in Python 2 these files live in the same folder as the .py files
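
As a rough example of what goes into settings.py, a few commonly adjusted settings are sketched below (the values are illustrative, not recommendations; ROBOTSTXT_OBEY, DOWNLOAD_DELAY, DEFAULT_REQUEST_HEADERS and ITEM_PIPELINES are standard scrapy settings):

# settings.py (excerpt) -- example values only
ROBOTSTXT_OBEY = True           # respect robots.txt
DOWNLOAD_DELAY = 1              # seconds to wait between requests
DEFAULT_REQUEST_HEADERS = {     # headers added to every request by default
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
ITEM_PIPELINES = {
    # enable a pipeline class and give it an order (lower runs first)
    'demo.pipelines.DemoPipeline': 300,
}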

scrapy genspider

Once the project has been created, you can use the scrapy genspider command to generate a spider file automatically. For example, to crawl the home page of huaban.com, run:

$ cd demo
$ scrapy genspider huaban www.huaban.com

The generated spider file huaban.py looks like this by default:

# -*- coding: utf-8 -*-
import scrapy


class HuabanSpider(scrapy.Spider):
    name = 'huaban'
    allowed_domains = ['www.huaban.com']
    start_urls = ['http://www.huaban.com/']

    def parse(self, response):
        pass

  • The spider class inherits from scrapy.Spider
  • name is a required attribute used to identify the spider
  • allowed_domains lists the domains the spider is allowed to crawl; links outside these domains are dropped
  • start_urls holds the spider's start URLs; it is a list, so multiple URLs can be stored at once

If you want to customize the start requests, you can also override the start_requests method of the scrapy.Spider class; it is not covered in detail here.
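
For reference, overriding start_requests might look roughly like the sketch below (the URLs are placeholders; scrapy.Request and its callback argument are part of the standard API):

# Sketch: building the initial requests by hand instead of using start_urls.
# The URLs are placeholders; callback points at the spider's parse method.
def start_requests(self):
    urls = ['http://www.huaban.com/', 'http://www.huaban.com/discovery/']
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)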

parse is the default callback: after the Downloader has fetched a page, this method is called to parse it, and response holds the response to the request. For parsing page content, scrapy provides several built-in Selectors, including XPath selectors, CSS selectors and regular-expression matching. Below are some selector examples to give a more concrete feel for how they are used.

# xpath selector
response.xpath('//a')
response.xpath('./img').extract()
response.xpath('//*[@id="huaban"]').extract_first()
response.xpath('//*[@id="Profile"]/div[1]/a[2]/text()').extract_first()

# css selector
response.css('a').extract()
response.css('#Profile > div.profile-basic').extract_first()
response.css('a[href="test.html"]::text').extract_first()

# re selector
response.xpath('.').re('id:\s*(\d+)')
response.xpath('//a/text()').re_first('username: \s(.*)')

Note that re and re_first cannot be called directly on response; they must be chained after an xpath or css call, as shown above.

scrapy crawl

Once the spider is written, you can start the crawl with the scrapy crawl command.

Inside a created scrapy project directory, scrapy -h shows more help than it does outside a project, including scrapy crawl, the command used to launch a crawl:

$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

scrapy crawl的帮助信息可以看出,该指令包含很多可选参数,但必选参数只有一个,就是spider,即要执行的爬虫名称,对应每个爬虫的名称(name)。

scrapy crawl huaban
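
For example, to run the huaban spider, dump the scraped items to a JSON file and raise the log level, the options listed above can be combined like this (the output file name is arbitrary):

scrapy crawl huaban -o items.json -L INFO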

That covers creating and running a scrapy crawl task; concrete examples will follow in later posts.

scrapy shell

Finally, a brief word on the scrapy shell command. It is an interactive shell, similar to a command-line Python session. When you are just starting to learn scrapy, or starting to crawl an unfamiliar site, you can use it to get familiar with the various functions and selectors, experimenting and correcting mistakes until you have a good grasp of scrapy's usage.

$ scrapy shell www.huaban.com
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3
2017, 17:26:49) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-05-29 23:58:49 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-05-29 23:58:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-29 23:58:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-29 23:58:50 [scrapy.core.engine] INFO: Spider opened
2018-05-29 23:58:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://huaban.com/> from <GET http://www.huaban.com>
2018-05-29 23:58:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://huaban.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x03385CB0>
[s]   item       {}
[s]   request    <GET http://www.huaban.com>
[s]   response   <200 http://huaban.com/>
[s]   settings   <scrapy.settings.Settings object at 0x04CC4D10>
[s]   spider     <DefaultSpider 'default' at 0x4fa6bf0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: view(response)
Out[1]: True

In [2]: response.xpath('//a')
Out[2]:
[<Selector xpath='//a' data='<a id="elevator" class="off" onclick="re'>,
 <Selector xpath='//a' data='<a class="plus"></a>'>,
 <Selector xpath='//a' data='<a onclick="app.showUploadDialog();">添加采'>,
 <Selector xpath='//a' data='<a class="add-board-item">添加画板<i class="'>,
 <Selector xpath='//a' data='<a href="/about/goodies/">安装采集工具<i class'>,
 <Selector xpath='//a' data='<a class="huaban_security_oauth" logo_si'>]

In [3]: response.xpath('//a').extract()
Out[3]:
['<a id="elevator" class="off" onclick="return false;" title="回到顶部"></a>',
 '<a class="plus"></a>',
 '<a onclick="app.showUploadDialog();">添加采集<i class="upload"></i></a>',
 '<a class="add-board-item">添加画板<i class="add-board"></i></a>',
 '<a href="/about/goodies/">安装采集工具<i class="goodies"></i></a>',
 '<a class="huaban_security_oauth" logo_size="124x47" logo_type="realname" href="//www.anquan.org" rel="nofollow"> <script src="//static.anquan.org/static/outer/js/aq_auth.js"></script> </a>']

In [4]: response.xpath('//img')
Out[4]: [<Selector xpath='//img' data='<img src="https://d5nxst8fruw4z.cloudfro'>]

In [5]: response.xpath('//a/text()')
Out[5]:
[<Selector xpath='//a/text()' data='添加采集'>,
 <Selector xpath='//a/text()' data='添加画板'>,
 <Selector xpath='//a/text()' data='安装采集工具'>,
 <Selector xpath='//a/text()' data=' '>,
 <Selector xpath='//a/text()' data=' '>]

In [6]: response.xpath('//a/text()').extract()
Out[6]: ['添加采集', '添加画板', '安装采集工具', ' ', ' ']

In [7]: response.xpath('//a/text()').extract_first()
Out[7]: '添加采集'