Scrapy in Practice (Part 1)

What is Scrapy?

"Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data." -- the official description.
My own take: Scrapy crawls web pages and turns the scraped data into structured records. You only need to worry about your crawling logic and how to extract data from each page; the framework takes care of everything else.

Installing Scrapy

yum -y update
yum groupinstall -y development
yum install -y zlib-devel openssl-devel sqlite-devel bzip2-devel libffi-devel python-devel libxslt-devel
cd /pkg
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
tar -xvf setuptools-1.4.2.tar.gz
cd setuptools-1.4.2
python setup.py install
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python -
pip install scrapy

Run scrapy version to check the installation:

[root@jianzhi-dev ~]# scrapy version
2015-12-15 09:04:30 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-15 09:04:30 [scrapy] INFO: Optional features available: ssl, http11
2015-12-15 09:04:30 [scrapy] INFO: Overridden settings: {}

If you see output like the above, Scrapy is installed correctly.

Overview of the project structure

Run scrapy startproject pn, where pn is the project name (pick whatever name fits your own project). This creates a pn directory under the current directory with the following structure:
.
├── pn
│   ├── __init__.py
│   ├── items.py      // where you define your data structures; scraped page data gets structured into these
│   ├── pipelines.py  // every scraped item passes through the pipelines, e.g. to save it to MySQL
│   ├── settings.py   // project settings, e.g. to control the crawl speed
│   └── spiders       // where the crawling logic lives
│       └── __init__.py
└── scrapy.cfg
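
The generated pipelines.py and settings.py start out essentially empty. As a rough sketch of how they fit together (none of this is in the freshly generated project; PnPipeline and the setting values are just examples), a pipeline only needs a process_item method, and it must be enabled in settings.py:

# pipelines.py -- minimal illustrative pipeline
class PnPipeline(object):
    def process_item(self, item, spider):
        # Every item yielded by a spider passes through here;
        # this is where you would save it to MySQL, a file, etc.
        spider.logger.info("scraped item: %s" % dict(item))
        return item

# settings.py -- enable the pipeline and slow the crawl down (values are illustrative)
ITEM_PIPELINES = {'pn.pipelines.PnPipeline': 300}
DOWNLOAD_DELAY = 1        # wait one second between requests
CONCURRENT_REQUESTS = 8   # cap the number of parallel requests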

A first simple example

This example scrapes the Baidu Baike search-results page for 刘德华 (Andy Lau) and extracts each result's URL and title.

Define the data structure in items.py:

import scrapy

class PnItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
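
Items behave much like Python dictionaries. A quick illustrative check (not part of the project files):

from pn.items import PnItem

item = PnItem(title='example', url='http://example.com')
print(item['title'])   # -> example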

Create your spider class:

cd spiders/
vim pn_spider.py

# -*- coding: UTF-8 -*-
import scrapy
from pn.items import PnItem

class PnSpider(scrapy.spiders.Spider):
    name = "pn"
    start_urls = [
        "http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8"
    ]

    def parse(self, response):
        for sel in response.xpath("//dl[@class='search-list']/dd"):
            item = PnItem()
            item['title'] = sel.xpath('a/text()').extract()[0]
            item['url'] = sel.xpath('a/@href').extract()[0]
            yield item

After saving the file, run scrapy crawl pn from the project root to start the crawl. The output looks like this:

2015-12-15 10:04:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: pn)
2015-12-15 10:04:23 [scrapy] INFO: Optional features available: ssl, http11
2015-12-15 10:04:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pn.spiders', 'SPIDER_MODULES': ['pn.spiders'], 'BOT_NAME': 'pn'}
2015-12-15 10:04:49 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-15 10:04:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-15 10:04:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-15 10:04:49 [scrapy] INFO: Enabled item pipelines:
2015-12-15 10:04:49 [scrapy] INFO: Spider opened
2015-12-15 10:04:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-15 10:04:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 10:04:55 [scrapy] DEBUG: Crawled (200) <GET http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8> (referer: None)
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/subview/1758/18233157.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6768\u4e3d\u5a1f(',
 'url': u'http://baike.baidu.com/subview/872134/8550376.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u56db\u5927\u5929\u738b(\u9999\u6e2f\u56db\u5927\u5929\u738b)_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/subview/20129/5747579.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6f14\u5531\u4f1a99_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/view/757747.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'ALways', 'url': u'http://baike.baidu.com/view/10726576.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u5144\u5f1f\u4e4b\u751f\u6b7b\u540c\u76df_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/view/1182768.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6797\u5bb6\u680b_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/view/19592.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6768\u4e3d\u5a1f\u4e8b\u4ef6_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/view/1047445.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u5200\u5251\u7b11(1994\u5e74\u9ec4\u6cf0\u6765\u6267\u5bfc\u7535\u5f71)_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/subview/1064013/6839067.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u81f3\u5c0a\u65e0\u4e0a\u2161\u4e4b\u6c38\u9738\u5929\u4e0b_\u767e\u5ea6\u767e\u79d1',
 'url': u'http://baike.baidu.com/view/3908825.htm'}
2015-12-15 10:04:55 [scrapy] INFO: Closing spider (finished)
2015-12-15 10:04:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6822,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 15, 2, 4, 55, 242394),
 'item_scraped_count': 10,
 'log_count/DEBUG': 12,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 12, 15, 2, 4, 49, 176917)}
2015-12-15 10:04:55 [scrapy] INFO: Spider closed (finished)
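
Instead of copying the items out of the log, you can also let Scrapy's built-in feed export write them straight to a file (the file name is arbitrary; .csv and .xml work the same way):

scrapy crawl pn -o items.json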

Code walkthrough

class PnSpider(scrapy.spiders.Spider):
Defines the spider class.

name = "pn"
Names the spider; this is the name you pass when running scrapy crawl pn.

start_urls
The list of URLs the spider starts crawling from.

def parse(self, response):
The callback invoked once a page has been downloaded; this is where you process the response.

"//dl[@class='search-list']/dd"
This is XPath syntax (worth reading up on if it is new to you). In short, XPath is a query language for page content, and it is how you locate the data you want.
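
A handy way to experiment with XPath expressions before wiring them into the spider is scrapy shell, which downloads a page and opens an interactive prompt (shown here against the same search page the spider uses):

scrapy shell "http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8"
>>> response.xpath("//dl[@class='search-list']/dd/a/text()").extract()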

item['title'] = sel.xpath('a/text()').extract()[0]
item['url'] = sel.xpath('a/@href').extract()[0]
Stores the extracted data in the item structure you defined earlier.
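
Note that extract()[0] raises an IndexError if the XPath matches nothing. A slightly more defensive version of parse (purely illustrative, not part of the original example) simply skips such entries:

    def parse(self, response):
        for sel in response.xpath("//dl[@class='search-list']/dd"):
            titles = sel.xpath('a/text()').extract()
            urls = sel.xpath('a/@href').extract()
            if not titles or not urls:
                continue  # skip results missing a title or a link
            item = PnItem()
            item['title'] = titles[0]
            item['url'] = urls[0]
            yield item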
