Scrapy is a fast web-scraping framework for crawling websites and extracting structured data from their pages. It is versatile and very easy to use. Below we walk through Scrapy's basic usage by scraping the "top ten" hottest topics on the BYR forum (bbs.byr.cn).
1. Creating a project
Before crawling anything, you must first create a Scrapy project. Change to the directory where you want to store your code and run the following command:
scrapy startproject byr
This command creates the following directory structure:
byr/
    scrapy.cfg
    byr/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Here, scrapy.cfg is the project's configuration file; byr/items.py defines the fields the spider will extract; byr/pipelines.py is the pipeline file — once a spider has collected an item, it is handed to the pipeline, which typically validates, deduplicates, and stores the scraped data; byr/settings.py holds the project settings; and byr/spiders/ is the directory where spider code lives.
2. Defining items
The fields to scrape are defined in items.py. In this example we scrape the BYR forum's top ten topics, so the fields are the title (title), the board (broad), the post URL (link), the author (author), and the posting time (pubDate). The code is as follows:
import scrapy

class ByrItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    broad = scrapy.Field()
    link = scrapy.Field()
    author = scrapy.Field()
    pubDate = scrapy.Field()
3. Writing the spider
To create a spider, you must subclass scrapy.Spider and define the following:
name: identifies the spider; the name must be unique within the project.
start_urls: the list of URLs the spider starts crawling from.
parse(): a method of the spider. When called, the Response object produced for each initial URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting it into items, and generating Request objects for any URLs that need further processing.
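Since start_urls here points at an RSS feed, parse() essentially walks the feed's <item> elements. The following standard-library sketch (using xml.etree.ElementTree rather than Scrapy selectors, on a made-up RSS snippet in the same shape as bbs.byr.cn/rss/topten) illustrates the kind of extraction parse() performs:

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet in the same shape as bbs.byr.cn/rss/topten.
rss = """<rss><channel>
  <item>
    <title>Example post</title>
    <link>http://bbs.byr.cn/article/Movie/307258</link>
    <author>DD418</author>
    <pubDate>Tue, 22 Dec 2015 13:06:56 GMT</pubDate>
  </item>
</channel></rss>"""

root = ET.fromstring(rss)
for item in root.iter('item'):          # same idea as response.xpath('//item')
    title = item.findtext('title')      # cf. sel.xpath('title/text()')
    link = item.findtext('link')
    print(title, link)
```

Scrapy's selectors do the same traversal with real XPath expressions against the downloaded response.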
An example spider:
# -*- coding:utf-8 -*-
import scrapy

from byr.items import ByrItem

class TopTenSpider(scrapy.spiders.Spider):
    '''
    @start_urls: the forum's top-ten RSS feed
    @en_file_path: file listing the boards' English names
    @ch_file_path: file listing the boards' Chinese names
    '''
    name = "topten"
    start_urls = [
        "http://bbs.byr.cn/rss/topten"
    ]
    en_file_path = "en_broad.bat"
    ch_file_path = "ch_broad.bat"

    def parse(self, response):
        '''
        Extract the post information from the response:
        @title: post title
        @broad: board the post belongs to
        @link: post URL
        @author: post author
        @pubDate: posting time
        '''
        for sel in response.xpath('//item'):
            item = ByrItem()
            title = sel.xpath('title/text()').extract()
            link = sel.xpath('link/text()').extract()
            author = sel.xpath('author/text()').extract()
            pubDate = sel.xpath('pubDate/text()').extract()
            item['title'] = [n.encode('utf-8') for n in title]
            item['broad'] = [self.get_broad(link)]
            item['link'] = link
            item['author'] = author
            item['pubDate'] = pubDate
            yield item

    def get_broad(self, link):
        '''
        Look up which board a top-ten post belongs to.
        '''
        en_broad_list = self.get_broad_list(self.en_file_path)
        ch_broad_list = self.get_broad_list(self.ch_file_path)
        broad = link[0].split('/')[-2].strip().lower()
        return ch_broad_list[en_broad_list.index(broad)]

    def get_broad_list(self, file_path):
        '''
        Read a list of board names from a file.
        '''
        broad_list = []
        f = open(file_path)
        for line in f:
            for name in line.split(','):
                broad_list.append(name.strip())
        f.close()
        return broad_list
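The URL-to-board mapping in get_broad() can be illustrated with plain Python. The two lists below are hypothetical stand-ins for the contents of en_broad.bat and ch_broad.bat:

```python
# Hypothetical stand-ins for the contents of en_broad.bat / ch_broad.bat:
# the two files hold the boards' English and Chinese names at matching indices.
en_broad_list = ['movie', 'guitar', 'picture']
ch_broad_list = ['电影', '吉他', '贴图秀']

# The board's English name is the second-to-last path segment of the post URL.
link = ['http://bbs.byr.cn/article/Movie/307258']
broad = link[0].split('/')[-2].strip().lower()  # 'movie'
print(ch_broad_list[en_broad_list.index(broad)])  # 电影
```

The lookup relies on the two files listing board names in the same order, since the index found in one list is used directly in the other.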
4. Saving the results
Saving the scraped results is implemented in pipelines.py. The following example writes each scraped item to a JSON file:
import json
import codecs
import time

class ByrPipeline(object):
    def __init__(self):
        self.date = time.strftime('%Y%m%d', time.localtime(time.time()))
        self.file_path = 'result/topten_' + self.date + '.json'
        self.file = codecs.open(self.file_path, 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item
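Note that the decode("unicode_escape") step is Python 2 specific: json.dumps() escapes non-ASCII characters as \uXXXX sequences by default, and the decode turns them back into readable characters before writing. Under Python 3 the usual approach is ensure_ascii=False, sketched here with a hypothetical item dict:

```python
import json

# A hypothetical item in the same shape the spider yields.
item = {'title': ['电影'], 'author': ['DD418']}

# ensure_ascii=False writes the characters directly instead of \uXXXX escapes.
line = json.dumps(item, ensure_ascii=False) + '\n'
print(line)
```

Either way, each item becomes one JSON object per line in the output file.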
To enable an item pipeline component, you must add it to the project settings. Add the following to settings.py:
ITEM_PIPELINES = {
    'byr.pipelines.ByrPipeline': 300
}
5. Running the spider
Run the following command from the command line:
scrapy crawl topten
This produces a JSON file containing the scraped results. For example:
{"broad":["家庭生活"], "title": ["可气的女票家要10万彩礼+房子+不能和父母同住+不能把户口迁过去"], "link":["http://bbs.byr.cn/article/FamilyLife/122876"], "pubDate":["Tue, 22 Dec 2015 13:25:53 GMT"], "author":["bigzhao"]}
{"broad":["心理健康在线"], "title": ["关于今日十大头条日语系学长的言论,不能忍了"], "link":["http://bbs.byr.cn/article/PsyHealthOnline/52301"],"pubDate": ["Tue, 22 Dec 2015 13:18:12 GMT"],"author": ["urootya"]}
{"broad":["缘来如此"], "title": ["【王道】帮两位湖南美女闺蜜征靠谱男友啦"], "link":["http://bbs.byr.cn/article/Friends/1710291"], "pubDate":["Tue, 22 Dec 2015 13:05:58 GMT"], "author":["cby333333"]}
{"broad":["贴图秀"], "title": ["情暖伊冬·伊冬迹忆"], "link":["http://bbs.byr.cn/article/Picture/3124944"], "pubDate":["Tue, 22 Dec 2015 12:36:36 GMT"], "author": ["saveme1018"]}
{"broad":["情感的天空"], "title": ["为什么有些女生条件不好也不接受追求非要等到27,8岁再出来征友"], "link":["http://bbs.byr.cn/article/Feeling/2849962"], "pubDate":["Tue, 22 Dec 2015 13:26:47 GMT"], "author":["liufeier"]}
{"broad":["吉他"], "title": ["【乐队】第十九届北邮摇滚夜 蓄势待发!"], "link":["http://bbs.byr.cn/article/Guitar/149172"], "pubDate":["Tue, 22 Dec 2015 13:24:53 GMT"], "author":["zfy2014"]}
{"broad":["电影"], "title": ["觉得寻龙诀 不错呀"], "link":["http://bbs.byr.cn/article/Movie/307258"], "pubDate":["Tue, 22 Dec 2015 13:06:56 GMT"], "author":["DD418"]}
{"broad":["学习交流区"], "title": ["立帖为证,盲审不中!"], "link":["http://bbs.byr.cn/article/StudyShare/165378"], "pubDate":["Tue, 22 Dec 2015 11:53:49 GMT"], "author":["dbdb"]}
{"broad":["电脑硬件与维修"], "title": ["高性能+便携解决方案"], "link": ["http://bbs.byr.cn/article/HardWare/210977"],"pubDate": ["Tue, 22 Dec 2015 12:48:27 GMT"],"author": ["sharonyue"]}
{"broad":["考研专版"], "title": ["肖四完全记不住怎么破"], "link":["http://bbs.byr.cn/article/AimGraduate/1039268"],"pubDate": ["Tue, 22 Dec 2015 13:14:03 GMT"],"author": ["chen0yi"]}