The strength of CrawlSpider over a plain scrapy Spider is this: with an ordinary spider, once one page is done you have to yield a new request yourself to fetch the next page, whereas a CrawlSpider only needs a set of rules — URLs that match the rules get downloaded, and URLs that don't are skipped.
A CrawlSpider is created with the command scrapy genspider -t crawl [spider name] [domain].
First open http://www.wxapp-union.com/, then create the project:

scrapy startproject wxapp
cd wxapp
scrapy genspider -t crawl wxapp_spider wxapp-union.com
Opening the generated file, you'll see it differs quite a bit from a spider created without the crawl template.
With the project created, it's time to crawl some data. First analyze the site. The list pages look like http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1, and different pages differ only in the final page number. Next open an article detail page, http://www.wxapp-union.com/article-4863-1.html, and then another one, http://www.wxapp-union.com/article-4862-1.html: one ends in 4863, the other in 4862. With this pattern found, we can write the LinkExtractor.
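Before wiring these patterns into a LinkExtractor, it's easy to sanity-check them against sample URLs with the re module (the two URLs are the ones from above; the patterns are the ones the rules will use):

```python
import re

# Candidate patterns: one for the list pages, one for the article detail pages
list_pat = re.compile(r'.+list&catid=2&page=\d')
article_pat = re.compile(r'.+article-.+1\.html')

list_url = 'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1'
article_url = 'http://www.wxapp-union.com/article-4863-1.html'

print(bool(list_pat.match(list_url)))        # True
print(bool(article_pat.match(article_url)))  # True
print(bool(article_pat.match(list_url)))     # False — list pages won't hit the article rule
```

Each URL matches exactly one of the two patterns, so the two rules won't fight over the same links.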
Then update the start URL:

class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
The template has generated a rules attribute for us; from here on, the CrawlSpider extracts URLs according to each rule's LinkExtractor. Right now we want all of the list pages. The allow argument is a regex specifying which URLs may be crawled, so change it to allow=r'.+list&catid=2&page=\d' — note that ? and . are regex metacharacters, so it is safer to match on a distinctive fragment of the URL than to paste the full URL in unescaped.
callback names the function that parses the pages downloaded for a rule. For the list pages we don't need one: they are only crawled so the spider can collect the article URLs on each page, and CrawlSpider does that collection for us automatically.
follow controls whether, when crawling a page matched by this rule, similar URLs found on that page should also be followed; for the list pages it must be True. Then add a second rule:
Rule(LinkExtractor(allow=r'.+article-.+1\.html'), callback='parse_detail', follow=False)
This rule matches the detail pages, which we parse ourselves with parse_detail and do not need to follow further.
Let's grab the title first:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+1\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        print(response.xpath('//h1[@class="ph"]/text()').get())
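As a stdlib-only illustration of what that XPath expression selects (the HTML snippet below is made up; in the real spider, response.xpath runs against the downloaded page and returns selectors whose text is pulled out with .get()):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an article page
html = "<div><h1 class='ph'>Demo tutorial title</h1><span class='time'>2018-1-1</span></div>"
root = ET.fromstring(html)

# Same idea as response.xpath('//h1[@class="ph"]/text()').get():
# find the h1 whose class attribute is "ph" and take its text
title = root.find(".//h1[@class='ph']").text
print(title)  # Demo tutorial title
```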
Then set up the run configuration (interpreter) and run the spider — the titles print successfully.
Now we can parse the rest of the data:
    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        # note: this href is specific to a single author's profile page;
        # a selector keyed on the author block itself would generalize better
        author = response.xpath('//a[@href="space-uid-17761.html"]/text()').get()
        time = response.xpath('//span[@class="time"]/text()').get()
        article_content = response.xpath('//td[@id="article_content"]//text()').getall()
        article_content = ''.join(article_content).strip()
        print(article_content)
        print('=' * 100)
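getall() returns a list of every text node under the content cell, whitespace included; joining and stripping collapses that into one string. A quick illustration with made-up fragments:

```python
# Made-up text nodes, roughly as .getall() might return them
fragments = ['\n  ', 'WeChat mini program tutorial', '\n  ', 'Step one: register', '\n']

content = ''.join(fragments).strip()
print(content)
```

Note that strip() only trims the two ends; whitespace between text nodes survives, which is usually what you want for article text.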
We now have all the data we wanted; the next step is to store it as JSON.
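JsonLinesItemExporter, used below, writes one JSON object per line rather than one big JSON array, which keeps memory usage flat and makes the output easy to stream. The stdlib equivalent of the format looks like this (the records are made up):

```python
import io
import json

# Made-up records standing in for scraped items
records = [
    {'title': 'Tutorial one', 'author': 'alice'},
    {'title': 'Tutorial two', 'author': 'bob'},
]

buf = io.StringIO()
for rec in records:
    # One JSON object per line — the "JSON lines" format
    buf.write(json.dumps(rec, ensure_ascii=False) + '\n')

print(buf.getvalue())
```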
Before storing, define the item fields:
import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()
Then write the pipeline that actually stores the items:
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        self.fp = open('wxjc.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
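The pipeline only runs if it is enabled in settings.py. Assuming the project keeps the default module layout (wxapp/pipelines.py), the entry looks like this — the number is just a priority (lower runs first), and 300 is a common default:

```python
# settings.py
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}
```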
The final spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+1\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        author = response.xpath('//a[@href="space-uid-17761.html"]/text()').get()
        time = response.xpath('//span[@class="time"]/text()').get()
        article_content = response.xpath('//td[@id="article_content"]//text()').getall()
        article_content = ''.join(article_content).strip()
        item = WxappItem(title=title, author=author, time=time, content=article_content)
        yield item