1. Pick a website
Suppose we want to extract the url, name, description, and size of every file added today on the Mininova site.
The URL is http://www.mininova.org/today
2. Define the data
Define the data to be scraped through Scrapy Items.
Example (a BT file is a BitTorrent file):
[Python]
- from scrapy.item import Item, Field
-
- class TorrentItem(Item):
-     url = Field()
-     name = Field()
-     description = Field()
-     size = Field()
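A Scrapy Item behaves like a dict that only accepts the fields declared on the class. As a rough stdlib-only sketch of that behavior (the real implementation uses a metaclass to collect the Field declarations; names and values here are hypothetical):

```python
class Field(dict):
    """Stand-in for scrapy.item.Field: a plain metadata container."""

class Item(dict):
    """Minimal sketch of scrapy.item.Item: dict-style access
    restricted to the fields declared on the subclass."""
    fields = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)

class TorrentItem(Item):
    fields = {'url': Field(), 'name': Field(),
              'description': Field(), 'size': Field()}

torrent = TorrentItem()
torrent['name'] = 'Darwin - The Evolution Of An Exhibition'  # accepted
# torrent['author'] = '...'  # would raise KeyError: undeclared field
```

Declaring fields up front is what lets Scrapy catch typos in field names at assignment time instead of silently storing bad keys.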
3. Write the spider
1. Inspect the source of the start URL.
2. Find the pattern in the torrent URLs (here each one is http://www.mininova.org/tor/ followed by a number, so the regular expression "/tor/\d+" matches every file's URL).
3. Build XPath expressions to select the data we need: name, description, and size.
[HTML source]
- <h1>Darwin - The Evolution Of An Exhibition</h1>
- <h2>Description:</h2>
- <div id="description">
- Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
- ...
- <div id="specifications">
- <p>
- <strong>Category:</strong>
- <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
- </p>
-
- <p>
- <strong>Total size:</strong>
- 150.62 megabyte</p>
From the markup above, we can see that name sits inside the <h1> tag;
its XPath expression is: //h1/text()
description is in the div with id="description";
its XPath expression is: //div[@id='description']
size is in the second <p> tag inside the div with id="specifications";
its XPath expression is: //div[@id='specifications']/p[2]/text()[2]
(text()[2] selects the second text node in that paragraph: the size value that follows the <strong> label.)
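These expressions can be sanity-checked offline. The stdlib's xml.etree.ElementTree supports only a subset of XPath (no trailing text() steps), but it is enough to verify the element structure against a simplified, well-formed version of the excerpt above:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed rewrite of the page excerpt above.
page = """
<html>
  <h1>Darwin - The Evolution Of An Exhibition</h1>
  <div id="description">Short documentary made for Plymouth City Museum...</div>
  <div id="specifications">
    <p><strong>Category:</strong> Movies</p>
    <p><strong>Total size:</strong> 150.62 megabyte</p>
  </div>
</html>
"""

root = ET.fromstring(page)
name = root.find('.//h1').text
description = root.find(".//div[@id='description']").text
# Full XPath would be //div[@id='specifications']/p[2]/text()[2]; with
# ElementTree we locate p[2] and read the text after the <strong> label.
size_p = root.find(".//div[@id='specifications']/p[2]")
size = size_p.find('strong').tail.strip()
```

Inside Scrapy itself the full expressions work as written, since its Selectors use lxml's complete XPath engine rather than this subset.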
Finally, the spider code looks like this (Python):
- from scrapy.contrib.spiders import CrawlSpider, Rule
- from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
- from scrapy.selector import Selector
-
- # TorrentItem is the Item defined in step 2; in a real project,
- # import it from your project's items module.
-
- class MininovaSpider(CrawlSpider):
-     name = 'mininova'
-     allowed_domains = ['mininova.org']
-     start_urls = ['http://www.mininova.org/today']
-     rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]
-
-     def parse_torrent(self, response):
-         sel = Selector(response)
-         torrent = TorrentItem()
-         torrent['url'] = response.url
-         torrent['name'] = sel.xpath("//h1/text()").extract()
-         torrent['description'] = sel.xpath("//div[@id='description']").extract()
-         torrent['size'] = sel.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
-         return torrent
4. Run the spider to extract the data
Save the scraped data as JSON in the file scraped_data.json:
scrapy crawl mininova -o scraped_data.json -t json
This uses a feed export to generate the JSON file.
[Scrapy ships with built-in feed exports that support several serialization formats and storage backends.]
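The JSON feed export writes a single array with one object per scraped item. Reading it back is plain json work; the sample content below is hypothetical, matching the fields of TorrentItem:

```python
import json

# Hypothetical sample of what scraped_data.json might contain.
sample = '''[
  {"url": "http://www.mininova.org/tor/2657665",
   "name": ["Darwin - The Evolution Of An Exhibition"],
   "size": ["150.62 megabyte"]}
]'''

items = json.loads(sample)
for item in items:
    # name and size are lists because they came from .extract()
    print(item['url'], item['name'][0], item['size'][0])
```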
5. Review the scraped data
Selectors return lists, so every field extracted with .extract() is a list of strings rather than a single value.
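Because of this, the stored values are lists even when only one node matched, and text nodes keep surrounding whitespace. A small post-processing sketch (the item values below are hypothetical):

```python
# Hypothetical item as returned by parse_torrent above: list-valued fields.
torrent = {
    'url': 'http://www.mininova.org/tor/2657665',  # response.url is a plain string
    'name': ['Darwin - The Evolution Of An Exhibition'],
    'size': ['150.62 megabyte\n'],
}

def first_text(values):
    """Return the first non-blank entry, stripped of whitespace."""
    return next(v.strip() for v in values if v.strip())

name = first_text(torrent['name'])
size = first_text(torrent['size'])
```

This kind of cleanup is usually done in an Item Pipeline or an item loader rather than in the spider itself.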