Scrapy logic diagram
Creating a project
- Create a new Scrapy project
scrapy startproject tutorial
:generates the project files
scrapy genspider name domain
:creates a new spider inside the project (a sketch of the generated skeleton follows the file list below) - the startproject command above creates a tutorial directory with the following contents
scrapy.cfg
: the project's configuration file
tutorial/
: the project's Python module; your code goes in here
tutorial/items.py
: the project's item definitions
tutorial/pipelines.py
: the project's pipelines file
tutorial/settings.py
: the project's settings file
tutorial/spiders/
: the directory where spider code is placed
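The spider skeleton that scrapy genspider generates looks roughly like this (a sketch only: the name "dmoz" and domain "dmoz.org" are example arguments, and the exact template varies between Scrapy versions):
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # the name argument passed to genspider
    allowed_domains = ["dmoz.org"]     # the domain argument passed to genspider
    start_urls = ["http://dmoz.org/"]

    def parse(self, response):
        pass                           # parsing logic goes here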
Writing a Spider
Spider
- class scrapy.spiders.Spider
【The standard (base) Spider class
name
: the spider's name, used when invoking it from the project
start_urls
: the list of starting URLs to crawl
allowed_domains
: the list of domains the spider is allowed to crawl
parse()
: parses the response body; it is the default callback for Request(start_url)
Request()
: submits a request for the given URL
start_requests()
: returns an iterable with the spider's initial Requests; by default one Request per URL in start_urls, handled by parse() (a sketch of overriding it follows below)
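A minimal sketch of overriding start_requests() (the spider name and URLs here are made up for illustration):
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"    # hypothetical name

    def start_requests(self):
        # Replaces start_urls: yield the initial Requests explicitly,
        # each with an explicit callback
        urls = [
            "http://www.example.com/page/1/",
            "http://www.example.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log("Visited %s" % response.url)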
A basic Spider example
import scrapy

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each response body to a file named after the URL's last path segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
- Returning multiple Requests and Items from a single callback
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Yield an item for every <h3> on the page...
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        # ...and follow every link, parsing it with this same callback
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
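The hrefs extracted above may be relative; in Scrapy 1.0 and later they can be resolved against the page URL with response.urljoin() before building the Request, roughly:
for href in response.xpath('//a/@href').extract():
    # resolve relative links against the current page URL before requesting
    yield scrapy.Request(response.urljoin(href), callback=self.parse)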
CrawlSpider
- class scrapy.contrib.spiders.CrawlSpider
【The spider most commonly used for crawling regular websites
rules
: a list containing one or more Rule objects
parse_start_url(response)
: called when the requests for the start_urls return - CrawlSpider example
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php')
        # and follow them (no callback means follow defaults to True)
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Extract links matching 'item.php' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        # In a real project this would be an Item subclass with declared fields;
        # a bare scrapy.Item() has none, so assigning to it raises KeyError
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
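For parse_item() above to actually populate id, name, and description, the item would need those fields declared; a hypothetical declaration matching the example:
import scrapy

class MyItem(scrapy.Item):     # hypothetical item matching the example above
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()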
Selectors
xpath()
: returns a SelectorList of all nodes matching the XPath expression
css()
: returns a SelectorList of all nodes matching the CSS expression
extract()
: returns a list of unicode strings, serializing the selected nodes
re()
: applies a regular expression and returns a list of unicode strings (the matches)
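A minimal sketch of the four methods above, run against a made-up HTML fragment:
from scrapy.selector import Selector

body = '<html><body><ul><li><a href="/a">Item 1</a></li></ul></body></html>'
sel = Selector(text=body)

sel.xpath('//li/a')                        # SelectorList of matching nodes
sel.css('li a::attr(href)').extract()      # ['/a']
sel.xpath('//li/a/text()').extract()       # ['Item 1']
sel.xpath('//li/a/text()').re(r'(\d+)')    # ['1']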
Debugging selectors in the shell
scrapy shell 'url'
: run from the project's root directory; once the shell loads, a local response variable holding the response data is available
response.headers
: prints the response headers
response.body
: prints the response body
response.xpath()
: shorthand for response.selector.xpath()
- Usage in the shell
In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
- Using selectors in the spider
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)
Using Items
Definition
- The Item object
【Each returned Item object can be thought of as one row of data
【Replicates the dict API, adding only the extra fields attribute (see the sketch after the definition below)
fields
: a dict containing all of the item's declared fields, not only the populated ones; the keys are the field names, the values are Field() objects - The Field object
【Simply a dict; in effect an alias for dict
import scrapy

class DmozItem(scrapy.Item):      # scrapy.item.Item
    title = scrapy.Field()        # scrapy.item.Field
    link = scrapy.Field()
    desc = scrapy.Field()
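A minimal sketch of that dict-like API, using the DmozItem just defined (the values are made up):
item = DmozItem(title='Example', link='http://www.example.com/')
item['desc'] = 'a description'   # dict-style assignment
item['title']                    # dict-style access -> 'Example'
item.keys()                      # only the populated fields
DmozItem.fields                  # all declared fields (name -> Field() object)
dict(item)                       # convert to a plain dict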
Usage
The Spider returns the scraped data as Item objects
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
Running the crawl
scrapy crawl dmoz
: starts the spider with the given name from the shell; this runs the parse() of the spider whose name is "dmoz", which in effect means issuing Request(start_url, callback=parse)
scrapy crawl dmoz -o items.json
: serializes the scraped data as JSON and writes it to an items.json file
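The export format is inferred from the output file's extension, so other feed-export formats work the same way, for example:
scrapy crawl dmoz -o items.csv
scrapy crawl dmoz -o items.xml
scrapy crawl dmoz -o items.jl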