Exploring scrapy.Selector

Note

The examples in this post come from the official Scrapy documentation; readers may want to look at it first:
https://doc.scrapy.org/en/0.14/topics/selectors.html


Getting started

Open a terminal and run:

# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

This produces output like the following:

2017-04-30 13:51:21 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-30 13:51:21 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2017-04-30 13:51:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-30 13:51:22 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-30 13:51:22 [scrapy.core.engine] INFO: Spider opened
2017-04-30 13:51:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> from <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
2017-04-30 13:51:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000000049F1C18>
[s]   item       {}
[s]   request    <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000000049F1978>
[s]   spider     <DefaultSpider 'default' at 0x4d494a8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

The listing shows the objects available in the shell: the scrapy module, crawler, item, request, response, settings, and spider.
The page source we just fetched lives in the response object.

In [1]: type(response)  # inspect the response object
Out[1]: scrapy.http.response.html.HtmlResponse 

The response object's xpath() method can already extract data on its own; it returns a SelectorList:

In [3]: response.xpath('//a')
Out[3]:
[<Selector xpath='//a' data=u'<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a' data=u'<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a' data=u'<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a' data=u'<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a' data=u'<a href="image5.html">Name: My image 5 <'>]

In [4]: response.xpath('//a/text()')
Out[4]:
[<Selector xpath='//a/text()' data=u'Name: My image 1 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 2 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 3 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 4 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 5 '>]
In [5]: response.xpath('//a/text()').extract()
Out[5]:
[u'Name: My image 1 ',
 u'Name: My image 2 ',
 u'Name: My image 3 ',
 u'Name: My image 4 ',
 u'Name: My image 5 ']

A SelectorList also accepts regular expressions via .re(), which returns a plain list of strings:

In [10]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
Out[10]:
[u'My image 1 ',
 u'My image 2 ',
 u'My image 3 ',
 u'My image 4 ',
 u'My image 5 ']

A SelectorList is simply a list of Selector objects. Both Selector and SelectorList have an xpath() method, which again returns a SelectorList, so queries can be chained:

In [13]: links = response.xpath('//a[contains(@href, "image")]')

In [15]: for index, link in enumerate(links):
    ...:     args = (index, link.xpath('@href').extract()[0], link.xpath('img/@src').extract()[0])
    ...:     print 'Link number %d points to url %s and image %s' % args

Output:

Link number 0 points to url image1.html and image image1_thumb.jpg
Link number 1 points to url image2.html and image image2_thumb.jpg
Link number 2 points to url image3.html and image image3_thumb.jpg
Link number 3 points to url image4.html and image image4_thumb.jpg
Link number 4 points to url image5.html and image image5_thumb.jpg

Note that some APIs shown on the official site have since changed and no longer work as written.


XPath syntax

1. Selecting nodes

Common path expressions:

Expression   Description                                                Example
nodename     selects all nodes with the given name                      xpath('//div') selects all div nodes
/            selects from the root node                                 xpath('/div') selects the div node directly under the root
//           selects matching nodes anywhere in the document            xpath('//div') selects all div nodes, wherever they are
.            selects the current node                                   xpath('./div') selects div children of the current node
..           selects the parent of the current node                     xpath('..') moves up one level
@            selects attributes                                         xpath('//@class') selects all class attributes

2. Predicates

A predicate is written in square brackets and filters for a specific node, or for nodes containing a specified value.

Examples:

Expression                           Result
xpath('/body/div[1]')                the first div under body
xpath('/body/div[last()]')           the last div under body
xpath('/body/div[last()-1]')         the second-to-last div under body
xpath('/body/div[position()<3]')     the first two divs under body
xpath('/body/div[@class]')           divs under body that have a class attribute
xpath('/body/div[@class="main"]')    divs under body whose class attribute is "main"
xpath('/body/div[price>35.00]')      divs under body whose price child element has a value greater than 35

3. Wildcards

XPath uses wildcards to select unknown XML elements:

Expression          Result
xpath('/div/*')     all child nodes of div
xpath('/div[@*]')   all div nodes that have at least one attribute

4. Selecting several paths

The "|" operator unions multiple paths:

Expression                Result
xpath('//div | //table')  all div and all table nodes

5. XPath axes

An axis defines a node set relative to the current node:

Axis                Expression                         Description
ancestor            xpath('./ancestor::*')             all ancestors of the current node (parent, grandparent, ...)
attribute           xpath('./attribute::*')            all attributes of the current node
child               xpath('./child::*')                all children of the current node
descendant          xpath('./descendant::*')           all descendants of the current node (children, grandchildren, ...)
following           xpath('./following::*')            everything in the document after the closing tag of the current node
following-sibling   xpath('./following-sibling::*')    all siblings after the current node
parent              xpath('./parent::*')               the parent of the current node
preceding           xpath('./preceding::*')            everything in the document before the opening tag of the current node
preceding-sibling   xpath('./preceding-sibling::*')    all siblings before the current node
self                xpath('./self::*')                 the current node itself

6. Functions

XPath functions make fuzzy matching easier:

Function      Usage                                                          Meaning
starts-with   xpath('//div[starts-with(@id, "ma")]')                         divs whose id starts with "ma"
contains      xpath('//div[contains(@id, "ma")]')                            divs whose id contains "ma"
and           xpath('//div[contains(@id, "ma") and contains(@id, "in")]')    divs whose id contains both "ma" and "in"
text()        xpath('//div[contains(text(), "ma")]')                         divs whose text contains "ma"