Note

The examples in this article come from the official Scrapy documentation, which readers may want to consult first:
https://doc.scrapy.org/en/0.14/topics/selectors.html
Getting started

Open a terminal and run:
# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
This produces output like the following:
2017-04-30 13:51:21 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-30 13:51:21 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2017-04-30 13:51:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-30 13:51:22 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-30 13:51:22 [scrapy.core.engine] INFO: Spider opened
2017-04-30 13:51:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> from <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
2017-04-30 13:51:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x00000000049F1C18>
[s] item {}
[s] request <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <scrapy.settings.Settings object at 0x00000000049F1978>
[s] spider <DefaultSpider 'default' at 0x4d494a8>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
The listing enumerates the objects available in the shell: the scrapy module, crawler, item, request, response, settings, and spider.
The source of the page we just crawled is held in the response object.
In [1]: type(response)  # inspect the response object
Out[1]: scrapy.http.response.html.HtmlResponse
The response object's xpath method can already be used to extract data; it returns a SelectorList:
In [3]: response.xpath('//a')
Out[3]:
[<Selector xpath='//a' data=u'<a href="image1.html">Name: My image 1 <'>,
<Selector xpath='//a' data=u'<a href="image2.html">Name: My image 2 <'>,
<Selector xpath='//a' data=u'<a href="image3.html">Name: My image 3 <'>,
<Selector xpath='//a' data=u'<a href="image4.html">Name: My image 4 <'>,
<Selector xpath='//a' data=u'<a href="image5.html">Name: My image 5 <'>]
In [4]: response.xpath('//a/text()')
Out[4]:
[<Selector xpath='//a/text()' data=u'Name: My image 1 '>,
<Selector xpath='//a/text()' data=u'Name: My image 2 '>,
<Selector xpath='//a/text()' data=u'Name: My image 3 '>,
<Selector xpath='//a/text()' data=u'Name: My image 4 '>,
<Selector xpath='//a/text()' data=u'Name: My image 5 '>]
In [5]: response.xpath('//a/text()').extract()
Out[5]:
[u'Name: My image 1 ',
u'Name: My image 2 ',
u'Name: My image 3 ',
u'Name: My image 4 ',
u'Name: My image 5 ']
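For readers without a Scrapy shell at hand, the extraction above can be sketched with the standard library's ElementTree, which supports a small subset of XPath. The HTML snippet below is an assumed, minimal stand-in for the sample page, not the real document:

```python
import xml.etree.ElementTree as ET

# Assumed, minimal stand-in for selectors-sample1.html (not the real page).
html = """
<html><body>
  <a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"/></a>
  <a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"/></a>
</body></html>
"""

root = ET.fromstring(html)

# Rough equivalent of response.xpath('//a/text()').extract():
# find every <a> anywhere in the tree and take its leading text.
texts = [a.text for a in root.findall('.//a')]
print(texts)  # ['Name: My image 1 ', 'Name: My image 2 ']
```

Unlike Scrapy's selectors, ElementTree has no `text()` step, so the text is read off the matched elements directly.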
A SelectorList also supports regular expressions through its re method, which returns a plain list of strings:
In [10]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
Out[10]:
[u'My image 1 ',
u'My image 2 ',
u'My image 3 ',
u'My image 4 ',
u'My image 5 ']
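What `.re()` does can be sketched with the standard `re` module: apply the pattern to each extracted string and collect the group matches. The `texts` list is hardcoded here for illustration:

```python
import re

# The strings extracted above (hardcoded here for illustration).
texts = ['Name: My image 1 ', 'Name: My image 2 ']

# .re(r'Name:\s*(.*)') roughly amounts to collecting the first-group
# matches of the pattern over each extracted string.
pattern = re.compile(r'Name:\s*(.*)')
names = [m.group(1) for t in texts for m in pattern.finditer(t)]
print(names)  # ['My image 1 ', 'My image 2 ']
```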
A SelectorList is simply a list of Selector objects. Both Selector and SelectorList have an xpath method that again returns a SelectorList (or Selector), so queries can be chained:
In [13]: links = response.xpath('//a[contains(@href, "image")]')

In [15]: for index, link in enumerate(links):
    ...:     args = (index, link.xpath('@href').extract()[0], link.xpath('img/@src').extract()[0])
    ...:     print 'Link number %d points to url %s and image %s' % args
This prints:
Link number 0 points to url image1.html and image image1_thumb.jpg
Link number 1 points to url image2.html and image image2_thumb.jpg
Link number 2 points to url image3.html and image image3_thumb.jpg
Link number 3 points to url image4.html and image image4_thumb.jpg
Link number 4 points to url image5.html and image image5_thumb.jpg
Note that some of the APIs shown on the official site are no longer applicable in current versions.
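The chained query above can likewise be emulated outside Scrapy with ElementTree: `link.xpath('@href')` becomes `.get('href')`, and `link.xpath('img/@src')` becomes a relative `find()` from the link element. The HTML snippet is again an assumed stand-in:

```python
import xml.etree.ElementTree as ET

# Assumed, minimal stand-in for the sample page.
html = """
<html><body>
  <a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"/></a>
  <a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"/></a>
</body></html>
"""
root = ET.fromstring(html)

# link.xpath('@href') becomes .get('href'); link.xpath('img/@src')
# becomes a relative find() from the link element.
lines = []
for index, link in enumerate(root.findall('.//a')):
    args = (index, link.get('href'), link.find('img').get('src'))
    lines.append('Link number %d points to url %s and image %s' % args)
print('\n'.join(lines))
```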
XPath syntax

1. Selecting nodes

Common path expressions:

Expression | Description | Example | Result |
---|---|---|---|
nodename | Selects all nodes with the name nodename | xpath('//div') | Selects all div nodes |
/ | Selects from the root node | xpath('/div') | Selects a div node directly under the root |
// | Selects matching nodes anywhere in the document, regardless of their position | xpath('//div') | Selects all div nodes |
. | Selects the current node | xpath('./div') | Selects div children of the current node |
.. | Selects the parent of the current node | xpath('..') | Moves up to the parent node |
@ | Selects attributes | xpath('//@class') | Selects all class attributes |
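The expressions above can be tried with the standard library's ElementTree, which implements a limited XPath subset (attribute selection like `//@class` is not supported there; use `.get()` instead). The document below is hypothetical, chosen only to illustrate the table:

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to illustrate the expressions above.
doc = ET.fromstring(
    '<root><div id="a"><p>one</p></div><div id="b"><p>two</p></div></root>'
)

# '//div' -- every div anywhere in the document (spelled './/div' here)
assert len(doc.findall('.//div')) == 2

# '/root/div' -- divs that are direct children of the root
assert len(doc.findall('./div')) == 2

# './p' relative to a node -- p children of the first div
first_div = doc.find('div')
assert [p.text for p in first_div.findall('./p')] == ['one']

# '@id' -- attribute access is .get() in ElementTree
assert first_div.get('id') == 'a'
print('all node-selection checks passed')
```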
2. Predicates

A predicate is embedded in square brackets and is used to find a specific node, or nodes containing a specified value.

Examples:

Expression | Result |
---|---|
xpath('/body/div[1]') | Selects the first div under body |
xpath('/body/div[last()]') | Selects the last div under body |
xpath('/body/div[last()-1]') | Selects the second-to-last div under body |
xpath('/body/div[position()<3]') | Selects the first two divs under body |
xpath('/body/div[@class]') | Selects divs under body that have a class attribute |
xpath('/body/div[@class="main"]') | Selects divs under body whose class attribute is "main" |
xpath('/body/div[price>35.00]') | Selects divs under body whose price child element has a value greater than 35 |
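Most of these predicates (positional indexes, `last()`, and attribute tests, though not `position()<3` or value comparisons) also work in ElementTree, which makes them easy to try. The body element here is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical body element used only to illustrate the predicates above.
body = ET.fromstring(
    '<body>'
    '<div class="main">first</div>'
    '<div>second</div>'
    '<div class="side">third</div>'
    '</body>'
)

assert body.find('div[1]').text == 'first'          # first div (1-based)
assert body.find('div[last()]').text == 'third'     # last div
assert body.find('div[last()-1]').text == 'second'  # second to last
assert [d.text for d in body.findall('div[@class]')] == ['first', 'third']
assert body.find("div[@class='main']").text == 'first'
print('all predicate checks passed')
```

Note that XPath positions are 1-based: `div[1]` is the first div, not the second.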
3. Wildcards

XPath uses wildcards to select unknown XML elements.

Expression | Result |
---|---|
xpath('/div/*') | Selects all child nodes of div |
xpath('/div[@*]') | Selects all div nodes that have any attribute |
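A quick sketch of both wildcards with ElementTree (hypothetical elements); the `[@*]` form is not supported there, but a plain-Python filter on `el.attrib` gives the same result:

```python
import xml.etree.ElementTree as ET

# Hypothetical elements used only for illustration.
div = ET.fromstring('<div><p>a</p><span>b</span></div>')

# '/div/*' -- all element children of div, whatever their tag
children = [child.tag for child in div.findall('*')]
assert children == ['p', 'span']

# '/div[@*]' has no ElementTree equivalent; a plain-Python filter
# on el.attrib does the same job.
root = ET.fromstring('<root><div id="x">with</div><div>without</div></root>')
with_attrs = [d.text for d in root.findall('div') if d.attrib]
assert with_attrs == ['with']
```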
4. Selecting several paths

The "|" operator selects the union of several paths.

Expression | Result |
---|---|
xpath('//div|//table') | Selects all div and table nodes |
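ElementTree has no union operator, so a sketch has to concatenate two queries (a real XPath engine such as the one behind Scrapy's selectors evaluates `|` natively and keeps document order):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><div/><table/><div/></root>')

# '//div|//table' -- ElementTree has no union operator, so the sketch
# concatenates two queries. Note this loses document order, which a
# real XPath union would preserve.
nodes = root.findall('.//div') + root.findall('.//table')
assert sorted(n.tag for n in nodes) == ['div', 'div', 'table']
```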
5. XPath axes

An axis defines a node set relative to the current node.

Axis | Expression | Description |
---|---|---|
ancestor | xpath('./ancestor::*') | Selects all ancestors of the current node (parent, grandparent, ...) |
attribute | xpath('./attribute::*') | Selects all attributes of the current node |
child | xpath('./child::*') | Selects all children of the current node |
descendant | xpath('./descendant::*') | Selects all descendants of the current node (children, grandchildren, ...) |
following | xpath('./following::*') | Selects everything in the document after the closing tag of the current node |
following-sibling | xpath('./following-sibling::*') | Selects all siblings after the current node |
parent | xpath('./parent::*') | Selects the parent of the current node |
preceding | xpath('./preceding::*') | Selects all nodes that appear before the opening tag of the current node |
preceding-sibling | xpath('./preceding-sibling::*') | Selects all siblings before the current node |
self | xpath('./self::*') | Selects the current node |
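ElementTree supports none of these axes, but the parent and ancestor axes can be emulated with a child-to-parent map built in one pass over the tree. The document below is hypothetical:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<html><body><div><p>text</p></div></body></html>')

# ElementTree has no axis support, so parent/ancestor are emulated
# with a child -> parent map built in one pass over the tree.
parent = {child: el for el in root.iter() for child in el}

p = root.find('.//p')

# './parent::*' equivalent
assert parent[p].tag == 'div'

# './ancestor::*' equivalent: walk the map up to the root
ancestors = []
node = p
while node in parent:
    node = parent[node]
    ancestors.append(node.tag)
assert ancestors == ['div', 'body', 'html']
```

A full XPath engine (as used by Scrapy's selectors) evaluates these axes directly, without any such bookkeeping.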
6. Functions

Functions make fuzzy matching easier.

Function | Usage | Explanation |
---|---|---|
starts-with | xpath('//div[starts-with(@id,"ma")]') | Selects divs whose id value starts with ma |
contains | xpath('//div[contains(@id,"ma")]') | Selects divs whose id value contains ma |
and | xpath('//div[contains(@id,"ma") and contains(@id,"in")]') | Selects divs whose id value contains both ma and in |
text() | xpath('//div[contains(text(),"ma")]') | Selects divs whose text content contains ma |
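ElementTree does not implement `starts-with` or `contains`, but their semantics are easy to show with plain-Python filters over the matched elements. The ids below are hypothetical, chosen so that both filters match more than one node:

```python
import xml.etree.ElementTree as ET

# Hypothetical ids chosen so both filters match 'main' and 'margin'.
root = ET.fromstring(
    '<root><div id="main">ma text</div>'
    '<div id="margin">other</div>'
    '<div id="x">no</div></root>'
)
divs = root.findall('div')

# starts-with(@id, "ma")
starts = [d.get('id') for d in divs if d.get('id', '').startswith('ma')]
assert starts == ['main', 'margin']

# contains(@id, "ma") and contains(@id, "in")
both = [d.get('id') for d in divs
        if 'ma' in d.get('id', '') and 'in' in d.get('id', '')]
assert both == ['main', 'margin']

# contains(text(), "ma")
with_ma_text = [d.get('id') for d in divs if 'ma' in (d.text or '')]
assert with_ma_text == ['main']
```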