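Note: for learning purposes only. Do not use this for anything illegal. If any content infringes your rights, please contact the author to have it removed.
Author: zhu6201976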
1.response.xpath().get(default=None)
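get() takes a default argument whose value is None unless you set it. It returns the first extracted result, or default when nothing matches. Source: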
def get(self, default=None):
    """
    Return the result of ``.get()`` for the first element in this list.
    If the list is empty, return the default value.
    """
    for x in self:
        return x.get()
    else:
        return default
Example:
response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
2.response.xpath().re() or response.xpath().re_first()
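Selector objects can also extract data with regular expressions: re() returns a list of strings, and re_first() returns the first match as a str.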
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']
response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
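3.Using count() in XPath expressions
For example, to select every ul element that has exactly 9 li children, use the XPath //ul[count(li)=9]. The number can also be passed in as an XPath variable: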
response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
4.In Scrapy, there is no need to parse the document with lxml yourself to use XPath
>>> from scrapy import Selector
>>> doc = u"""
... <div>
... <ul>
... <li class="item-0"><a href="link1.html">first item</a></li>
... <li class="item-1"><a href="link2.html">second item</a></li>
... <li class="item-inactive"><a href="link3.html">third item</a></li>
... <li class="item-1"><a href="link4.html">fourth item</a></li>
... <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']
Regular expressions: the test() function, for example, can prove quite useful when XPath's starts-with() or contains() are not sufficient.
5.Other XPath usage
from scrapy import Selector
str1 = """
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
"""
s = Selector(text=str1, type='html')
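# ret1 and ret2 return the same elements for this input, but they are not equivalent in general
ret1 = s.xpath('//p[has-class("foo")]').getall() # p elements whose class list includes the class foo
ret2 = s.xpath('//p[contains(@class,"foo")]').getall() # p elements whose class attribute contains the substring foo
ret3 = s.xpath('//p[has-class("foo", "bar-baz")]').getall() # p elements with both foo and bar-baz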
print(ret1, ret2, ret3)