Notes on the official Scrapy documentation

Category: Python web crawling

Article link: https://blog.csdn.net/zhu6201976/article/details/106607826

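Note: for learning purposes only. Do not use this for anything illegal. If it infringes on your rights, please contact the author to have it removed.

Author: zhu6201976

Blog: https://blog.csdn.net/zhu6201976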

1.response.xpath().get(default=None)

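get() takes a default argument, which is None unless you pass something else. If nothing was extracted it returns default; otherwise it returns the first extracted result. Source: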

    def get(self, default=None):
        """
        Return the result of ``.get()`` for the first element in this list.
        If the list is empty, return the default value.
        """
        for x in self:
            return x.get()
        else:
            return default

Example:

response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')

'not-found'
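
To make the for/else in the quoted source easier to follow, here is a minimal sketch using a stand-in class. MiniSelectorList is a hypothetical name invented for this illustration and holds plain strings; it is not the real selector list class.

```python
class MiniSelectorList(list):
    """Hypothetical stand-in for a selector list; items are plain strings."""

    def get(self, default=None):
        # The loop body returns on the very first item, if there is one.
        for x in self:
            return x
        else:
            # Reached only when the loop ends without returning,
            # which here means the list was empty.
            return default


found = MiniSelectorList(["first", "second"]).get(default="not-found")
missing = MiniSelectorList().get(default="not-found")
missing_no_default = MiniSelectorList().get()
print(found, missing, missing_no_default)
```

The practical upshot is that get() never raises on an empty result, so a parse callback can supply a sensible default instead of checking the list length first.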

2.response.xpath().re() or response.xpath().re_first()

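Selector objects also support regular expressions: re() returns a list of str, and re_first() returns the first match as a str (or None if nothing matches).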

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

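3.Counting with count() in xpath() expressions

For example, to find the ul elements that have exactly 9 li children, the XPath is //ul[count(li)=9]. The count can also be passed in as an XPath variable, as with $cnt here: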

response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'

4.In Scrapy there is no need to go through lxml directly to evaluate XPath

>>> from scrapy import Selector
>>> doc = u"""
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

5.Other xpath usages

from scrapy import Selector

str1 = """
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
"""

s = Selector(text=str1, type='html')
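# ret1 and ret2 return the same elements for this document, but they are not equivalent in general:
# has-class() matches whole class names, contains() matches any substring of the attribute
ret1 = s.xpath('//p[has-class("foo")]').getall()  # p elements whose class list includes foo
ret2 = s.xpath('//p[contains(@class,"foo")]').getall()  # p elements whose class attribute contains the substring foo
ret3 = s.xpath('//p[has-class("foo", "bar-baz")]').getall()  # both foo and bar-baz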
print(ret1, ret2, ret3)