Scrapy Selectors

Latest recommended article published on 2024-10-18 00:00:00

chongaishi2879

Tags: python shell crawler

Original link: https://my.oschina.net/sii/blog/655853

Taken from:

http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/selectors.html#removing-namespaces

contains():    # restricts the selection by substring

response.xpath('//a[contains(@href, "image")]/img/@src').extract()

//a[contains(@href, "image")]:    # selects every <a> element whose href attribute contains the string "image"

re:test():    # restricts the selection by regular expression
sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
//li[re:test(@class, "item-\d$")]:    # selects every <li> element whose class attribute matches the regular expression

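Removing namespaces

In scraping projects it is often much more convenient to drop namespaces entirely and work with plain element names, because the XPath expressions become simpler and more practical to write. The Selector.remove_namespaces() method does exactly that.

Let's look at an example that illustrates this with the Atom feed of the GitHub blog.

First, we open the shell with the URL we want to scrape: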

$ scrapy shell https://github.com/blog.atom
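Once in the shell, we can try to select all <link> objects and see that nothing is returned (because the Atom XML namespace hides those nodes from an unqualified name):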
>>> response.xpath("//link")
[]

But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 ...
