Scrapy Selectors

花阴偷移

Article tags: scrapy tensorflow artificial intelligence python deep learning

Article link: https://blog.csdn.net/weixin_43394129/article/details/132487345

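1. Introduction

When scraping web pages, the most common task you need to perform is extracting data from the HTML source. Several libraries are available for this, for example:

1) BeautifulSoup is a very popular web scraping library among Python programmers. It copes well with bad markup, but it is slow.

2) lxml is an XML parsing library (which also parses HTML). lxml is not part of the Python standard library.

Scrapy has its own mechanism for extracting data. These are called selectors because they select certain parts of the HTML document, specified by XPath or CSS expressions.

Scrapy Selectors is a thin wrapper around the parsel library; the purpose of this wrapper is to provide better integration with Scrapy Response objects. parsel is a stand-alone web scraping library that can be used without Scrapy. It uses the lxml library under the hood and implements an easy API on top of the lxml API. This means Scrapy selectors are very similar to lxml in speed and parsing accuracy.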

2. Using selectors

Response objects expose a Selector instance on the .selector attribute:

response.selector.xpath('//span/text()').get()

Querying responses with XPath and CSS is so common that responses include two more shortcuts, response.xpath() and response.css():

response.xpath('//span/text()').get()
'good'
response.css('span::text').get()
'good'

If needed, you can also use Selector directly. Constructing it from text:

from scrapy.selector import Selector
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').get()

Constructing it from a response (HtmlResponse is one of the TextResponse subclasses):

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
# body is the str defined above; a str body needs an explicit encoding
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').get()

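3. Using selectors interactively

To explain how to use selectors, we'll use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server: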

PS F:\python_work\scrapy_sample> scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

In [3]: response.css("title::text")[0].get()
Out[3]: 'Example website'

In [4]: response.xpath("//title/text()")[0].get()
Out[4]: 'Example website'

# First locate the tags with CSS, then get the attribute with XPath
In [9]: response.css("img").xpath('@src').getall()
Out[9]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# Get the text of the first <a> tag
In [10]: response.xpath("//div[@id='images']/a/text()").get()
Out[10]: 'Name: My image 1 '

# If no element is found, .get() returns None
In [14]: response.xpath("//div[@id='images1']/a/text()").get()  is None       
Out[14]: True

# "for img in response.css("img")" iterates over every img element
# "img.attrib['src']" reads the src attribute of each one
In [16]: [img.attrib['src'] for img in response.css("img")]
Out[16]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# .attrib can also be used directly on a SelectorList; it returns the attributes of the first matching element
In [17]: response.css("img").attrib['src']
Out[17]: 'image1_thumb.jpg'

# Three ways to get the href attribute of the base tag
In [18]: response.css('base').attrib['href']
Out[18]: 'http://example.com/'

In [19]: response.xpath('//base/@href').get()
Out[19]: 'http://example.com/'

In [20]: response.css('base::attr(href)').get()
Out[20]: 'http://example.com/'

# Use XPath to get the src of every image inside a link whose href contains "image"
In [23]: response.xpath('//a[contains(@href,"image")]/img/@src').getall()     
Out[23]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# The same query in CSS: links whose href contains "image"
In [24]: response.css('a[href*=image] img::attr(src)').getall()
Out[24]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

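4. Extensions to CSS selectors

According to the W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:

To select text nodes, use ::text, e.g. response.css('span::text').get()

To select attribute values, use ::attr(name), where name is the name of the attribute, e.g. response.css('img::attr(src)').get()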

# *::text selects all descendant text nodes of the current selector context
In [25]: response.css('#images *::text').getall()
Out[25]: 
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

5. Nesting selectors

In [26]: links = response.xpath('//a[contains(@href, "image")]')

In [27]: for index,link in enumerate(links):
    ...:     href_xpath=link.xpath('@href').get()
    ...:     print(f'{index} points to url {href_xpath}')

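6. extract() and extract_first()

Older Scrapy code uses .extract() and .extract_first(); the current API is .getall() and .get(), respectively. The older methods still work and have not been deprecated.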

7. Relative XPaths

divs = response.xpath('//div')

# this is wrong - gets all <p> from the whole document
for p in divs.xpath('//p'):  
    print(p.get())

# this is the correct way - gets all <p> inside the divs (note the leading dot)
for p in divs.xpath('.//p'): 
    print(p.get())