scrapy2

Latest recommended article published on 2024-07-31 14:30:10

chouhong9972

Tags: python

Original link: https://my.oschina.net/u/3411375/blog/877509

Selectors

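BeautifulSoup and lxml are two commonly used web page parsers and are worth learning when time allows.

Scrapy parses pages with XPath and CSS expressions; its selectors are built on top of the lxml library.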

Constructing Selectors

A Scrapy Selector instance can be constructed either from text or from a TextResponse object.

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text. This form is convenient for trying out parsing logic against a page you need to parse.

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response. This is the form most often used in real programs.

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

Using Selectors

Example (the HTML page used in the rest of this section):

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Scrapy provides two shortcuts, response.css() and response.xpath(), which can be used directly in spider code.

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

.css() and .xpath() each return a SelectorList instance, so the calls can also be nested (chained):

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Use extract() to get the matched data as a list of strings:

>>> response.xpath('//title/text()').extract()
[u'Example website']

Use extract_first() to get only the first matched element:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

If it returns None, no matching element was found.

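CSS selectors can select text or attribute nodes with pseudo-elements such as ::text and ::attr(name). These are Scrapy extensions rather than standard CSS3 pseudo-elements. (I am not yet familiar with CSS3 pseudo-elements; noted for later study.)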

>>> response.css('title::text').extract()
[u'Example website']

Some examples of getting attribute values with both approaches:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

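Nesting Selectors

Because .xpath() and .css() return a SelectorList instance, the result can be kept in a variable and queried again with further selection calls.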

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

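Using selectors with regular expressions

Selectors have a .re() method for extracting data with regular expressions. Unlike .xpath() and .css(), .re() returns a list of strings rather than a SelectorList, so .re() calls cannot be nested.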

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1 ',
 u'My image 2 ',
 u'My image 3 ',
 u'My image 4 ',
 u'My image 5 ']

I think this material covers the needs of simple page parsing; for more advanced topics, see the original Scrapy documentation.

Reposted from: https://my.oschina.net/u/3411375/blog/877509
