Python: Learning and Teaching Web Crawling (6): Using Selectors in Scrapy

花纵酒

Article link: https://blog.csdn.net/lm19770429/article/details/107136345

Reference: the official Scrapy documentation: https://docs.scrapy.org/en/latest/

Sample page to scrape: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

The full HTML source:
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, start the Scrapy shell on the sample page:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Then, at the shell prompt:

The xpath method:

In [4]: response.xpath("//title/text()")
Out[4]: [<Selector xpath='//title/text()' data='Example website'>]

Getting the text:

In [5]: response.xpath("//title/text()").extract_first()
Out[5]: 'Example website'

In [7]: response.xpath("//title/text()").extract()
Out[7]: ['Example website']

The same results can also be obtained with get() and getall():

In [8]: response.xpath("//title/text()").get()
Out[8]: 'Example website'

In [10]: response.xpath("//title/text()").getall()
Out[10]: ['Example website']

The css method:

In [12]: response.css("title::text")
Out[12]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

In [17]: response.css("img").xpath("@src")
Out[17]:
[<Selector xpath='@src' data='image1_thumb.jpg'>,
<Selector xpath='@src' data='image2_thumb.jpg'>,
<Selector xpath='@src' data='image3_thumb.jpg'>,
<Selector xpath='@src' data='image4_thumb.jpg'>,
<Selector xpath='@src' data='image5_thumb.jpg'>]

In [18]: response.css("img").xpath("@src").get()
Out[18]: 'image1_thumb.jpg'

Finding elements by attribute with XPath:

In [20]: response.xpath("//div[@id='images']/a/text()")
Out[20]:
[<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 1 '>,
<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 2 '>,
<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 3 '>,
<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 4 '>,
<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 5 '>]

The CSS counterpart (the first query selects the href attributes rather than the link text):

In [28]: response.css("div[id='images'] a").xpath("@href")
Out[28]:
[<Selector xpath='@href' data='image1.html'>,
<Selector xpath='@href' data='image2.html'>,
<Selector xpath='@href' data='image3.html'>,
<Selector xpath='@href' data='image4.html'>,
<Selector xpath='@href' data='image5.html'>]

In [32]: response.css("div[id='images'] a>img")
Out[32]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image1_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image2_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image3_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image4_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image5_thumb.jpg">'>]

In [33]: response.css("div[id='images']>a>img")
Out[33]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image1_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image2_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image3_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image4_thumb.jpg">'>,
<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image5_thumb.jpg">'>]

.get() returns None if no element was found:

In [38]: response.xpath('//div[@id="not-exists"]/text()').get() is None
Out[38]: True

Setting a default value to return when nothing is found:

In [40]: response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
Out[40]: 'not-found'

Instead of using e.g. '@src' XPath it is possible to query for attributes using .attrib property of a Selector:

In other words, besides the XPath @name syntax, you can read an attribute through the .attrib['name'] mapping of a selector:

In [41]: [img.attrib['src'] for img in response.css('img')]
Out[41]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

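Note the difference: called on the SelectorList itself, .attrib returns only the attributes of the first matching element: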

In [42]: response.css("img").attrib['src']
Out[42]: 'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:

>>> response.css('base').attrib['href']
'http://example.com/'

Three equivalent ways to get an attribute value:

In [43]: response.xpath("//base/@href").get()
Out[43]: 'http://example.com/'

In [44]: response.css("base::attr(href)").get()
Out[44]: 'http://example.com/'

In [48]: response.css("base").attrib["href"]
Out[48]: 'http://example.com/'

Selecting content from elements whose attribute matches a condition:

xpath:

In [50]: response.xpath("//a[contains(@href,'image')]/@href").getall()
Out[50]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [52]: response.xpath("//a[contains(@href,'image')]/img/@src").getall()
Out[52]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

css:

In [54]: response.css("a[href*=image]::attr(href)").getall()
Out[54]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [55]: response.css("a[href*=image] img::attr(src)").getall()
Out[55]:
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

Extensions to CSS selectors:

to select text nodes, use ::text, e.g. title::text
to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of

*::text selects all descendant text nodes of the current selector context:

In [62]: response.css("#images *::text").getall()
Out[62]:
['\n   ',
'Name: My image 1 ',
'\n   ',
'Name: My image 2 ',
'\n   ',
'Name: My image 3 ',
'\n   ',
'Name: My image 4 ',
'\n   ',
'Name: My image 5 ',
'\n ']

Iteration example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

In [65]: [a.xpath("@href") for a in response.css("a")]
Out[65]:
[[<Selector xpath='@href' data='image1.html'>],
[<Selector xpath='@href' data='image2.html'>],
[<Selector xpath='@href' data='image3.html'>],
[<Selector xpath='@href' data='image4.html'>],
[<Selector xpath='@href' data='image5.html'>]]

In [66]: [a.xpath("@href").get() for a in response.css("a")]
Out[66]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [68]: [a.attrib["href"] for a in response.css("a")]
Out[68]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

>>> xp("(//li)[1]")
['<li>1</li>']
