Python: Learning and Teaching Web Scraping (6): Using Selectors in Scrapy

Reference: the official Scrapy documentation, https://docs.scrapy.org/en/latest/

Sample page to scrape: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

The complete HTML source:
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
First, open the page in the Scrapy shell:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Then, at the shell prompt:

The xpath() method:

In [4]: response.xpath("//title/text()")
Out[4]: [<Selector xpath='//title/text()' data='Example website'>]

Getting the text:

In [5]: response.xpath("//title/text()").extract_first()
Out[5]: 'Example website'

In [7]: response.xpath("//title/text()").extract()
Out[7]: ['Example website']

The same results can be obtained with get() and getall():

In [8]: response.xpath("//title/text()").get()
Out[8]: 'Example website'

In [10]: response.xpath("//title/text()").getall()
Out[10]: ['Example website']

The css() method:

In [12]: response.css("title::text")
Out[12]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

 



The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. Because a SelectorList offers the same .xpath() and .css() methods, calls can be chained to select nested data quickly:

In [17]: response.css("img").xpath("@src")
Out[17]:
[<Selector xpath='@src' data='image1_thumb.jpg'>,
 <Selector xpath='@src' data='image2_thumb.jpg'>,
 <Selector xpath='@src' data='image3_thumb.jpg'>,
 <Selector xpath='@src' data='image4_thumb.jpg'>,
 <Selector xpath='@src' data='image5_thumb.jpg'>]

In [18]: response.css("img").xpath("@src").get()
Out[18]: 'image1_thumb.jpg'

Selecting by attribute with XPath:

In [20]: response.xpath("//div[@id='images']/a/text()")
Out[20]:
[<Selector xpath="//div[@id='images']/a/text()" data='Name: My image 1 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 2 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 3 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 4 '>,
 <Selector xpath="//div[@id='images']/a/text()" data='Name: My image 5 '>]

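Selecting by attribute with CSS (note how "a>img" and "div>a>img" translate to different XPath expressions):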

In [28]: response.css("div[id='images'] a").xpath("@href")
Out[28]:
[<Selector xpath='@href' data='image1.html'>,
 <Selector xpath='@href' data='image2.html'>,
 <Selector xpath='@href' data='image3.html'>,
 <Selector xpath='@href' data='image4.html'>,
 <Selector xpath='@href' data='image5.html'>]

In [32]: response.css("div[id='images'] a>img")
Out[32]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/descendant-or-self::*/a/img" data='<img src="image5_thumb.jpg">'>]

In [33]: response.css("div[id='images']>a>img")
Out[33]:
[<Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image1_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image2_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image3_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image4_thumb.jpg">'>,
 <Selector xpath="descendant-or-self::div[@id = 'images']/a/img" data='<img src="image5_thumb.jpg">'>]


.get() returns None if no element was found:

In [38]: response.xpath('//div[@id="not-exists"]/text()').get() is None
Out[38]: True

A default value can be supplied for the case where nothing is found:

In [40]: response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
Out[40]: 'not-found'

Instead of using an XPath such as '@src', attributes can also be read through the .attrib property of a Selector, that is, selector.attrib['name'] in place of the XPath form @name:

In [41]: [img.attrib['src'] for img in response.css('img')]
Out[41]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

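Note the difference: called on a SelectorList, .attrib returns only the attributes of the first matching element: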

In [42]: response.css("img").attrib['src']
Out[42]: 'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:

>>> response.css('base').attrib['href']
'http://example.com/'

Getting an attribute value, three ways:

In [43]: response.xpath("//base/@href").get()
Out[43]: 'http://example.com/'

In [44]: response.css("base::attr(href)").get()
Out[44]: 'http://example.com/'

In [48]: response.css("base").attrib["href"]
Out[48]: 'http://example.com/'

Selecting elements whose attribute matches a condition:

XPath:

In [50]: response.xpath("//a[contains(@href,'image')]/@href").getall()
Out[50]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [52]: response.xpath("//a[contains(@href,'image')]/img/@src").getall()
Out[52]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

CSS:

In [54]: response.css("a[href*=image]::attr(href)").getall()
Out[54]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [55]: response.css("a[href*=image] img::attr(src)").getall()
Out[55]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']


Extensions to CSS selectors:

  • to select text nodes, use ::text, e.g. title::text

  • to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of

*::text selects all descendant text nodes of the current selector context:

In [62]: response.css("#images *::text").getall()
Out[62]:
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']


Iteration examples:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

In [65]: [a.xpath("@href") for a in response.css("a")]
Out[65]:
[[<Selector xpath='@href' data='image1.html'>],
 [<Selector xpath='@href' data='image2.html'>],
 [<Selector xpath='@href' data='image3.html'>],
 [<Selector xpath='@href' data='image4.html'>],
 [<Selector xpath='@href' data='image5.html'>]]

In [66]: [a.xpath("@href").get() for a in response.css("a")]
Out[66]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [68]: [a.attrib["href"] for a in response.css("a")]
Out[68]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']


Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
>>> xp("(//li)[1]")
['<li>1</li>']

 
