Chapter 3.4/3.5 Scrapy CSS Selectors and Chapter Summary

Latest recommended article published 2024-08-06 16:04:58

lee's work

Category: Scrapy study    Tags: css, front end

Article link: https://blog.csdn.net/qq_27608761/article/details/121028927

3.4　CSS Selectors

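CSS stands for Cascading Style Sheets; its selectors form a language for locating parts of an HTML document.
CSS selector syntax is somewhat simpler than XPath, but it is also less powerful.
In fact, when we call the css method of a Selector object, it internally uses the Python library cssselect to translate the CSS selector expression into an XPath expression, and then calls the Selector object's xpath method.
Table 3-2 lists some basic CSS selector syntax.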

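Table 3-2　CSS selectors

Expression	Description	Example
*	Selects all elements	*
E	Selects E elements	p
E1,E2	Selects E1 elements and E2 elements	div,pre
E1 E2	Selects E2 elements that are descendants of E1	div p
E1>E2	Selects E2 elements that are children of E1	div>p
E1+E2	Selects E2 elements that immediately follow an E1 sibling	p+strong
.CLASS	Selects elements whose class attribute contains CLASS	.info
#ID	Selects elements whose id attribute is ID	#main
[ATTR]	Selects elements that have an ATTR attribute	[href]
[ATTR=VALUE]	Selects elements whose ATTR attribute equals VALUE	[method=post]
[ATTR~=VALUE]	Selects elements whose ATTR attribute, read as a whitespace-separated list, contains VALUE	[class~=clearfix]
E:nth-child(n) E:nth-last-child(n)	Selects E elements that are the nth (nth from last) child of their parent	a:nth-child(1) a:nth-last-child(2)
E:first-child E:last-child	Selects E elements that are the first (last) child of their parent	a:first-child a:last-child
E:empty	Selects E elements that have no children	div:empty
E::text	Selects the text nodes of E elements	p::text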

As with XPath, a few examples show how CSS selectors are used.
First create an HTML document and construct an HtmlResponse object:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> body ='''
 <html>
     <head>
        <base href='http://example.com/'/>
        <title>Example website</title>
     </head>
     
     <body>
         <a href='image0.html'>Name: Image 0 <br/><img src='image0.jpg'/> </a> 
         <div id='images-1' style="width: 1230px;">
             <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'/> </a> 
             <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'/> </a> 
             <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'/> </a> 
         </div>

         <div id='images-2' class='small'>
	         <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'/> </a> 
	         <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'/> </a> 
         </div>
     </body>
 </html>
 '''
>>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')
>>> response
<200 http://www.example.com>

●　E: selects E elements.

# Select all img elements
>>> response.css('img')
[<Selector xpath='descendant-or-self::img' data='<img src="image0.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::img' data='<img src="image5.jpg">'>]

●　E1,E2: selects E1 elements and E2 elements.

# Select all base and title elements
>>> response.css('base,title')
[<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base href="http://example.com/">'>,
 <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]

●　E1 E2: selects E2 elements that are descendants of E1.

# img elements among the descendants of div
>>> response.css('div img')
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image1.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image2.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image3.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image4.jpg">'>,
 <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image5.jpg">'>]

●　E1>E2: selects E2 elements that are children of E1.

# div elements among the children of body
>>> response.css('body>div')
[<Selector xpath='descendant-or-self::body/div' data='<div id="images-1" style="width: 1230...'>,
 <Selector xpath='descendant-or-self::body/div' data='<div id="images-2" class="small">\n\t  ...'>]

●　[ATTR]: selects elements that have an ATTR attribute.

# Select elements that have a style attribute
>>> response.css('[style]')
[<Selector xpath='descendant-or-self::*[@style]' data='<div id="images-1" style="width: 1230...'>]

●　[ATTR=VALUE]: selects elements whose ATTR attribute equals VALUE.

# Select elements whose id attribute is images-1
>>> response.css('[id=images-1]')
[<Selector xpath="descendant-or-self::*[@id = 'images-1']" data='<div id="images-1" style="width: 1230...'>]

●　E:nth-child(n): selects E elements that are the nth child of their parent.

# Select the first a of each div
>>> response.css('div>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image1.html">Name: Image 1 <...'>,
 <Selector xpath='descendant-or-self::div/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name: Image 4 <...'>]
# Select the first a of the second div (that div is the third child of body, after the leading a)
>>> response.css('div:nth-child(3)>a:nth-child(1)')
[<Selector xpath='descendant-or-self::div[count(preceding-sibling::*) = 2]/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name: Image 4 <...'>]

●　E:first-child: selects E elements that are the first child of their parent.
●　E:last-child: selects E elements that are the last child of their parent.

# Select the first a of the last div
>>> response.css('div:last-child>a:first-child')
[<Selector xpath='descendant-or-self::div[count(following-sibling::*) = 0]/a[count(preceding-sibling::*) = 0]' data='<a href="image4.html">Name: Image 4 <...'>]

●　E::text: selects the text nodes of E elements.

# Select the text of all a elements
>>> sel = response.css('a::text')
>>> sel
[<Selector xpath='descendant-or-self::a/text()' data='Name: Image 0 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 1 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 2 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 3 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 4 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: Image 5 '>,
 <Selector xpath='descendant-or-self::a/text()' data=' '>]
>>> sel.extract()
['Name: Image 0 ',
 ' ',
 'Name: Image 1 ',
 ' ',
 'Name: Image 2 ',
 ' ',
 'Name: Image 3 ',
 ' ',
 'Name: Image 4 ',
 ' ',
 'Name: Image 5 ',