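Compared with XPath selectors, CSS selectors feel a little easier: the syntax is basically what you would write in a .css file. The one thing that differs from XPath is how you pull the content out of the matched node, so pay attention to that part.
This post shows how to use CSS selectors to extract the data of a single article.
The extracted fields are the same as in the XPath post.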
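In the XPath post the text and attributes were picked out by the XPath expression itself. With CSS selectors, the text of an element is selected with the ::text pseudo-element, for example .entry-header h1::text, and an attribute is selected with ::attr(), for example .entry-header a::attr(href).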
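A commonly used function here is extract_first().
It is equivalent to extract()[0], but extract()[0] raises an IndexError when the list is empty, that is, when nothing was matched. extract_first() returns None in that case instead. You can also pass the value to return when nothing is matched, for example an empty string: extract_first("").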
import re

title = response.css(".entry-header h1::text").extract_first()
# The leading "p" in the selector is optional
create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace('·', '').strip()
# Number of upvotes (this id looks specific to this one article)
praise_nums = response.css('#110287votetotal::text').extract()[0]
# Number of bookmarks: keep only the digits
fav_nums = response.css('.btn-bluet-bigger.href-style.bookmark-btn .register-user-only::text').extract()[0].strip()
match_re = re.match(r'.*?(\d+).*', fav_nums)
if match_re:
    fav_nums = match_re.group(1)
# Number of comments: keep only the digits
comment_nums = response.css('.btn-bluet-bigger.href-style.hide-on-480::text').extract()[0].strip()
match_re = re.match(r'.*?(\d+).*', comment_nums)
if match_re:
    comment_nums = match_re.group(1)
# Article body (HTML)
content = response.css('div.entry').extract()[0]
# Tags: drop the "N 评论" (comment count) entry that sits among the tag links
tag_list = response.css('.entry-meta-hide-on-mobile a::text').extract()
tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
tag = ','.join(tag_list)
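The digit extraction and the tag filtering above are plain Python and can be checked without Scrapy. Here is a small sketch using only the standard library; the helper name extract_int and the sample strings are assumptions made for illustration, not part of the original spider.

```python
import re


def extract_int(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    # Same pattern as in the snippet above: skip lazily to the first digits.
    match = re.match(r".*?(\d+).*", text)
    return int(match.group(1)) if match else default


# Hypothetical strings, shaped like the text the selectors return.
fav_nums = extract_int(" 2 收藏")
comment_nums = extract_int("12 评论")
no_number = extract_int(" 收藏")

# Drop the "N 评论" entry that is mixed in with the tag links.
raw_tags = ["职场", " 2 评论 ", "面试"]
tag_list = [t for t in raw_tags if not t.strip().endswith("评论")]
tags = ",".join(tag_list)
```

Wrapping the regex in a helper with a default avoids leaving the raw string in the variable when no number is present.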
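When the element we want to select has more than one class name (here, post and floated-thumb), the selector should be written like this: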
post_urls = response.css('#archive .post.floated-thumb .post-thumb a::attr(href)').extract()
In other words, .post.floated-thumb must be written joined together with no space in between, or you can write just .floated-thumb.
Complete code (reference version)
import re

def parse_detail(self, response):
    # Title and publication date
    title = response.css(".entry-header h1::text").extract_first()
    create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
    # Upvotes
    praise_nums = response.css(".vote-post-up h10::text").extract()[0]
    # Bookmarks: keep only the digits, default to 0
    fav_nums = response.css(".bookmark-btn::text").extract()[0]
    match_re = re.match(r".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0
    # Comments: keep only the digits, default to 0
    comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
    match_re = re.match(r".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0
    # Article body (HTML)
    content = response.css("div.entry").extract()[0]
    # Tags, without the "N 评论" (comment count) entry
    tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)
    pass